News Article · Jun 27, 2026 at 7:40 AM
AI Models Now Code for 19 Days Straight, But New Benchmarks Reveal Persistent Gaps

MirrorCode, a new benchmark from Epoch AI and METR, shows that AI models can now work on a single programming task for 19 days straight, with Claude Opus 4.7 leading at a 56 percent solve rate. Meanwhile, new tools such as Weave's model router and Vercel's Eve framework aim to cut costs and simplify agent deployment.


Epoch AI and METR have released MirrorCode, a benchmark that tests whether AI models can recreate complete programs from scratch without seeing the original source code. Claude Opus 4.7 leads with a 56 percent solve rate, but every model tested still fails on the most complex tasks.

One MirrorCode task cost $2,600 to run and kept an AI model working continuously for 19 days with no human involvement. The benchmark includes 25 target programs spanning Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each solution must exactly reproduce the original program's output, and it is also checked against hidden end-to-end tests that the model never sees during development.
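The exact-output requirement can be illustrated with a small sketch. This is not MirrorCode's actual harness: the `Program` type, the `evaluate` function, and the toy programs are hypothetical names chosen for illustration, and programs are modelled as plain functions from input text to output text rather than as real executables.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for an executable: maps stdin text to stdout text.
Program = Callable[[str], str]


def evaluate(reference: Program, candidate: Program,
             inputs: Sequence[str]) -> tuple[float, bool]:
    """Return (fraction of inputs with identical output, whether all matched)."""
    if not inputs:
        raise ValueError("need at least one test input")
    passed = 0
    for text in inputs:
        expected = reference(text)
        try:
            ok = candidate(text) == expected  # exact match, no fuzzy comparison
        except Exception:
            ok = False  # a crash counts as a failed test
        passed += ok
    return passed / len(inputs), passed == len(inputs)


def reference_upper(text: str) -> str:
    """Toy 'original program': upper-cases its input."""
    return text.upper()


def candidate_upper(text: str) -> str:
    """Toy reimplementation with a deliberate bug: drops trailing newlines."""
    return text.upper().rstrip("\n")


# The candidate never sees these inputs while being written.
hidden_inputs = ["abc", "Hello World", "x\n", ""]
pass_rate, solved = evaluate(reference_upper, candidate_upper, hidden_inputs)
print(f"pass rate: {pass_rate:.0%}, solved: {solved}")
```

In this toy run the candidate matches on three of four inputs yet does not count as solved, which is the same gap between a high test pass rate and a full reimplementation that the benchmark reports for medium-sized tasks.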

Claude Opus 4.7 Rebuilds Bioinformatics Toolkit in 14 Hours

Claude Opus 4.7 reimplemented gotree, a bioinformatics toolkit with roughly 16,000 lines of Go code and over 40 commands, in 14 hours at a cost of $251. A human engineer working without AI help would need 2 to 17 weeks for the same job, according to the researchers. In the overall rankings, GPT-5.5 followed at 44 percent and Gemini 3.1 Pro Preview at 32 percent.

  • Small programs like uuid or parseqsv are reliably reimplemented by all tested models.
  • Medium tasks show wide variance, with leading models passing 90 percent or more of tests even when they fail to fully reimplement the program.
  • Large tasks beat every model tested so far.
  • Leading models from a year ago would have scored only about 30 percent and been limited to simpler programs like a calendar utility.
  • Epoch AI has open-sourced the scaffold and 22 of the 25 target programs, covering 132 task instances across six programming languages.
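For a rough sense of scale, the cost and duration figures reported above can be turned into hourly rates. The sketch below uses only numbers from the article, plus one labeled assumption: a 40-hour work week for the human estimate, which the researchers may not have used. The two tasks are different programs, so the rates are illustrative rather than directly comparable.

```python
# Figures reported in the article.
GOTREE_COST_USD = 251
GOTREE_HOURS = 14
LONG_TASK_COST_USD = 2600
LONG_TASK_DAYS = 19
HUMAN_WEEKS = (2, 17)  # estimated range for an engineer without AI help

# Assumption (not from the article): a 40-hour human work week.
HOURS_PER_WORK_WEEK = 40


def usd_per_hour(cost_usd: float, hours: float) -> float:
    """Average spend per wall-clock hour."""
    return cost_usd / hours


gotree_rate = usd_per_hour(GOTREE_COST_USD, GOTREE_HOURS)
long_task_rate = usd_per_hour(LONG_TASK_COST_USD, LONG_TASK_DAYS * 24)

# Human working hours under the 40-hour-week assumption, and the resulting
# ratio to the model's 14 wall-clock hours on gotree.
human_hours = tuple(weeks * HOURS_PER_WORK_WEEK for weeks in HUMAN_WEEKS)
speedup = tuple(hours / GOTREE_HOURS for hours in human_hours)

print(f"gotree run: ${gotree_rate:.2f} per hour")
print(f"19-day run: ${long_task_rate:.2f} per hour")
print(f"human estimate: {human_hours[0]} to {human_hours[1]} working hours")
print(f"ratio to model time: {speedup[0]:.1f}x to {speedup[1]:.1f}x")
```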

Cost Pressures Drive New Routing and Agent Frameworks

As AI coding costs climb, developers are building tools to manage expenses. Weave, a company that writes nearly all its code with AI, released an open-source model router that plugs into coding agents like Claude Code, Codex, and Cursor. The router sends each request to the model it judges best suited to the task, a response to cost spikes caused by tokenizer changes in Opus 4.7. Meanwhile, Vercel introduced Eve, an open-source framework for building, deploying, and operating AI agents in production. Eve uses a filesystem-based project structure to organize agent instructions, tools, skills, subagents, communication channels, and scheduled tasks, which reduces the amount of supporting infrastructure developers need to build themselves.
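The general idea behind model routing can be sketched in a few lines. This is not Weave's implementation, whose routing logic the article does not describe: the model names, capability tiers, prices, and keyword heuristic below are all hypothetical, chosen only to show the pattern of picking the cheapest model that is judged capable enough for a request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str            # hypothetical model name
    capability: int      # hypothetical tier: 1 (basic) to 3 (strongest)
    usd_per_mtok: float  # hypothetical price per million tokens


MODELS = [
    Model("small-model", 1, 0.5),
    Model("mid-model", 2, 3.0),
    Model("large-model", 3, 15.0),
]

HARD_KEYWORDS = ("architecture", "refactor", "debug")


def required_capability(prompt: str) -> int:
    """Crude illustrative heuristic for how strong a model the prompt needs."""
    text = prompt.lower()
    if any(word in text for word in HARD_KEYWORDS):
        return 3
    if len(prompt) > 200:
        return 2
    return 1


def route(prompt: str, models: list[Model] = MODELS) -> Model:
    """Pick the cheapest model whose tier meets the prompt's estimated need."""
    need = required_capability(prompt)
    eligible = [m for m in models if m.capability >= need]
    if not eligible:
        raise ValueError("no model is capable enough for this prompt")
    return min(eligible, key=lambda m: m.usd_per_mtok)


print(route("rename this variable").name)
print(route("refactor the auth module").name)
```

A production router would presumably replace the keyword heuristic with something learned or measured, but the cost-saving mechanism is the same: simple requests never reach the most expensive model.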

HashiCorp also released a Terraform MCP server that helps AI agents make better infrastructure decisions using trusted organizational context and guardrails. The combination of longer-running autonomous coding tasks and new cost-management tools suggests the industry is moving toward more practical, production-ready AI development workflows. Epoch AI notes that although the MirrorCode results were not dominated by memorization, it cannot rule out that memorization contributes to the models' performance, a caveat that will shape future benchmark design.

