AI Models Now Code for 19 Days Straight, But New Benchmarks Reveal Persistent Gaps
Epoch AI's MirrorCode benchmark reveals AI models can now program for 19 days straight, with Claude Opus 4.7 leading at 56% solve rate. Meanwhile, new tools like Weave's model router and Vercel's Eve framework aim to cut costs and simplify agent deployment.
Epoch AI and METR have released MirrorCode, a benchmark that tests whether AI models can recreate complete programs from scratch without seeing the original source code. Claude Opus 4.7 leads with a 56 percent solve rate, but every model tested still fails on the most complex tasks.
One MirrorCode task cost $2,600 to run and kept an AI model working continuously for 19 days with no human involvement. The benchmark includes 25 target programs spanning Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each solution must exactly reproduce the original program's output, including on hidden end-to-end tests that the model never sees during development.
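To make the exact-match requirement concrete, here is a minimal sketch of how an output-equivalence check could work. It is not MirrorCode's actual scaffold: the names `Case`, `run`, `pass_rate`, and `is_solved` are hypothetical, and the assumption that a task counts as solved only when every test matches is an illustration, not a detail confirmed by the researchers.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class Case:
    """One end-to-end test: arguments and stdin given to both programs."""
    args: list[str] = field(default_factory=list)
    stdin: str = ""


def run(cmd: list[str], case: Case) -> tuple[int, str]:
    """Run a program on one case and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd + case.args,
        input=case.stdin,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout


def pass_rate(reference: list[str], candidate: list[str], cases: list[Case]) -> float:
    """Fraction of cases where the candidate matches the reference exactly."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = sum(run(reference, c) == run(candidate, c) for c in cases)
    return passed / len(cases)


def is_solved(reference: list[str], candidate: list[str], cases: list[Case]) -> bool:
    """Assumed rule: a task is solved only if every case matches."""
    return pass_rate(reference, candidate, cases) == 1.0
```

Under this assumed rule, a reimplementation that matches on 90 percent of cases still counts as unsolved, which is consistent with the gap the article describes between passing most tests and fully reimplementing a program.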
Claude Opus 4.7 Rebuilds Bioinformatics Toolkit in 14 Hours
Claude Opus 4.7 reimplemented gotree, a bioinformatics toolkit with roughly 16,000 lines of Go code and over 40 commands, in 14 hours at a cost of $251. A human engineer working without AI help would need 2 to 17 weeks for the same job, according to the researchers. In the overall rankings, GPT-5.5 followed at 44 percent and Gemini 3.1 Pro Preview at 32 percent.
- Small programs like uuid or parseqsv are reliably reimplemented by all tested models.
- Medium tasks show wide variance, with leading models passing 90 percent or more of tests even when they fail to fully reimplement the program.
- Large tasks beat every model tested so far.
- Leading models from a year ago would have scored only about 30 percent and been limited to simpler programs like a calendar utility.
- Epoch AI has open-sourced the scaffold and 22 of the 25 target programs, covering 132 task instances across six programming languages.
Cost Pressures Drive New Routing and Agent Frameworks
As AI coding costs climb, developers are building tools to manage expenses. Weave, a company that writes nearly all its code with AI, released an open-source model router that plugs into coding agents like Claude Code, Codex, and Cursor. The router sends each request to the model best suited to the task, a response to cost spikes caused by tokenizer changes in Opus 4.7. Meanwhile, Vercel introduced Eve, an open-source framework for building, deploying, and operating AI agents in production. Eve uses a filesystem-based project structure to organize agent instructions, tools, skills, subagents, communication channels, and scheduled tasks, which reduces the amount of supporting infrastructure developers need to build themselves.
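The routing idea behind a tool like Weave's can be illustrated with a small sketch. This is not Weave's implementation or API: the `Model` fields, the capability scale, the prices, and the keyword heuristic in `estimate_difficulty` are all made-up assumptions, and "cheapest model that is capable enough" is only one plausible reading of "best model for each task".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str             # hypothetical label, not a real model ID
    capability: int       # higher handles harder requests (made-up scale)
    cost_per_mtok: float  # placeholder price per million tokens


def estimate_difficulty(prompt: str) -> int:
    """Crude stand-in heuristic: long or design-heavy prompts score higher."""
    score = 1
    if len(prompt) > 2000:
        score += 1
    if any(word in prompt.lower() for word in ("refactor", "architecture", "debug")):
        score += 1
    return score


def route(prompt: str, models: list[Model]) -> Model:
    """Pick the cheapest model whose capability covers the estimated difficulty."""
    need = estimate_difficulty(prompt)
    capable = [m for m in models if m.capability >= need]
    if not capable:
        # Nothing meets the bar, so fall back to the most capable model.
        return max(models, key=lambda m: m.capability)
    return min(capable, key=lambda m: m.cost_per_mtok)
```

With three placeholder tiers, a short rename request would go to the cheapest tier and a long debugging prompt to the most capable one. A production router would likely use a learned classifier or past outcomes rather than keywords.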
HashiCorp also released a Terraform MCP server that helps AI agents make better infrastructure decisions using trusted organizational context and guardrails. The combination of longer-running autonomous coding tasks and new cost-management tools suggests the industry is moving toward more practical, production-ready AI development workflows.
Epoch AI notes that while MirrorCode results were not dominated by memorization, it cannot rule out that memorization contributes to model performance, a caveat that will shape future benchmark design.
Fact check
- Claude Opus 4.7 leads MirrorCode with a 56 percent solve rate. (reported)
- One MirrorCode task cost $2,600 to run and kept an AI model working continuously for 19 days. (reported)
- Weave released an open-source model router that plugs into Claude Code, Codex, and Cursor. (reported)
- Vercel introduced Eve, an open-source framework for building AI agents. (reported)
- HashiCorp released a Terraform MCP server for AI infrastructure decisions. (reported)
Source reporting (4)
- The Decoder · An AI model programmed nonstop for 19 days on a single MirrorCode task that cost $2,600 to run
- Hacker News Front Page · Show HN: Smart model routing directly in Claude, Codex and Cursor
- InfoQ · Vercel Introduces Eve, an Open-Source Framework for Building AI Agents
- HashiCorp Blog · Terraform MCP server: Four real-world AI infrastructure patterns