News Article · Jun 27, 2026 at 7:40 AM


Industry #model routing #AI #benchmarks #coding #MirrorCode #Claude Opus 4.7 #Vercel Eve #Terraform MCP

AI Models Now Code for 19 Days Straight, But New Benchmarks Reveal Persistent Gaps

Epoch AI's MirrorCode benchmark reveals AI models can now program for 19 days straight, with Claude Opus 4.7 leading at 56% solve rate. Meanwhile, new tools like Weave's model router and Vercel's Eve framework aim to cut costs and simplify agent deployment.


Epoch AI and METR have released MirrorCode, a benchmark that tests whether AI models can recreate complete programs from scratch without seeing the original source code. Claude Opus 4.7 leads with a 56 percent solve rate, but every model tested still fails on the most complex tasks.

One MirrorCode task cost $2,600 to run and kept an AI model working continuously for 19 days with no human involvement. The benchmark includes 25 target programs spanning Unix utilities, data serialization, bioinformatics, interpreters, static analysis, cryptography, and compression. Each solution must exactly reproduce the original program's output, including hidden end-to-end tests the model never sees during development.
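The exact-output requirement can be pictured as a comparison between the original program and the reimplementation on the same input. The sketch below is illustrative only and is not MirrorCode's actual harness: the function names are hypothetical, and comparing the exit code alongside standard output is an assumption rather than something the article states.

```python
import subprocess
import sys
from typing import Sequence


def run_program(cmd: Sequence[str], stdin_text: str = "") -> tuple[int, str]:
    """Run a command and return (exit code, stdout)."""
    result = subprocess.run(
        list(cmd),
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode, result.stdout


def outputs_match(
    reference_cmd: Sequence[str],
    candidate_cmd: Sequence[str],
    stdin_text: str = "",
) -> bool:
    """True only if both programs behave identically on this input.

    Assumption: "identical" here means same exit code and same stdout.
    """
    return run_program(reference_cmd, stdin_text) == run_program(
        candidate_cmd, stdin_text
    )


if __name__ == "__main__":
    # Hypothetical stand-ins for an original program and two rewrites.
    ref = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
    good = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
    bad = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]
    print(outputs_match(ref, good, "abc\n"))
    print(outputs_match(ref, bad, "abc\n"))
```

Under a check like this, a single differing byte counts as a failure, which is why hidden end-to-end tests are a demanding bar.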

Claude Opus 4.7 Rebuilds Bioinformatics Toolkit in 14 Hours

Claude Opus 4.7 reimplemented gotree, a bioinformatics toolkit with roughly 16,000 lines of Go code and over 40 commands, in 14 hours at a cost of $251. A human engineer working without AI help would need 2 to 17 weeks for the same job, according to the researchers. In the overall rankings, GPT-5.5 followed at 44 percent and Gemini 3.1 Pro Preview at 32 percent.

Small programs like uuid or parseqsv are reliably reimplemented by all tested models.
Medium tasks show wide variance, with leading models passing 90 percent or more of tests even when they fail to fully reimplement the program.
Large tasks beat every model tested so far.
Leading models from a year ago would have scored only about 30 percent and been limited to simpler programs like a calendar utility.
Epoch AI has open-sourced the scaffold and 22 of the 25 target programs, covering 132 task instances across six programming languages.
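The gap between passing 90 percent of tests and actually solving a task is easier to see with two separate metrics. The sketch below assumes a task counts as solved only when every test passes, which matches the exact-reproduction requirement but is not a definition taken from Epoch AI. The task names and test counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    name: str
    tests_passed: int
    tests_total: int

    @property
    def pass_rate(self) -> float:
        """Fraction of individual tests that passed."""
        return self.tests_passed / self.tests_total

    @property
    def solved(self) -> bool:
        """Assumption: solved means every single test passed."""
        return self.tests_passed == self.tests_total


def solve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks fully solved."""
    return sum(r.solved for r in results) / len(results)


def mean_pass_rate(results: list[TaskResult]) -> float:
    """Average per-test pass rate across tasks."""
    return sum(r.pass_rate for r in results) / len(results)


# Hypothetical results, not MirrorCode data.
results = [
    TaskResult("small-task", 50, 50),
    TaskResult("medium-task", 93, 100),
    TaskResult("large-task", 10, 100),
]

print(f"solve rate: {solve_rate(results):.2f}")
print(f"mean pass rate: {mean_pass_rate(results):.2f}")
```

In this made-up example the medium task passes 93 percent of its tests yet contributes nothing to the solve rate, so the two numbers can diverge sharply.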

Cost Pressures Drive New Routing and Agent Frameworks

As AI coding costs climb, developers are building tools to manage expenses. Weave, a company that writes nearly all its code with AI, released an open-source model router that plugs into coding agents like Claude Code, Codex, and Cursor. The router sends each request to the model best suited to the task, a response to cost spikes caused by tokenizer changes in Opus 4.7. Meanwhile, Vercel introduced Eve, an open-source framework for building, deploying, and operating AI agents in production. Eve uses a filesystem-based project structure to organize agent instructions, tools, skills, subagents, communication channels, and scheduled tasks, reducing the amount of supporting infrastructure developers need to implement.
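To make the routing idea concrete, here is a minimal sketch of cost-aware model selection. It is not Weave's implementation: the model names, prices, capability tiers, and keyword heuristics are all hypothetical, and a real router would likely use richer signals than keywords.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float  # hypothetical price in USD
    capability: int            # hypothetical tier: 1 (weakest) to 3 (strongest)


# Placeholder catalog; none of these are real products or prices.
MODELS = [
    ModelOption("small-model", 0.001, 1),
    ModelOption("medium-model", 0.005, 2),
    ModelOption("large-model", 0.030, 3),
]

HARD_HINTS = ("refactor", "architecture", "concurrency", "migrate")
EASY_HINTS = ("rename", "typo", "format", "docstring")


def required_capability(prompt: str) -> int:
    """Guess how strong a model the request needs from simple keywords."""
    text = prompt.lower()
    if any(hint in text for hint in HARD_HINTS):
        return 3
    if any(hint in text for hint in EASY_HINTS):
        return 1
    return 2


def route(prompt: str, models: list[ModelOption] = MODELS) -> ModelOption:
    """Pick the cheapest model whose tier meets the estimated need."""
    need = required_capability(prompt)
    eligible = [m for m in models if m.capability >= need]
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


for request in ("Fix typo in README", "Refactor the auth module"):
    print(request, "->", route(request).name)
```

The design choice worth noting is the order of operations: estimate the minimum capability first, then minimize cost among the models that meet it, so easy requests never reach the most expensive model.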

HashiCorp also released a Terraform MCP server that helps AI agents make better infrastructure decisions using trusted organizational context and guardrails. The combination of longer-running autonomous coding tasks and new cost-management tools suggests the industry is moving toward more practical, production-ready AI development workflows. Epoch AI notes that although MirrorCode results were not dominated by memorization, it cannot rule out that memorization contributes to model performance, a caveat that will shape future benchmark design.

Fact check

Claude Opus 4.7 leads MirrorCode with a 56 percent solve rate. (reported)

One MirrorCode task cost $2,600 to run and kept an AI model working continuously for 19 days. (reported)

Weave released an open-source model router that plugs into Claude Code, Codex, and Cursor. (reported)

Vercel introduced Eve, an open-source framework for building AI agents. (reported)

HashiCorp released a Terraform MCP server for AI infrastructure decisions. (reported)

