OpenAI's Best Life Science AI Model Fails 63.9% of Benchmark, Highlighting Limits of Current Systems

OpenAI's top GPT-Rosalind model passed only 36.1% of a new 750-task life science benchmark. The result underscores persistent AI weaknesses and aligns with broader industry concerns about AI reliability.


This week, OpenAI released a new benchmark called LifeSciBench, a 750-task test designed to measure whether AI systems can support realistic life science research. The top-performing model, GPT-Rosalind, achieved a pass rate of just 36.1 percent, failing nearly two-thirds of the tasks.

The test goes beyond simple biology questions. It requires models to work with supporting documents, figures, and complex datasets. GPT-Rosalind passed 45.1 percent of text-only tasks but only 28.1 percent of tasks that involved artifacts or URLs.

Performance Plummets with Complex Inputs

The benchmark revealed a familiar weakness: AI systems perform better when everything is presented as text. Once models are forced to handle external data, performance falls off. OpenAI notes that models are becoming capable of scientific communication and evidence synthesis, but they remain far from autonomous scientists.

GPT-Rosalind led all tested models but still failed 63.9 percent of benchmark tasks.
Text-only tasks: 45.1 percent pass rate.
Tasks requiring artifacts or URLs: 28.1 percent pass rate.
LifeSciBench includes 750 tasks measuring realistic research support, not just fact recall.
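
The figures above can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes every task falls into exactly one of the two reported categories (text-only, or requiring artifacts or URLs) and that the overall pass rate is a task-weighted average of the two. The article does not state the actual split, so the derived task counts are estimates, not reported numbers. The function names are hypothetical.

```python
# Figures reported in the article.
TOTAL_TASKS = 750
OVERALL_PASS = 36.1     # percent, all tasks
TEXT_ONLY_PASS = 45.1   # percent, text-only tasks
ARTIFACT_PASS = 28.1    # percent, tasks requiring artifacts or URLs


def failure_rate(pass_rate):
    """Failure rate in percent, rounded to one decimal place."""
    return round(100.0 - pass_rate, 1)


def implied_text_only_share(overall, text_only, artifact):
    """Solve overall = w * text_only + (1 - w) * artifact for w.

    ASSUMPTION: the two categories partition the benchmark and the
    overall rate is a task-weighted average. Not stated in the article.
    """
    return (overall - artifact) / (text_only - artifact)


share = implied_text_only_share(OVERALL_PASS, TEXT_ONLY_PASS, ARTIFACT_PASS)
text_only_tasks = round(share * TOTAL_TASKS)
artifact_tasks = TOTAL_TASKS - text_only_tasks

print(f"Overall failure rate: {failure_rate(OVERALL_PASS)} percent")
print(f"Implied text-only share: {share:.1%}")
print(f"Estimated split: {text_only_tasks} text-only, {artifact_tasks} with artifacts or URLs")
```

Under those assumptions, a little under half of the benchmark would be text-only. The lower artifact-task pass rate would then apply to the larger share of tasks and account for much of the gap between the headline figure and the text-only figure.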

Industry Warnings on AI Reliability

The findings arrive amid broader unease about AI's reliability. Harvard Business Review recently warned that AI-generated low-quality output, dubbed "workslop," is degrading the information companies rely on for decisions. A Microsoft executive acknowledged that humans are struggling to keep up with AI advancement, suggesting a narrowing window to understand AI before it becomes too complex to control.

Yann LeCun, former chief AI scientist at Meta, criticized the industry's trajectory, calling Elon Musk's xAI "kind of a failure" and suggesting that the entire AI sector may need a reset. The criticism points to a common theme: current AI models, including the most advanced, still suffer from fundamental limitations in reasoning and handling real-world complexity.

OpenAI's LifeSciBench is not intended to prove AI useless in research. The company argues that models can already assist scientists overwhelmed by information. But the benchmark serves as a reality check. Autonomous scientific discovery remains out of reach, and the gap between promise and performance is still wide.

What comes next is uncertain. Startups like 2Brains Inc., co-founded by tech pundit Robert Cringely, are attempting to solve LLM hallucinations through patented architectures. But until benchmarks like LifeSciBench show significant improvement, the industry's confidence in AI as a reliable research tool will remain limited.
