Anthropic's pace this year remains relentless, with new releases landing almost every other day. The highly anticipated Claude Opus 4.7 has just been officially released. Interestingly, Anthropic admitted outright in the announcement: "This is not our most powerful model." The rumored, stronger Claude Mythos Preview remains on standby. Even so, Opus 4.7 has drawn considerable attention, because it aims to be "more reliable" rather than "smarter."

In terms of benchmark scores, the results are impressive. On the hard-core programming benchmark SWE-bench Pro, 4.7 jumped from the previous version's 53.4% to 64.3%, a gain of nearly 11 percentage points, leaving GPT-5.4 (57.7%) and Gemini 3.1 Pro (54.2%) behind. On the visual reasoning benchmark CharXiv it rose from 69.1% to 82.1%, thanks to newly added support for recognizing images up to 2576 pixels on the long side, more than three times the clarity of the previous version. On the tool-call evaluation MCP-Atlas it scored 77.3%, and on the legal AI platform Harvey's BigLaw benchmark it reached 90.9%. However, on the agentic search evaluation BrowseComp, 4.7 slipped from 83.7% to 79.3%, overtaken by GPT-5.4 and Gemini. That dip stems precisely from its refusal to make up answers: when information is missing, it prefers to report an error rather than guess.
Beyond the numbers, the change in temperament is even more noteworthy. A lead at Replit said after testing: "It will challenge me in technical discussions, help me make better decisions, and genuinely acts like a better colleague." The data science platform Hex likewise found that when data is missing, 4.7 reports the error directly instead of supplying a plausible-looking but entirely wrong substitute value, as earlier versions did. Task resilience has also improved markedly: the Notion team's tests show the tool error rate dropping to one-third of its previous level, and when a tool chain crashes, the model can route around the obstacle and finish the task on its own. Vercel even observed a new behavior: before writing system-level code, 4.7 first works through mathematical proofs on its own.

Of course, the added strength comes at a cost. 4.7 introduces a new tokenizer that produces roughly 1 to 1.35 times as many tokens for the same text, and the model tends to think longer on complex tasks, so actual consumption is almost certain to rise. To offset this, Anthropic added an xhigh ultra-high thinking intensity level, which Claude Code now sets as the default for all plans. It also launched the Deep Review instruction / ultrareview, an Auto Mode extension for Max users, and a public beta of a "task budget" feature to help developers manage token spend.
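To see how a tokenizer change like this flows through to spend, a back-of-envelope calculation helps. The sketch below is purely illustrative: the function name, token volume, and per-million-token price are invented for the example and are not Anthropic's actual figures; only the 1x to 1.35x expansion range comes from the announcement.

```python
# Back-of-envelope estimate of how a 1x-1.35x token expansion affects spend.
# All prices and volumes here are hypothetical, chosen only for illustration.

def estimated_monthly_cost(tokens_per_month: int,
                           price_per_mtok: float,
                           expansion: float = 1.0) -> float:
    """Dollar cost for a month of usage at a given token-expansion factor."""
    return tokens_per_month * expansion * price_per_mtok / 1_000_000

# 50M tokens/month at a hypothetical $15 per million tokens:
baseline = estimated_monthly_cost(50_000_000, 15.0)          # old tokenizer
worst_case = estimated_monthly_cost(50_000_000, 15.0, 1.35)  # full 1.35x expansion

print(f"baseline:   ${baseline:,.2f}")    # $750.00
print(f"worst case: ${worst_case:,.2f}")  # $1,012.50
```

At the top of the expansion range, the same workload costs 35% more before any extra "thinking" tokens are counted, which is presumably why a budgeting feature shipped alongside the model.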
The stronger Mythos Preview was recently opened to enterprises under the name "Project Glasswing" for cybersecurity research, but given its sheer capability and still-incomplete safety evaluations, it has not been publicly released.
Today's 4.7 is the latest anchor point in Anthropic's high-frequency delivery cadence. Mythos will arrive eventually, and when it does, the 4.7 that looks so strong today may turn out to be just the beginning.
