Alongside the pursuit of ever-higher "IQ" in large models, the ability to work continuously on long tasks has become a new dimension for measuring AI's progress. According to the latest benchmark released by the AI research organization METR, Anthropic's top model, Claude Opus 4.5, has demonstrated dominant strength on long-duration tasks.


The results show that Claude Opus 4.5 can handle complex tasks for about 4 hours and 49 minutes while maintaining a 50% success rate, setting a new industry record. The "time horizon" metric reveals a model's endurance limits at different reliability thresholds: at a stricter 80% success-rate threshold, Opus 4.5's horizon shrinks to only about 27 minutes, but once tasks enter the deep waters of high difficulty and long duration, its advantage over other models is greatly amplified.
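To make the metric concrete: a time horizon at a given success rate is typically estimated by fitting a curve of success probability against task length and reading off where it crosses that rate. The sketch below, using entirely made-up task results (not METR's data), fits a simple logistic model of success versus log task length and derives hypothetical 50% and 80% horizons:

```python
import math

# Hypothetical (length in minutes, success 0/1) task results.
# Illustrative only -- not METR's actual benchmark data.
results = [
    (1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (15, 0),
    (30, 1), (30, 1), (60, 1), (60, 0), (120, 1), (120, 0),
    (240, 0), (240, 1), (480, 0), (480, 0), (960, 0),
]

def fit_logistic(data, steps=20000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient ascent
    on the log-likelihood. Longer tasks should yield b < 0."""
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for minutes, y in data:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / len(data)
        b += lr * grad_b / len(data)
    return a, b

def horizon(a, b, success_rate):
    """Task length (minutes) at which the fitted success
    probability equals success_rate."""
    logit = math.log(success_rate / (1 - success_rate))
    return 2 ** ((logit - a) / b)

a, b = fit_logistic(results)
print(f"50% time horizon: {horizon(a, b, 0.5):.0f} min")
print(f"80% time horizon: {horizon(a, b, 0.8):.0f} min")
```

As in the reported figures, the 80% horizon comes out shorter than the 50% horizon: demanding higher reliability always shrinks the length of task a model can be trusted with.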

AIbase noted that although one data point suggested the model could theoretically work continuously for more than 20 hours, METR acknowledged this may be an artifact of the small sample size. Even so, the breakthrough marks AI's transition from "short instruction responder" to "long-term project executor."

However, some experts have questioned the test's limitations: METR's evaluation covers only 14 task samples, and some argue that models may be tuned specifically to score well on such benchmarks. Still, there is no doubt that Claude Opus 4.5 opens new possibilities for AGI-oriented tasks that demand sustained, high-intensity logical reasoning.