A recent joint study by Carnegie Mellon University and Stanford University finds that the development of artificial intelligence agents (AI agents) faces a serious "path dependence": existing AI evaluation benchmarks are heavily concentrated on programming tasks, while neglecting the non-programming fields that account for 92% of the U.S. labor market.
Researchers systematically analyzed 72,000 tasks from 43 mainstream AI benchmarks and compared them against 1,016 real occupations in the U.S. government's O*NET occupational database.
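The core of such a comparison can be sketched in a few lines: label each benchmark task with a skill category, then compare each category's share of benchmark tasks with its share of real occupations. The categories, counts, and threshold below are illustrative stand-ins, not the study's actual data or method.

```python
from collections import Counter

# Toy benchmark tasks labeled by skill category
# (an illustrative stand-in for the 72,000 analyzed tasks)
benchmark_tasks = (
    ["computer operation"] * 60
    + ["information retrieval"] * 30
    + ["interpersonal interaction"] * 1
    + ["management"] * 2
)

# Toy occupation counts per dominant skill category
# (an illustrative stand-in for O*NET's 1,016 occupations)
occupations = {
    "computer operation": 40,
    "information retrieval": 10,
    "interpersonal interaction": 500,
    "management": 150,
}

def shares(counts):
    """Normalize raw counts into fractional shares."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

bench_share = shares(Counter(benchmark_tasks))
job_share = shares(occupations)

# A category is a "blind spot" when its benchmark share falls far
# below its share of real occupations.
for cat in job_share:
    gap = job_share[cat] - bench_share.get(cat, 0.0)
    flag = "  <- blind spot" if gap > 0.25 else ""
    print(f"{cat}: benchmark {bench_share.get(cat, 0.0):.1%}, "
          f"jobs {job_share[cat]:.1%}, gap {gap:+.1%}{flag}")
```

With these toy numbers, "interpersonal interaction" dominates the occupation side but is nearly absent from the benchmark side, mirroring the kind of mismatch the study reports.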
Key imbalances identified by the study:
"Benchmark blind spots" in the digital industry: Despite the high level of digitization in managerial jobs, which reaches 88%, they account for only 1.4% in existing AI benchmark tests; legal jobs have a digitization level of 70%, but their share in the benchmark tests is as low as 0.3%.
Serious skills mismatch: current AI evaluations focus mainly on "information retrieval" and "computer operation" skills, which cover less than 5% of U.S. jobs, while the "interpersonal interaction" skills that are crucial in real work are almost entirely absent from existing AI tests.
"Ability drop" caused by increasing complexity: The study found that AI agents perform very poorly when dealing with complex tasks. Even in their most skilled area, software development, the success rate of AI drops sharply when the number of steps increases or the logic becomes more complex.
Researchers call for future AI benchmarks to give more weight to high-value, highly digitized fields such as management, law, construction, and engineering. They also argue that evaluations should score not only final results but also the intermediate steps of execution, to address practical challenges such as vague goals and long verification cycles.
This conclusion is also supported by market data. A recent analysis by Anthropic showed that nearly 50% of its API calls are still concentrated in software development. Experts warn that if AI development keeps chasing programming tasks simply because they are easy to score automatically, it may miss the best window for AI to demonstrate productivity value across the broader economy.
