Computer scientists from the UK government's AI Safety Institute, together with experts from Stanford University, the University of California, Berkeley, and the University of Oxford, recently conducted an in-depth study of more than 440 benchmarks used to evaluate the safety and effectiveness of new artificial intelligence models. They found that nearly every test had weaknesses in at least one area that could "undermine the validity of the results," and that some test scores may be "irrelevant or misleading."

As major technology companies continue to launch new AI systems, public concern about AI safety and effectiveness is growing. With neither the United States nor the United Kingdom having enacted nationwide AI regulation, these benchmarks have become important tools for checking whether new AI systems are safe, align with human interests, and actually deliver their claimed capabilities in reasoning, mathematics, and coding.
The study's lead author, Andrew Bean of the Oxford Internet Institute, said: "Benchmarking supports almost all claims about AI progress, but the lack of a unified definition and reliable measurement makes it difficult to determine whether a model is truly improving or just appearing to improve." He noted that Google recently withdrew its newly launched AI model, Gemma, after it spread entirely fabricated accusations about U.S. senators.
This is not an isolated incident: Character.ai recently announced that it will bar teenagers from open-ended conversations with its AI chatbots, following controversies over teenage suicides. The study also found that only 16% of the benchmarks use uncertainty estimates or statistical tests to show how reliable their scores are, and that some benchmarks meant to evaluate AI characteristics define key concepts such as "harmlessness" so vaguely that the resulting tests are of limited use.
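To make the point about uncertainty estimation concrete, here is a minimal sketch (not taken from the study; the model names and item counts are hypothetical) of what the missing statistics could look like: it attaches a 95% Wilson confidence interval to a benchmark accuracy score, which makes it easier to judge whether a small gap between two models reflects a real difference or just noise.

```python
import math

def accuracy_confidence_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a benchmark accuracy estimate.

    Treats each benchmark item as an independent pass/fail trial;
    z = 1.96 gives an approximate 95% confidence interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2))
    return centre - margin, centre + margin

# Hypothetical scores: model_A answers 862/1000 items correctly, model_B 871/1000.
# The overlapping intervals show why a raw 0.9-point gap may not be meaningful.
for name, correct in [("model_A", 862), ("model_B", 871)]:
    low, high = accuracy_confidence_interval(correct, 1000)
    print(f"{name}: {correct / 1000:.3f} (95% CI {low:.3f}-{high:.3f})")
```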
Experts are calling for shared standards and best practices to improve the AI evaluation process and ensure that new systems are genuinely safe and effective.
Key points:
🌐 The study examined more than 440 AI benchmarks and found that almost all of them have flaws that affect the validity of their results.
🚨 Google withdrew its Gemma model after it spread fabricated accusations, highlighting the urgency of AI regulation.
📊 Only 16% of benchmarks use statistical tests; standards are lacking and AI evaluation methods urgently need improvement.
