In education, traditional standardized tests can tell us whether students have mastered calculus or can comprehend a text, but they struggle to measure whether students can resolve conflicts within a team, generate original ideas under pressure, or critically analyze an argument. These so-called "durable skills" (collaboration, creativity, and critical thinking) have long lacked effective, scalable measurement tools. Google researchers recently proposed a new approach called Vantage, which uses large language models (LLMs) to simulate realistic group interactions and score these skills accurately.


The research team found that the challenge of assessing durable skills lies in the tension between ecological validity and psychometric rigor: assessments need to take place in realistic contexts while remaining comparable and reproducible. Previous attempts, such as the collaborative problem-solving assessment in PISA 2015, relied on multiple-choice items and scripted interactions with simulated teammates; this controlled the variables but sacrificed realism. The Google team argues that LLMs can strike a balance between the two.

The core of Vantage is the "executing LLM" architecture, which uses a single LLM to generate the responses of all AI participants. Because one model coordinates the entire conversation, it can proactively steer the dialogue toward predefined educational standards. For conflict-resolution skills, for example, the executing LLM can deliberately introduce disagreements to test how the human participant responds. The study reports that, compared with uncoordinated independent agents, conversations driven by the executing LLM elicited significantly more evidence of the key target behaviors in two collaboration sub-skills.
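To make the contrast concrete, here is a minimal sketch, assuming a generic chat-completion client hidden behind a placeholder `call_llm` function; the prompt wording, rubric text, and function names are illustrative assumptions rather than the paper's actual implementation:

```python
# Hypothetical sketch (not the Vantage code): one "executing" LLM writes every
# AI teammate's next turn in a single coordinated call, steered by a rubric,
# while the baseline generates each teammate with its own uncoordinated call.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; returns the model's reply text."""
    raise NotImplementedError("plug in your LLM client here")

# Illustrative rubric text (an assumption, not the study's wording).
RUBRIC = (
    "Target sub-skill: conflict resolution. At least one AI teammate should "
    "raise a plausible disagreement so the human participant has a chance to "
    "negotiate and resolve it."
)

def executing_llm_turn(transcript: list[str], ai_names: list[str]) -> str:
    """Coordinated: the model sees the whole conversation plus the rubric and
    writes the next short message for *all* AI characters at once."""
    prompt = (
        f"Assessment goal:\n{RUBRIC}\n\n"
        "Conversation so far:\n" + "\n".join(transcript) + "\n\n"
        f"Write the next short message for each of: {', '.join(ai_names)}. "
        "Stay in character and steer the discussion toward the assessment goal."
    )
    return call_llm(prompt)

def independent_agents_turn(transcript: list[str], ai_names: list[str]) -> list[str]:
    """Baseline: each AI character replies via its own call, with no shared
    rubric-driven coordination between them."""
    return [
        call_llm(
            f"You are {name}, a teammate in a group task.\n"
            "Conversation so far:\n" + "\n".join(transcript) + "\n"
            "Reply with your next short message."
        )
        for name in ai_names
    ]
```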

The research team recruited 188 participants aged 18 to 25 and collected 373 conversation transcripts from 30-minute collaborative tasks with AI characters. The transcripts were scored by two human raters from New York University and by an AI evaluation tool, and the AI scores showed good agreement with the expert human scores. In creativity and critical thinking as well, the executing LLM outperformed independent agents, offering new directions for future educational assessment.
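On the scoring side, here is a hedged sketch of what rubric-based AI grading and a simple agreement check could look like; the 1-to-5 scale, the rubric prompt, and the Pearson-correlation agreement metric are assumptions for illustration, not the study's published protocol:

```python
# Hypothetical sketch (not the study's protocol): an LLM rates a finished
# transcript against a behavioral rubric, and its scores are compared with
# the average of the two human raters' scores.

import statistics

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; returns the model's reply text."""
    raise NotImplementedError("plug in your LLM client here")

def ai_score(transcript: str, rubric: str) -> int:
    """Ask the model for a single integer rating on an assumed 1-5 scale."""
    reply = call_llm(
        f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}\n\n"
        "Rate the human participant from 1 (no evidence of the skill) to "
        "5 (strong evidence). Answer with the number only."
    )
    return int(reply.strip())

def agreement(ai_scores: list[int], human_scores: list[float]) -> float:
    """Pearson correlation as a simple stand-in for an agreement statistic."""
    return statistics.correlation(ai_scores, human_scores)
```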

Key points:

📊 The Vantage method uses large language models to simulate realistic team interactions and accurately score durable skills.

🤖 The executing LLM architecture coordinates multiple AI characters, proactively guiding conversations to improve the assessment of key behaviors.

🎓 The study shows that AI scoring is consistent with expert human scoring, opening up new possibilities for educational assessment.