In education, traditional standardized tests can tell us whether students have mastered calculus or can comprehend a text, but they struggle to measure whether students can resolve conflicts within a team, generate original ideas under pressure, or critically analyze an argument. These so-called "durable skills" (collaboration, creativity, and critical thinking) have long lacked effective, scalable measurement tools. Google researchers recently proposed a new approach called Vantage, which uses large language models (LLMs) to simulate realistic group interactions and score these skills accurately.


The research team found that the challenge of assessing durable skills lies in the tension between ecological validity and psychometric rigor: assessments need to take place in realistic contexts while remaining comparable and reproducible. Previous attempts, such as the collaborative problem-solving assessment in PISA 2015, relied on multiple-choice items and scripted interactions with simulated teammates; this controlled the variables but sacrificed realism. The Google team argues that LLMs can strike a balance between the two.

The core of Vantage is the "executing LLM" architecture, which uses a single LLM to generate the responses of all AI participants. Because one model coordinates the entire conversation, it can proactively steer the dialogue toward predefined educational standards. For conflict-resolution skills, for example, the executing LLM can deliberately introduce disagreements to test how the human participant responds. The study reports that, compared with uncoordinated independent agents, conversations driven by the executing LLM elicited significantly more evidence of the key target behaviors in two collaboration sub-skills.
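To make the contrast concrete, here is a minimal sketch, assuming a generic chat-completion client hidden behind a placeholder `call_llm` function; the prompt wording, rubric text, and function names are illustrative assumptions rather than the paper's actual implementation:

```python
# Hypothetical sketch (not the Vantage code): one "executing" LLM writes every
# AI teammate's next turn in a single coordinated call, steered by a rubric,
# while the baseline generates each teammate with its own uncoordinated call.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; returns the model's reply text."""
    raise NotImplementedError("plug in your LLM client here")

# Illustrative rubric text (an assumption, not the study's wording).
RUBRIC = (
    "Target sub-skill: conflict resolution. At least one AI teammate should "
    "raise a plausible disagreement so the human participant has a chance to "
    "negotiate and resolve it."
)

def executing_llm_turn(transcript: list[str], ai_names: list[str]) -> str:
    """Coordinated: the model sees the whole conversation plus the rubric and
    writes the next short message for *all* AI characters at once."""
    prompt = (
        f"Assessment goal:\n{RUBRIC}\n\n"
        "Conversation so far:\n" + "\n".join(transcript) + "\n\n"
        f"Write the next short message for each of: {', '.join(ai_names)}. "
        "Stay in character and steer the discussion toward the assessment goal."
    )
    return call_llm(prompt)

def independent_agents_turn(transcript: list[str], ai_names: list[str]) -> list[str]:
    """Baseline: each AI character replies via its own call, with no shared
    rubric-driven coordination between them."""
    return [
        call_llm(
            f"You are {name}, a teammate in a group task.\n"
            "Conversation so far:\n" + "\n".join(transcript) + "\n"
            "Reply with your next short message."
        )
        for name in ai_names
    ]
```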

The research team recruited 188 participants aged 18 to 25 and collected 373 conversation transcripts from 30-minute collaborative tasks with AI characters. The transcripts were scored by two human raters from New York University and by an AI evaluation tool, and the AI scores showed good agreement with the expert human scores. In creativity and critical thinking as well, the executing LLM outperformed independent agents, offering new directions for future educational assessment.
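On the scoring side, here is a hedged sketch of what rubric-based AI grading and a simple agreement check could look like; the 1-to-5 scale, the rubric prompt, and the Pearson-correlation agreement metric are assumptions for illustration, not the study's published protocol:

```python
# Hypothetical sketch (not the study's protocol): an LLM rates a finished
# transcript against a behavioral rubric, and its scores are compared with
# the average of the two human raters' scores.

import statistics

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API; returns the model's reply text."""
    raise NotImplementedError("plug in your LLM client here")

def ai_score(transcript: str, rubric: str) -> int:
    """Ask the model for a single integer rating on an assumed 1-5 scale."""
    reply = call_llm(
        f"Rubric:\n{rubric}\n\nTranscript:\n{transcript}\n\n"
        "Rate the human participant from 1 (no evidence of the skill) to "
        "5 (strong evidence). Answer with the number only."
    )
    return int(reply.strip())

def agreement(ai_scores: list[int], human_scores: list[float]) -> float:
    """Pearson correlation as a simple stand-in for an agreement statistic."""
    return statistics.correlation(ai_scores, human_scores)
```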

Key points:

📊 The Vantage method uses large language models to simulate realistic team interactions and accurately score durable skills.

🤖 The executing LLM architecture coordinates multiple AI characters, proactively guiding conversations to improve the assessment of key behaviors.

🎓 The study shows that AI scoring is consistent with expert human scoring, opening up new possibilities for educational assessment.