Artificial Analysis has released version 2.0 of its speech-to-text benchmark (AA-WER v2.0). The results show ElevenLabs and Google leading the field in audio transcription.

On the core word error rate (WER) metric, ElevenLabs' Scribe v2 ranked first with an error rate of just 2.3%. Google's Gemini 3 Pro followed closely at 2.9%. Notably, Google did not train Gemini specifically for transcription; its strong showing comes entirely from the model's general multimodal capabilities.
Other mainstream models performed as follows:
Mistral Voxtral Small: ranked third with an error rate of 3.0%.
Google Gemini 3 Flash: a solid showing at 3.1%.
OpenAI Whisper Large v3: the most widely used open-source model landed mid-pack at 4.2%.
Bottom tier: Alibaba's Qwen3-ASR Flash (5.9%), Amazon's Nova 2 Omni (6.0%), and Rev AI (6.1%) trailed the field.
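
For readers unfamiliar with the metric, WER is the minimum number of word-level substitutions, deletions, and insertions needed to turn a model's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it assumes simple whitespace tokenization and does not reflect the benchmark's specific text-normalization rules (casing, punctuation), which are not detailed here.

# Minimal sketch of a standard WER computation: word-level
# Levenshtein distance divided by reference length. Illustrative
# only; not Artificial Analysis's exact scoring pipeline.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("sat" -> "sit") and one deletion ("the"):
# 2 errors over 6 reference words, i.e. roughly 33% WER.
print(wer("the cat sat on the mat", "the cat sit on mat"))

By this definition, a 2.3% score means roughly one word-level error per 43 words of reference text, which is why differences of a few tenths of a percent separate the leaders here.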

In the specialized AA-AgentTalk test of voice-assistant commands, the ranking held: ElevenLabs' Scribe v2 and Google's Gemini 3 Pro led with error rates of 1.6% and 1.7%, respectively, showing high reliability on short, direct voice interactions.
