Alibaba has officially launched Qwen3-TTS, the latest member of the Qwen3 family, a text-to-speech model billed as "zero-shot, multi-role, cross-language." The new model significantly outperforms mainstream commercial engines on international word error rate (WER) benchmarks and is now available on the Alibaba Cloud console, where developers can synthesize 1 million characters for free.
49 high-quality voices, one-click role switching
From a gentle young woman to a dialect-speaking uncle, Qwen3-TTS includes 49 official voices covering scenarios such as narration, customer service, live streaming, and education. It supports 10 languages plus 9 Chinese dialects (Cantonese, Sichuan dialect, Northeastern dialect, etc.) and can switch voices instantly for the same text without retraining.

Text → Tone → Rhythm, fully automatic "human-like"
The model combines an autoregressive acoustic model with a prosody prediction module, which automatically adjusts pitch and inserts pauses based on punctuation and emotion tags. At a 48 kHz sampling rate, its MOS (mean opinion score) reaches 4.53, well above the industry average of 4.1.
WER significantly better than commercial models
On public multilingual speech synthesis test sets (MLS + Common Voice), Qwen3-TTS brings English WER down to 2.8% and Chinese WER down to 1.9%, reductions of 18% and 24% respectively compared with Azure TTS, setting a new open-source SOTA.
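For readers unfamiliar with the metric: WER for a TTS system is typically measured by transcribing the synthesized audio with a speech recognizer and comparing the transcript against the input text. The following is a minimal sketch of the standard word-level calculation, not the evaluation code used for the figures above. The function name `word_error_rate` is hypothetical, and whitespace tokenization is an assumption; Chinese is usually scored at the character level instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (Levenshtein, word level).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 2.8% therefore means roughly 2.8 word-level errors per 100 reference words in the recognized transcript.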
Zero-shot application in educational scenarios
Alibaba Cloud also released the "One-click Read" plugin, which lets teachers upload PPT slides and automatically generate lecture audio in a chosen dialect. It has already been piloted in 120 primary and secondary schools in Shanghai, helping students practice writing words dictated in their "native dialect."
Pricing and Access
- Free tier: 1 million characters per month, with unlimited use of all 49 voices
- Paid tier: 0.8 yuan per 10,000 characters, supports SSML and real-time streaming synthesis
- Console: console.aliyun.com → Artificial Intelligence → Speech Synthesis → Qwen3-TTS (fully available)
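To make the pricing concrete, here is a rough cost estimator based only on the two figures listed above. It is a sketch under stated assumptions: that the free quota is deducted first and only the excess is billed, and that billing is linear with no rounding or minimum charge. The article does not confirm either detail, and the names `FREE_CHARS_PER_MONTH`, `PRICE_YUAN_PER_10K_CHARS`, and `monthly_cost_yuan` are illustrative.

```python
FREE_CHARS_PER_MONTH = 1_000_000   # free tier quoted in the article
PRICE_YUAN_PER_10K_CHARS = 0.8     # paid tier quoted in the article


def monthly_cost_yuan(chars_used: int) -> float:
    """Estimate the monthly bill in yuan for a given character count.

    Assumption: the free quota is consumed first and only the excess
    is billed, linearly and without rounding.
    """
    if chars_used < 0:
        raise ValueError("chars_used must be non-negative")
    billable_chars = max(0, chars_used - FREE_CHARS_PER_MONTH)
    return billable_chars / 10_000 * PRICE_YUAN_PER_10K_CHARS


if __name__ == "__main__":
    for usage in (500_000, 1_000_000, 3_000_000):
        print(f"{usage:>9,} chars: {monthly_cost_yuan(usage):.2f} yuan")
```

Under these assumptions, 3 million characters in a month would leave 2 million billable characters, or about 160 yuan.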
Next Steps
Alibaba revealed that in Q1 2025 it will open a "10-second voice cloning" interface, which lets users create a private custom voice by uploading a short audio clip, and will release an 80 kHz ultra-high-sample-rate version aimed at the podcast, audiobook, and virtual idol markets.
Industry Insights
The TTS field is moving from "intelligible" to "full of character." Qwen3-TTS is challenging the commercial territory of Azure and AWS with a combination of open source and low cost, while offering "zero-shot" solutions for live streaming, customer service, and education scenarios. With the release of the voice cloning and ultra-high-sample-rate versions, speech generation may enter a new era in which everyone can have their own narrator. AIbase will continue to track the release progress of the voice cloning interface and its commercial use cases.
Project Address: https://modelscope.cn/studios/Qwen/Qwen3-TTS-Demo
