It Starts with a Character! Alibaba's Qwen3-TTS Makes Its Debut: 49 Voices, 10 Languages, 9 Dialects, and a WER That Outperforms Mainstream Commercial Models

Alibaba has officially launched Qwen3-TTS, the latest member of the Qwen3 family, built around "zero-shot, multi-role, cross-language" text-to-speech synthesis. According to the announcement, the new model significantly outperforms mainstream commercial engines on international word error rate (WER) benchmarks. It is now available on the Alibaba Cloud console, where developers get a free quota of 1 million characters.

49 high-quality voices, one-click role switching

From a gentle young woman to a dialect-speaking uncle, Qwen3-TTS ships with 49 official voices covering scenarios such as narration, customer service, live streaming, and education. It supports 10 languages plus 9 Chinese dialects (Cantonese, Sichuan dialect, Northeastern dialect, etc.) and can switch voices instantly for the same text without retraining.

Text → Tone → Rhythm, fully automatic "human-like"

The model combines an autoregressive acoustic model with a prosody prediction module that automatically adjusts pitch and inserts pauses based on punctuation and emotional tags. At a 48kHz sampling rate, its MOS score reaches 4.53, significantly higher than the industry average of 4.1.
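To make the idea of punctuation-driven pauses concrete, here is a toy, rule-based sketch in Python. It is not how Qwen3-TTS works internally (the article describes a learned prosody prediction module); it only illustrates the input-to-pause mapping by emitting W3C SSML `<break>` elements. The pause durations and the function name `text_to_ssml` are assumptions made for this example, and whether the service accepts these exact SSML elements is not stated in the article.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pause lengths in milliseconds (illustrative values only,
# not taken from Qwen3-TTS).
PAUSE_MS = {",": 200, ";": 300, ".": 500, "?": 500, "!": 500}


def text_to_ssml(text: str) -> str:
    """Wrap text in <speak> and add a <break> after each punctuation mark."""
    pieces = []
    # The capturing group keeps the punctuation marks in the split result.
    for part in re.split(r"([,;.?!])", text):
        if part in PAUSE_MS:
            pieces.append(f'{part}<break time="{PAUSE_MS[part]}ms"/>')
        else:
            # Escape &, < and > so the result stays well-formed XML.
            pieces.append(escape(part))
    return "<speak>" + "".join(pieces) + "</speak>"


if __name__ == "__main__":
    print(text_to_ssml("Hello, world."))
```

A learned prosody model would presumably choose pause lengths and pitch contours from context rather than from a fixed table like this one.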

WER significantly better than commercial models

On public multilingual speech test sets (MLS + Common Voice), Qwen3-TTS brings its English WER down to 2.8% and its Chinese WER down to 1.9%, reductions of 18% and 24%, respectively, compared with Azure TTS, setting a new open-source SOTA.
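For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference text and a transcript, divided by the number of reference words. For TTS, a common practice is to transcribe the synthesized audio with a speech recognizer and compare the transcript against the input text; the article does not describe the exact evaluation setup, so the sketch below only shows the metric itself. The function name `word_error_rate` is chosen for this example, and splitting on whitespace is an assumption that suits English (Chinese is often scored per character instead).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Read this way, a WER of 2.8% corresponds to roughly 28 word errors per 1,000 reference words.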

Zero-shot application in educational scenarios

Alibaba Cloud also released the "One-click Read" plugin, which lets teachers upload PPT slides and automatically generates lecture audio in dialect. It has already been piloted in 120 primary and secondary schools in Shanghai, helping students write words in their "native dialect."

Pricing and Access

- Free tier: 1 million characters per month, with unrestricted use of all 49 voices

- Paid tier: 0.8 yuan per 10,000 characters, with support for SSML and real-time streaming synthesis

- Console: console.aliyun.com → Artificial Intelligence → Speech Synthesis → Qwen3-TTS (fully available)
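As a rough illustration of what these prices mean in practice, the following Python sketch estimates a monthly bill. It assumes the paid rate applies only to characters beyond the 1 million free characters each month; the article does not say how the two tiers interact, so treat this as one possible reading. The function name `estimate_monthly_cost` is made up for this example.

```python
from decimal import Decimal

FREE_CHARS_PER_MONTH = 1_000_000      # free tier quoted in the article
PRICE_PER_10K_CHARS = Decimal("0.8")  # yuan, paid tier quoted in the article


def estimate_monthly_cost(chars_used: int) -> Decimal:
    """Estimated cost in yuan, assuming only usage above the free quota is billed."""
    if chars_used < 0:
        raise ValueError("chars_used must be non-negative")
    # Assumption: the free quota is subtracted before the paid rate applies.
    billable = max(0, chars_used - FREE_CHARS_PER_MONTH)
    return Decimal(billable) / Decimal(10_000) * PRICE_PER_10K_CHARS


if __name__ == "__main__":
    for chars in (500_000, 3_000_000):
        print(chars, "characters:", estimate_monthly_cost(chars), "yuan")
```

Under this assumption, 3 million characters in a month leaves 2 million billable characters, or 200 units of 10,000 characters at 0.8 yuan each, for a total of 160 yuan.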

Next Steps

Alibaba revealed that in Q1 2025 it plans to open a "10-second voice cloning" interface that lets users create private speaker voices by uploading a short audio clip. It also plans to release an 80kHz ultra-sampling version aimed at the podcast, audiobook, and virtual idol markets.

Industry Insights

The TTS field is moving from "intelligible" to "full of character." With its combination of open source and low cost, Qwen3-TTS is challenging the commercial offerings of Azure and AWS while providing "zero-shot" solutions for live streaming, customer service, and education scenarios. With the release of the voice cloning and ultra-sampling versions, speech generation may enter a new era in which everyone can have their own narrator. AIbase will continue to track the release of the voice cloning interface and its commercial use cases.

Project Address: https://modelscope.cn/studios/Qwen/Qwen3-TTS-Demo

