Today, the Speech Team of Alibaba's Tongyi Lab announced two new voice generation models: Fun-CosyVoice3.5 and Fun-AudioGen-VD. The headline feature of both is support for "FreeStyle" instructions: instead of tuning complex parameters, users can precisely control how a voice delivers its lines, or build an elaborate audio scene from scratch, simply by writing a natural-language description.

The two models focus on different capabilities:
Fun-CosyVoice3.5: Multilingual Replication and Fine-grained Control
This model is an upgraded version of the earlier CosyVoice; its core breakthrough is in understanding natural-language descriptions of speech expression.
Command-based Generation: Users can type instructions such as "speak more confidently" or "slow down and add some emotional fluctuation," and the model adjusts its output in real time (a minimal sketch of this workflow follows the list below).
Language Expansion: It adds Thai, Indonesian, Portuguese, and Vietnamese, and maintains industry-leading intelligibility (word error rate, WER) and voice similarity across its 13 supported languages.
Rare-Character Optimization: Targeted optimization cuts the pronunciation error rate on rare characters from 15.2% to 5.3%.
Performance Improvement: First-packet latency has been cut by 35%, making real-time interactive scenarios noticeably smoother.
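To make the instruction workflow concrete, here is a minimal sketch of what instruction-driven synthesis could look like from Python. The announcement does not publish an API, so the funaudio package, the FunCosyVoice class, the checkpoint name, and the inference_instruct() method below are hypothetical placeholders, not a documented interface.

```python
# Minimal sketch of "FreeStyle" instruction-driven synthesis.
# NOTE: the funaudio package, FunCosyVoice class, checkpoint id, and
# inference_instruct() method are hypothetical assumptions for illustration;
# the announcement does not document an actual API.
import torchaudio
from funaudio import FunCosyVoice  # hypothetical import

model = FunCosyVoice.from_pretrained("Fun-CosyVoice3.5")  # hypothetical id

# A plain natural-language instruction replaces manual parameter tuning.
result = model.inference_instruct(
    text="Welcome back. Let's pick up right where we left off.",
    instruct="Speak more confidently, slow the pace slightly, "
             "and add some gentle emotional fluctuation.",
)
torchaudio.save("confident.wav", result.speech, result.sample_rate)
```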
Fun-AudioGen-VD: Full-scene Sound Design
This model acts more like a "sound director," generating integrated audio that combines characters and scenes.
Voice Customization: It supports specifying gender, age, and accent, down to finer details such as a "hoarse, deep, or low-pitched" timbre.
Emotion and Role: It can play roles such as customer-service agents, announcers, and children, and even convey layered psychological states such as "outwardly calm but inwardly trembling."
Environmental Immersion: It supports layering in background sound (such as battlefield noise or café chatter) and spatial effects (such as cathedral echoes or a muffled underwater feel), producing a fully simulated sense of space (see the sketch after this list).
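Similarly, a "character + scene" request to Fun-AudioGen-VD might look like the sketch below. Again, every identifier here (the funaudio package, the FunAudioGenVD class, and the generate() signature) is an illustrative assumption rather than a published interface.

```python
# Minimal sketch of integrated "character + scene" audio generation.
# NOTE: the funaudio package, FunAudioGenVD class, checkpoint id, and
# generate() signature are hypothetical assumptions for illustration.
import torchaudio
from funaudio import FunAudioGenVD  # hypothetical import

model = FunAudioGenVD.from_pretrained("Fun-AudioGen-VD")  # hypothetical id

# One free-form description covers the speaker, the emotion, and the scene.
result = model.generate(
    description=(
        "A middle-aged announcer with a hoarse, low-pitched voice, "
        "outwardly calm but inwardly trembling, heard over distant "
        "battlefield noise with a cathedral-like echo."
    ),
    text="Stay where you are. Help is on the way.",
)
torchaudio.save("scene.wav", result.audio, result.sample_rate)
```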
Tongyi Lab said the two releases will further lower the barrier to high-quality voice creation, providing strong AI support for fields such as podcasting, game development, and film post-production.
