Fish Audio officially released its new text-to-speech (TTS) model S2, marking a major breakthrough in expressiveness and controllability of open-source TTS technology.
The model emphasizes strong emotional controllability: users can make fine-grained prosody and emotion adjustments through natural-language instructions, inserting tags such as [laugh], [whispers], or [super happy], or even free-form descriptions like [professional broadcast tone] or [pitch up]. This enables precise control at the word or phrase level and produces highly expressive, natural-sounding speech.
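The tag mechanism above is just inline markup in the input text. As a minimal sketch, the snippet below composes such a tagged prompt; the tag names are taken from the announcement, but the helper function and the exact placement rules are illustrative assumptions, not part of any official Fish Audio SDK.

```python
# Sketch: composing a TTS input string with inline control tags.
# Tag names ([laugh], [whispers], ...) come from the announcement;
# the with_tag helper is purely illustrative.

def with_tag(tag: str, text: str) -> str:
    """Prefix a phrase with an inline control tag like [whispers]."""
    return f"[{tag}] {text}"

line = " ".join([
    with_tag("professional broadcast tone", "Welcome back to the show."),
    with_tag("super happy", "We have amazing news today!"),
    with_tag("laugh", "You won't believe it."),
])
print(line)
```

Because the tags live inside the text itself, emotion can change mid-sentence simply by tagging the next phrase.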
Key highlights include:
- Completely open source: Model weights, fine-tuning code, and an SGLang-based streaming inference engine are all publicly available on GitHub and Hugging Face. S2-Pro is the flagship version, at approximately 4.4 billion parameters.
- Ultra-low latency: Inference latency is under 150 milliseconds, suitable for real-time applications such as chatbots and virtual streamers.
- Native multi-speaker support: Multiple speakers can be handled in a single inference pass, with dialogue turns, interruptions, natural emotional flow, and consistent voices, all without additional post-processing.
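Since multiple speakers share a single inference pass, a whole conversation can be expressed as one script. The sketch below assumes speaker labels can be marked inline in the same bracketed style as the emotion tags; the actual input format S2 expects is not specified in this announcement.

```python
# Sketch: flattening a multi-speaker dialogue into one tagged script.
# The [Speaker] label convention is an assumption for illustration only.

turns = [
    ("Alice", "Did you hear the news?"),
    ("Bob", "[super happy] Yes, I just saw it!"),
    ("Alice", "[whispers] Keep it quiet for now."),
]

script = "\n".join(f"[{speaker}] {text}" for speaker, text in turns)
print(script)
```

Keeping the turns in one script is what lets the model carry emotion and timing across speaker changes, rather than stitching separately synthesized clips.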
Fish Audio stated that S2 was trained on approximately 10 million hours of audio spanning nearly 50 languages, and combines reinforcement-learning alignment with a dual autoregressive architecture. The company reports leading naturalness and expressiveness across multiple benchmarks, positioning S2 as one of the most emotionally expressive TTS systems among open-source and closed-source offerings alike. "True linguistic freedom starts now," Fish Audio declared, signaling that the era of AI speech with real emotion and personality has arrived.
GitHub: https://github.com/fishaudio/fish-speech/
Hugging Face: https://huggingface.co/fishaudio/s2-pro/
