Fish Audio officially released its new text-to-speech (TTS) model S2, marking a major breakthrough in expressiveness and controllability of open-source TTS technology.
The model emphasizes strong emotional controllability: users can make fine-grained prosody and emotion adjustments through natural-language instructions, inserting tags such as [laugh], [whispers], or [super happy], or even free-form descriptions like [professional broadcast tone] or [pitch up]. This enables precise control at the word or phrase level and produces highly expressive, natural-sounding speech.
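The tag mechanism above is just inline markup in the input text. As a minimal sketch, the snippet below composes such a tagged prompt; the tag names are taken from the announcement, but the helper function and the exact placement rules are illustrative assumptions, not part of any official Fish Audio SDK.

```python
# Sketch: composing a TTS input string with inline control tags.
# Tag names ([laugh], [whispers], ...) come from the announcement;
# the with_tag helper is purely illustrative.

def with_tag(tag: str, text: str) -> str:
    """Prefix a phrase with an inline control tag like [whispers]."""
    return f"[{tag}] {text}"

line = " ".join([
    with_tag("professional broadcast tone", "Welcome back to the show."),
    with_tag("super happy", "We have amazing news today!"),
    with_tag("laugh", "You won't believe it."),
])
print(line)
```

Because the tags live inside the text itself, emotion can change mid-sentence simply by tagging the next phrase.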
Key highlights include:
- Completely open source: Model weights, fine-tuning code, and an SGLang-based streaming inference engine are all publicly available on GitHub and Hugging Face. S2-Pro is the flagship version, at approximately 4.4 billion parameters.
- Ultra-low latency: Inference latency is under 150 milliseconds, suitable for real-time applications such as chatbots and virtual streamers.
- Native multi-speaker support: Multiple speakers can be handled in a single inference pass, with dialogue turns, interruptions, natural emotional flow, and consistent voices, all without additional post-processing.
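Since multiple speakers share a single inference pass, a whole conversation can be expressed as one script. The sketch below assumes speaker labels can be marked inline in the same bracketed style as the emotion tags; the actual input format S2 expects is not specified in this announcement.

```python
# Sketch: flattening a multi-speaker dialogue into one tagged script.
# The [Speaker] label convention is an assumption for illustration only.

turns = [
    ("Alice", "Did you hear the news?"),
    ("Bob", "[super happy] Yes, I just saw it!"),
    ("Alice", "[whispers] Keep it quiet for now."),
]

script = "\n".join(f"[{speaker}] {text}" for speaker, text in turns)
print(script)
```

Keeping the turns in one script is what lets the model carry emotion and timing across speaker changes, rather than stitching separately synthesized clips.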
Fish Audio stated that S2 was trained on approximately 10 million hours of audio spanning nearly 50 languages, and combines reinforcement-learning alignment with a dual autoregressive architecture. The company reports leading naturalness and expressiveness across multiple benchmarks, positioning S2 as one of the most emotionally expressive TTS systems among open-source and closed-source offerings alike. "True linguistic freedom starts now," Fish Audio declared, signaling that the era of AI speech with real emotion and personality has arrived.
GitHub: https://github.com/fishaudio/fish-speech/
Hugging Face: https://huggingface.co/fishaudio/s2-pro/
