Hume AI has recently open-sourced TADA (Text-Acoustic Dual Alignment), its latest speech generation model and a text-to-speech (TTS) system built on large language models. Its text-acoustic dual alignment architecture is designed to improve generation efficiency and reliability and to broaden the range of scenarios in which the model can be deployed.

According to the official introduction, TADA keeps text tokens and acoustic representations in strict 1:1 synchronization, which is intended to eliminate the token-level content hallucinations common in traditional LLM-based TTS systems. In an evaluation on more than 1,000 test samples, the model reportedly produced no content hallucinations.

In terms of performance, TADA is said to generate audio more than five times faster than comparable LLM-based TTS systems while using far fewer resources: it needs only 2 to 3 frames per second of audio, whereas traditional solutions usually require 12.5 to 75. This allows the model to run inference locally on low-power hardware such as mobile phones and edge devices, without relying on cloud servers.

TADA is released as two pre-trained models: a 1B model (mainly for English) and a 3B multilingual model, based on Llama 3.2 3B, whose supported languages include Chinese. With a context window of 2048 tokens, the model can generate approximately 700 seconds of continuous audio in a single pass, far more than the roughly 70 seconds that traditional solutions can fit under the same token limit.
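The relationship between frame rate and audio length can be checked with simple arithmetic. The Python sketch below is a back-of-the-envelope illustration, not part of the TADA release: the helper name `max_audio_seconds` is hypothetical, and it assumes that every slot in the 2048-token context holds exactly one audio frame, ignoring any tokens taken up by the text prompt.

```python
def max_audio_seconds(context_tokens: int, frames_per_second: float) -> float:
    """Upper bound on audio length, assuming one audio frame per context slot.

    Hypothetical helper for illustration; it ignores prompt/text tokens.
    """
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return context_tokens / frames_per_second


CONTEXT_TOKENS = 2048  # context window quoted in the article

# Frame rates quoted in the article, in frames per second of audio.
quoted_rates = {
    "TADA, 2 frames/s": 2.0,
    "TADA, 3 frames/s": 3.0,
    "conventional, 12.5 frames/s": 12.5,
    "conventional, 75 frames/s": 75.0,
}

durations = {
    name: max_audio_seconds(CONTEXT_TOKENS, rate)
    for name, rate in quoted_rates.items()
}
for name, seconds in durations.items():
    print(f"{name:>28}: {seconds:7.1f} s")

# Frame rates implied by the quoted durations (700 s and 70 s).
implied_tada_rate = CONTEXT_TOKENS / 700
implied_conventional_rate = CONTEXT_TOKENS / 70
print(f"implied TADA rate:         {implied_tada_rate:.1f} frames/s")
print(f"implied conventional rate: {implied_conventional_rate:.1f} frames/s")
```

Under these assumptions, 700 seconds in 2048 tokens implies about 2.9 frames per second, in line with the 2 to 3 frames quoted for TADA, while 70 seconds implies about 29 frames per second, which falls inside the 12.5 to 75 range quoted for traditional systems.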

Another notable feature is synchronous transcription: the model outputs the corresponding text transcript while it generates speech, with no separate automatic speech recognition (ASR) step, so the text arrives with no additional delay. This is valuable for applications such as real-time subtitles, voice interaction, and content creation.

In human subjective evaluations, TADA ranked second in both naturalness and voice similarity, ahead of several systems with more parameters and more training data, which indicates competitive speech quality.

Link: https://huggingface.co/collections/HumeAI/tada