Hume AI Open Sources TADA: 5x Speed, Zero Hallucination TTS That Can Run 700-Second Audio on Mobile Devices

Hume AI has recently open-sourced its latest speech generation model, TADA (Text-Acoustic Dual Alignment), a text-to-speech (TTS) system built on large language models. Its text-acoustic dual alignment architecture is designed to improve generation efficiency and reliability and to broaden the range of deployment scenarios.

According to the official introduction, TADA keeps text tokens and acoustic representations in strict 1:1 synchronization, which is intended to eliminate the token-level content hallucinations common in traditional LLM-based TTS systems. In an evaluation of more than 1,000 test samples, the model produced zero content hallucinations.

In terms of performance, TADA generates audio more than five times faster than comparable LLM-based TTS systems while using far fewer resources: it needs only 2-3 frames per second of audio, whereas traditional approaches typically require 12.5 to 75. This allows the model to run inference locally on low-power hardware such as mobile phones and edge devices, without relying on cloud servers.

TADA is available as a 1B pre-trained model (mainly for English) and a 3B multilingual model, based on Llama 3.2 3B, that supports multiple languages including Chinese. With a 2048-token context window, the model can generate approximately 700 seconds of continuous audio in one pass, compared with about 70 seconds for traditional approaches under the same token limit.
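The relationship between frame rate and maximum audio length can be checked with simple arithmetic. The sketch below is illustrative only: it assumes each acoustic frame occupies one position in the context window and ignores any text or prompt tokens that share the window, and the function names are hypothetical rather than part of any TADA API.

```python
# Back-of-the-envelope check of the figures quoted in the article.
# Assumption: one acoustic frame uses one context position, and no
# context is reserved for text or prompt tokens.

CONTEXT_TOKENS = 2048  # context window size stated in the article


def max_audio_seconds(context_tokens: int, frames_per_second: float) -> float:
    """Seconds of audio that fit in the context at a given frame rate."""
    return context_tokens / frames_per_second


def implied_frames_per_second(context_tokens: int, audio_seconds: float) -> float:
    """Frame rate implied by fitting `audio_seconds` into the context."""
    return context_tokens / audio_seconds


if __name__ == "__main__":
    # Frame rates quoted in the article: 2-3 for TADA, 12.5-75 for others.
    for label, fps in [
        ("TADA at 2 frames/s", 2.0),
        ("TADA at 3 frames/s", 3.0),
        ("conventional at 12.5 frames/s", 12.5),
        ("conventional at 75 frames/s", 75.0),
    ]:
        seconds = max_audio_seconds(CONTEXT_TOKENS, fps)
        print(f"{label}: about {seconds:.0f} s of audio")

    # Frame rates implied by the quoted durations (700 s and 70 s).
    for duration in (700.0, 70.0):
        fps = implied_frames_per_second(CONTEXT_TOKENS, duration)
        print(f"{duration:.0f} s in {CONTEXT_TOKENS} tokens implies {fps:.2f} frames/s")
```

Under these assumptions, the quoted 700 seconds corresponds to a rate just under 3 frames per second, and the quoted 70 seconds corresponds to roughly 29 frames per second, which falls inside the 12.5 to 75 range given for traditional systems.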

Another important innovation is the synchronous transcription feature: the model directly outputs corresponding text transcriptions while generating speech, without requiring an additional separate speech recognition (ASR) process, thus achieving zero additional delay in text output. This feature holds significant value for real-time subtitles, voice interaction, and content creation applications.

In human subjective evaluations, TADA ranks second in naturalness and voice similarity, surpassing several systems with larger parameter scales and more training data, demonstrating highly competitive speech quality.

Link: https://huggingface.co/collections/HumeAI/tada

Musk's xAI Launches Voice API: The AI Mouth Replacement Battle Rages On

Musk's xAI has officially launched the Grok Text to Speech API, giving AI assistants voice interaction capabilities. The move expands Grok's multimodal functions and gives developers a convenient interface for integrating its conversational abilities into their applications, supporting the development of a more human-like AI ecosystem.

Reverie Launches a Speech Recognition Model Dedicated to India, Outperforming Deepgram

Reverie has launched a new speech-to-text model that supports Hindi, English, and mixed Hinglish speech, tailored to India's multilingual environment. The model has processed 3 million API calls and has shown high accuracy and fast response times in industries such as banking and call centers.

Hume AI Voice Conversion Feature Launches - Capture Your Perfect Voice Soul in One Go

Hume AI's new 'Voice Conversion' feature enables users to transfer their vocal rhythm, pronunciation, and intonation to any target voice with just one recording. Now available in Creator Studio and API, it shifts voice AI from robotic speech to emotional expression, unlocking creative possibilities...

Chinese Visual and Speech Open Source Model VITA-1.5 Released with GPT-4o Level Advanced Speech and Visual Capabilities

Recently, significant progress has been made in multimodal large language models (MLLMs), particularly in the integration of visual and text modalities. However, with the increasing prevalence of human-computer interaction, the importance of the speech modality has become more prominent, especially in multimodal dialogue systems. Speech is not only a key medium for information transmission but also significantly enhances the naturalness and convenience of interactions. Nevertheless, due to the inherent differences between visual and speech data, integrating them into MLLMs is not an easy task. For example, visual data conveys spatial information, while speech data conveys information in a temporal sequence.

Say Goodbye to Voice Cloning Infringement! Hume AI Launches Voice Control Feature to Create Personalized AI Voices

Hume AI, a startup focused on emotionally intelligent voice interfaces, has recently launched an experimental feature called 'Voice Control.' The tool is designed to help developers and users create personalized AI voices without any coding, AI prompt engineering, or sound design skills. Users can customize a voice to meet their needs by precisely adjusting its characteristics. The feature builds on the company's previously launched 'Empathetic Voice Interface 2' (EVI2), which enhances the naturalness of speech.