In the field of AI-assisted content creation, ByteDance's Volcano Engine has recently delivered a major release. On June 23, the
Previously, creators often had to generate dialogue, sound effects, and background music separately, then align and mix them by hand, which demanded strong post-production skills. Douyin Audio Generation Model 1.0 condenses this process into a single step: users enter a prompt describing character dialogue, emotional tone, background music, and ambient atmosphere, and the model directly produces a complete audio piece with narrative tension.

To address the common problem of "character confusion" in long-form audio, the model tightly integrates text-to-audio generation with reference audio. Whether the project is a long audiobook or a complex podcast, the model keeps each character's voice consistent across multiple extended audio segments. This consistency throughout an entire production meets the strict demands professional creators place on long-form generation.
The model also offers strong "zero-shot multimodal audio creation" capabilities. Given a text description or a reference audio clip, creators can obtain high-quality target audio without any additional training. The model decouples voice from style control and supports "one voice for multiple roles," so the same voice can remain highly expressive across different emotions and scenarios, significantly lowering the barrier to professional audio production.
Volcano Ark has now opened API testing for the model, and individual users receive a 30-minute creation quota. With the technology soon to launch on platforms such as CapCut, Ji Meng, and Tomato, audio creation is shifting from cumbersome "editing and splicing" to efficient "creative direction." The model is not only a technical breakthrough; it also signals that AI is becoming a powerful "all-around assistant" in the hands of content creators.
