After AIGC swept through images and text, voice acting, long regarded as the last "human stronghold" of the film and television industry, is now being breached by Alibaba's Tongyi Lab. On March 16, the lab officially released and open-sourced Fun-CineForge, the world's first multimodal large model for film-level, multi-scenario voice acting.
For a long time, AI voice acting has struggled to shake off its "mechanical", "announcer-like" tone. In film and television scenarios especially, characters' emotional outbursts, ambient-sound mixing, and lip synchronization have remained gaps AI could not close. Fun-CineForge was built precisely to solve this problem.
The model adopts a revolutionary "data + model" integrated design: alongside the model itself, Tongyi Lab released a method for building a high-quality dataset. As a result, the AI no longer simply reads text aloud; it can understand the complex context of film and TV scenes, reproducing subtle emotional shifts and spatial audio effects across varied settings.
As a new member of the Alibaba Tongyi family, Fun-CineForge's open-source release carries real weight. It not only gives video creators a "film-level" post-production tool but also, through the spread of the technology, allows mid-length dramas and even individual creators to complete high-quality multilingual dubbing at very low cost.
From the earlier Qwen3-Omni to Fun-CineForge today, the Tongyi series is racing to complete the last piece of the multimodal puzzle. When AI truly learns to "act like a human," the logic of film translation and post-production may be rewritten entirely. The model and its dataset construction method are already available on open-source platforms. The popularization of "film-level AI" is arriving faster than we imagined.