27B Math SOTA and 3-Second Emotional Voice Cloning: Youdao Fully Open-Sources the Confucius4 (Zi Yue 4) Multimodal Model and TTS Engine

NetEase Youdao recently announced a comprehensive upgrade of its "Confucius" large model to version 4.0. With this release, "Confucius4" enters the full-modal era, supporting integrated interaction across text, images, and audio. Youdao also announced that the core multi-modal model and text-to-speech (TTS) model have been officially open-sourced. The translation model has likewise been deeply re-architected, improving both translation quality and efficiency.

Multi-modal model reaches SOTA in visual math, with strong results on text-only math problems

According to Youdao, the open-sourced "Confucius4" multi-modal model has 27B parameters and brings math problem solving from visual input to a state-of-the-art (SOTA) level for educational scenarios. Among models of the same parameter scale, it performs strongly on advanced visual math problems and on physics problems that include charts. Its performance on Chinese text-only math problems has also improved significantly, reaching an accuracy of 81.4%, which Youdao describes as industry-leading.

▲ Confucius4 reaches the best level among models of the same scale on multiple visual math benchmarks

Image source: https://huggingface.co/netease-youdao/Confucius4

A more important breakthrough is practical cost-effectiveness. According to Youdao officials, the new model adopts a refined reasoning-chain reconstruction scheme: it is optimized on a large collection of high-quality, concise reasoning samples, which shortens the reasoning-chain output by 43.2%.

This means the model can answer faster, with fewer tokens and shorter reasoning paths, significantly reducing inference costs for enterprises and developers in real business scenarios.
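To make the cost effect concrete, the short sketch below applies the reported 43.2% reduction to an example workload. Only the 43.2% figure comes from the announcement; the baseline token count, the price per million tokens, and the function names are hypothetical values chosen for illustration.

```python
# Illustrative arithmetic only: the 43.2% reduction is the reported figure,
# every other number below is an assumption made for this example.
REDUCTION = 0.432  # reported reduction in reasoning-chain output length


def compressed_tokens(baseline_tokens: int, reduction: float = REDUCTION) -> int:
    """Expected output tokens after the reduction, rounded to the nearest token."""
    return round(baseline_tokens * (1.0 - reduction))


def output_cost(tokens: int, price_per_million: float) -> float:
    """Cost of generating `tokens` output tokens at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


if __name__ == "__main__":
    baseline = 2_000  # assumed output tokens per answer before optimization
    price = 10.0      # assumed price per million output tokens (any currency)

    after = compressed_tokens(baseline)
    print("tokens before/after:", baseline, after)
    print("cost before/after:", output_cost(baseline, price), output_cost(after, price))
```

If output tokens are billed linearly, the cost per answer falls by the same 43.2%, whatever the actual price.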

▲ Confucius4 significantly reduces the number of output tokens on multiple visual math benchmarks

Image source: https://huggingface.co/netease-youdao/Confucius4

In addition, the Confucius4 research team has optimized the model in depth for the real homework, exam, and question-answering scenarios of Chinese students, so that it can solve the learning problems they actually face and serve as a more compassionate digital assistant.

Open-source TTS: supports 14 languages, clones original voice in 3 seconds, no accent issues across languages

The speech synthesis (TTS) engine was open-sourced alongside the multi-modal model. Built on a "speech encoder + LLM" architecture, it gives developers and content creators zero-shot, low-barrier voice cloning and emotional speech synthesis.

Currently, it fully supports 14 languages, including Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese. The system can naturally transfer the speaker's voice across different languages without additional training, maintaining voice consistency while ensuring native-level naturalness and fluency in the synthesized results, with no accent leakage during cross-language cloning.

For voice cloning, Confucius4 supports "upload and clone": users provide any audio sample, and the system replicates the original voice within three seconds. According to Youdao, the engine's accuracy on cloning tasks exceeds 97%, and the similarity between the cloned voice and the original is over 85%. While preserving the speaker's distinctive vocal characteristics, it also reproduces their emotional tone, placing its overall capabilities in the top tier of the field.
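The announcement does not say how the similarity figure is measured. One common way to score voice similarity in TTS evaluation is the cosine similarity between speaker embeddings of the reference and the synthesized audio. The sketch below shows only that scoring step; the embedding vectors are toy values, and nothing here is taken from the Confucius4 code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 4-dimensional "speaker embeddings"; real ones come from a speaker
    # encoder and typically have hundreds of dimensions.
    reference = [0.9, 0.1, 0.3, 0.2]
    cloned = [0.8, 0.2, 0.3, 0.1]
    print(f"similarity: {cosine_similarity(reference, cloned):.3f}")
```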

Additionally, this open-source model demonstrates good robustness in real multilingual scenarios, capable of handling various synthesis needs such as daily conversations, news broadcasts, and corporate promotions, as well as complex emotional expressions.

Translation model quality upgraded comprehensively, with 80% faster inference speed

As one of Youdao's deepest technical assets, the translation model also received a significant upgrade in this release, further improving its performance on translation tasks.

On the data side, the Confucius team collected and cleaned multilingual data at the billion scale and hired professionals with TEM-8 certification to perform multidimensional manual evaluation, ensuring corpus quality at the source.

On the algorithm side, the model uses an innovative "multi-expert OPD" mode, which takes a "soft" approach to drawing on the strengths of multiple experts. It also introduces format rewards and a language detection mechanism through reinforcement learning, addressing common machine translation problems such as out-of-context translations and mixed-language output.
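The announcement gives no implementation details for these rewards. As a rough illustration of the idea, the sketch below combines a format check with a script-based language check into one rule-based score. The function names, the line-count format rule, the 0.9 threshold, and the equal weighting are all assumptions for this example, not Youdao's design.

```python
import unicodedata

# Hypothetical mapping from target language to Unicode script name prefixes.
SCRIPT_PREFIXES = {
    "en": ("LATIN",),
    "zh": ("CJK",),
    "ru": ("CYRILLIC",),
}


def script_match_ratio(text: str, target_lang: str) -> float:
    """Fraction of letters in `text` written in the target language's script."""
    prefixes = SCRIPT_PREFIXES[target_lang]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(unicodedata.name(ch, "").startswith(prefixes) for ch in letters)
    return hits / len(letters)


def format_reward(source: str, output: str) -> float:
    """1.0 if the output is non-empty and keeps the source's line count."""
    if not output.strip():
        return 0.0
    same_lines = len(output.splitlines()) == len(source.splitlines())
    return 1.0 if same_lines else 0.0


def language_reward(output: str, target_lang: str, threshold: float = 0.9) -> float:
    """1.0 if nearly all letters are in the target script, else 0.0."""
    return 1.0 if script_match_ratio(output, target_lang) >= threshold else 0.0


def rule_reward(source: str, output: str, target_lang: str) -> float:
    """Equal-weight combination of the two rule-based checks."""
    return 0.5 * format_reward(source, output) + 0.5 * language_reward(output, target_lang)


if __name__ == "__main__":
    print(rule_reward("你好，世界", "Hello, world", "en"))  # clean English output
    print(rule_reward("你好，世界", "Hello 世界", "en"))     # mixed-language output
```

A script check like this only catches mixing across writing systems. Telling apart languages that share a script, such as English and French, would need an actual language identification model.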

To meet the needs of high-frequency, high-concurrency industrial applications, the updated translation model includes an efficient acceleration mechanism that increases overall inference speed by 80%. Combined with a customized evaluation scheme that pairs automated large-model evaluation with random manual sampling, the new translation model delivers high speed and quality across text, image, and document translation.
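An 80% increase in speed is not the same as an 80% reduction in time. The sketch below works through the arithmetic, assuming that "inference speed" means throughput (work completed per unit time). The announcement does not define the term, and the 1.8-second baseline latency is a made-up example value.

```python
def latency_after_speedup(baseline_seconds: float, speed_gain: float = 0.80) -> float:
    """Time for the same work when throughput rises by `speed_gain` (0.80 = +80%)."""
    return baseline_seconds / (1.0 + speed_gain)


def time_saved_fraction(speed_gain: float = 0.80) -> float:
    """Fraction of the original time that is saved at the higher throughput."""
    return 1.0 - 1.0 / (1.0 + speed_gain)


if __name__ == "__main__":
    baseline = 1.8  # assumed seconds per request before the upgrade
    print(f"latency after: {latency_after_speedup(baseline):.2f} s")
    print(f"time saved:    {time_saved_fraction():.1%}")
```

Under this reading, a request that took 1.8 seconds would take about 1.0 second, a time saving of roughly 44%.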

Youdao's AI work runs from the initial launch of Confucius as the first education-focused large model, through the "virtual speaking coach Hi Echo" that upended traditional oral practice methods, to the deployment of Confucius 2.0 and 3.0 across its software and hardware ecosystems. Throughout, Youdao has stayed at the forefront of applying AI to real scenarios. In 2026, Youdao accelerated its AI applications, launching a series of AI Agent products such as LobsterAI, Youdao Treasure, Youdao Conference Agent, and Thinkflow to build out a full-scenario AI Agent matrix.

The "Confucius4" upgrade and the full open-sourcing of its core models significantly lower the barrier for developers working on multi-modal and speech synthesis applications, and show how underlying core technology feeds the upper-layer Agent matrix in a closed loop. Youdao hopes that, with contributions from developers worldwide and the open-source community, this full-modal large model ecosystem will drive real productivity gains across more industries.

Appendix: Open-source links:

"Confucius4" multi-modal model: https://huggingface.co/netease-youdao/Confucius4

"Confucius4" TTS model: https://github.com/netease-youdao/Confucius4-TTS
