The AI music field experienced another shockwave in early 2026. On March 9, SongGeneration2, a music foundation model jointly developed by Tencent and Tsinghua University's Human-Computer Speech Interaction Lab, was officially released. The model not only represents a qualitative leap in technical architecture but also surpasses current mainstream open-source models across multiple core dimensions, competing head-on with top commercial models in overall quality.

Three Breakthroughs: Making AI Music No Longer "Plastic"

SongGeneration2's core advantages stem from a comprehensive upgrade of its underlying architecture, which addresses three long-standing pain points of AI-generated music:

  • High musicality: Unlike simple melody stacking, this model can handle complex multi-track arrangements with strong spatial depth.

  • High lyric accuracy: Unclear pronunciation and hallucinated, off-pitch vocals are a thing of the past. Its phoneme error rate (PER) is as low as 8.55%, significantly better than the top commercial model Suno v5 (12.4%) and second only to MiniMax2.5.

  • Strong controllability: Whether it's text descriptions or audio prompts, it can accurately follow instructions, allowing for deep customization of style and emotion.
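The PER figure cited above is computed like word error rate, but over phoneme sequences: the Levenshtein edit distance between reference and hypothesis phonemes, divided by the reference length. A minimal sketch of that standard definition (the toy phoneme sequences are invented for illustration and are unrelated to the paper's actual evaluation pipeline):

```python
def per(ref, hyp):
    """Phoneme error rate: Levenshtein edit distance between the reference
    and hypothesis phoneme sequences, divided by the reference length."""
    n, m = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m] / n

ref = ["n", "i", "h", "ao"]  # toy reference phonemes
hyp = ["n", "i", "h", "o"]   # one substitution out of four
print(per(ref, hyp))  # 0.25
```

A reported PER of 8.55% therefore means roughly one phoneme in twelve is sung incorrectly, inserted, or dropped relative to the lyrics.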

"Dual-Core" Drive: A Dream Collaboration Between LLM and Diffusion Models

In terms of architectural design, SongGeneration2 adopts an innovative hybrid LLM-diffusion architecture:

  • Composing Brain (LeLM): Plans the global structure and vocal details, answering the question of "how to sing".

  • High-Fidelity Renderer (Diffusion): Synthesizes extremely complex acoustic details under the guidance of the language model.

  • Hierarchical Representation: The first model to adopt parallel modeling of a mixed (full-mix) representation and per-track representations, balancing melodic stability with fine-grained sound quality.
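The two-stage flow above can be caricatured in a few lines of Python. Everything here (the function names, the hash-based "planner", the interpolation "denoiser") is invented purely to illustrate the plan-then-render division of labor and bears no relation to the real LeLM or diffusion implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def plan_song(lyrics, n_frames=8, vocab=16):
    """Stage 1 (the "composing brain"): a stand-in for the LeLM language
    model, which plans a coarse discrete token sequence (structure, melody,
    vocal timing) from the lyrics. Here we simply hash characters."""
    return np.array([hash(c) % vocab for c in lyrics[:n_frames]])

def render_audio(tokens, steps=10, frame_len=4):
    """Stage 2 (the "high-fidelity renderer"): a stand-in for the diffusion
    model, which iteratively denoises audio conditioned on the planned
    tokens. Here: start from noise, nudge toward a token-derived target."""
    target = np.repeat(tokens.astype(float), frame_len)  # conditioning signal
    x = rng.normal(size=target.shape)                    # pure noise
    for t in range(steps):
        x = x + (target - x) / (steps - t)               # one "denoising" step
    return x

tokens = plan_song("la la la demo lyrics")  # coarse plan: 8 tokens
audio = render_audio(tokens)                # fine render: 32 samples
print(audio.shape)  # (32,)
```

The design point is the separation of concerns: the autoregressive stage only has to get the long-range structure right, while the diffusion stage only has to fill in locally plausible acoustic detail conditioned on that plan.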

True Open Source, Low Barrier: Ordinary Computers Can Also "Write Songs"

For developers, the most exciting part is Tencent's openness. The 4B-parameter SongGeneration-v2-large model has been officially open-sourced, supporting multilingual generation including Chinese and English. Notably, it runs smoothly on consumer-grade hardware with 22GB of VRAM, making local, private creation possible.

To let users try it immediately, the project team also released the SongGeneration-v2-Fast version on HuggingFace, which trades a minimal loss in audio quality for ultra-fast generation: a complete song produced in under a minute.

Judging by SongGeneration2's performance, AI music has officially moved from "geek toy" into the realm of commercial-grade applications. With the planned open-sourcing of a Medium model requiring only 12GB of VRAM, along with an automated evaluation framework, the era in which everyone can be a "composer" may truly be near.