Artificial intelligence has made significant progress in audio generation, but reliably "editing" existing audio remains a major challenge. Recently, Tencent Hy, in collaboration with several leading research institutions including Shanghai Jiao Tong University (SJTU), Nanyang Technological University (NTU), Tianjin University (TJU), Peking University (PKU), and Fudan University (FDU), jointly launched MMAE (Massive Multitask Audio Editing Benchmark), the first large-scale multi-task benchmark for general instruction-driven audio editing. The release gives the AI audio editing field a systematic evaluation standard and exposes clear weaknesses of current models in making precise modifications.

From "Generation" to "Editing": The Real Test of AI Audio Capabilities

Traditional audio AI mainly focuses on generating new content from text or prompts. The MMAE benchmark instead requires models to understand an existing audio segment and modify it precisely according to a natural language instruction: change only the parts that need changing and leave everything else untouched. This "editing rather than reconstruction" capability places higher demands on audio fidelity, instruction following, and context understanding, and it is closer to real-world applications such as podcast post-production, music mixing, and voice personalization.

Testing shows that current mainstream models achieve an overall Exact Match Rate (EMR) below 5%, revealing a significant gap in reliable audio editing. In practical editing tasks, this means models tend to over-modify, miss parts of the instruction, or degrade the quality of the original audio.
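The article does not define how EMR is computed. As a rough illustration only, the sketch below assumes that each edited sample is checked against a list of pass/fail rubric items and that a sample counts as an "exact match" only when every item passes. The `SampleResult` record, its field names, and the example data are hypothetical and are not taken from the MMAE release.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    """Hypothetical per-sample record: one pass/fail flag per rubric item."""
    sample_id: str
    rubric_passed: List[bool]


def exact_match_rate(results: List[SampleResult]) -> float:
    """Fraction of samples whose rubric items ALL passed (assumed EMR definition)."""
    if not results:
        return 0.0
    exact = sum(1 for r in results if r.rubric_passed and all(r.rubric_passed))
    return exact / len(results)


def rubric_pass_rate(results: List[SampleResult]) -> float:
    """Fraction of individual rubric items passed, pooled over all samples."""
    total = sum(len(r.rubric_passed) for r in results)
    if total == 0:
        return 0.0
    passed = sum(sum(r.rubric_passed) for r in results)
    return passed / total


# Made-up example data, for illustration only.
results = [
    SampleResult("a", [True, True, True]),   # every item passes: exact match
    SampleResult("b", [True, False, True]),  # one miss: not an exact match
    SampleResult("c", [False, False]),
    SampleResult("d", [True, True]),         # exact match
]

print(exact_match_rate(results))   # 2 of 4 samples are exact matches
print(rubric_pass_rate(results))   # 7 of 10 rubric items pass
```

Under this assumed definition, a model can pass most individual rubric items and still score a low EMR, because a single over-modification or missed instruction disqualifies the whole sample.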

Highlights of the MMAE Benchmark: Multi-dimensional Evaluation for Real-World Scenarios

The MMAE benchmark is designed to be both broad and rigorous. Its core elements are:

  • 2000 high-fidelity samples: All are drawn from real-world scenarios, keeping the evaluation practical and diverse.
  • 17,741 fine-grained evaluation items: These form a detailed rubric scoring system for objective quantification, about 8.9 items per sample on average.
  • 7 modality settings: These cover sound, music, speech, and their mixtures, supporting tests in complex audio environments.
  • 6 levels of task complexity: These range from basic modifications to multi-hop reasoning and multi-turn editing, probing model capabilities comprehensively.
  • 8 types of operations: These span editing at different granularities, both local and global, and test how finely a model can control its edits.

AIbase Comment: MMAE is not only a technical evaluation tool but also an important milestone in the shift of audio AI from "generative" to "editable". It gives researchers and developers a unified standard and is expected to accelerate the iteration of next-generation audio editing models.

Future Outlook: Audio Editing May Become the Core Competitiveness of AI Multimodal Systems

With the rapid development of multimodal large models, accurate audio editing is expected to play a key role in content creation, film post-production, and accessibility assistance. The collaboration between Tencent Hy and the partner institutions demonstrates China's leading position in audio AI research. The industry now looks to more open-source resources and follow-up models to close this technological gap.