ByteDance Collaborates with HKUST to Release MMProLong: In Long-Document LMM Training, QA Pairs Are Far More Effective than OCR Transcription

On May 24, the ByteDance Seed team and the Hong Kong University of Science and Technology (HKUST) released new research on long-document training for large multimodal models (LMMs). The researchers built a model called MMProLong on top of Alibaba's open-source Qwen2.5-VL and report substantial gains in the efficiency of long-document processing. The study departs from the conventional approach to long-context training in multimodal models and shows that how training data is organized has a key effect on a model's long-context capability.

The core finding addresses a pain point in current LMM training: for multimodal long documents, targeted question-answering (QA) training is significantly more effective than traditional optical character recognition (OCR) transcription. Experiments show that using plain text transcription as the training task not only fails to improve the model's ability to locate content in long contexts but also degrades performance. In contrast, training on long-context QA pairs generated by a separate model (such as ByteDance Seed2.0) teaches the model to accurately retrieve the target passage amid lengthy, distracting content.
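To make the contrast between the two training objectives concrete, here is a minimal Python sketch of how a transcription sample and a QA sample might be built from the same document. The `Page` and `TrainingSample` types, the prompt wording, and both helper functions are hypothetical illustrations, not the data format used by the MMProLong authors; the question and answer are assumed to come from a separate generator model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    image_path: str  # rendered page image shown to the model
    text: str        # ground-truth text of that page


@dataclass
class TrainingSample:
    images: List[str]
    prompt: str
    target: str


def make_ocr_sample(pages: List[Page]) -> TrainingSample:
    """Transcription objective: the target is the full text of every page."""
    return TrainingSample(
        images=[p.image_path for p in pages],
        prompt="Transcribe the text of this document.",  # hypothetical prompt
        target="\n".join(p.text for p in pages),
    )


def make_qa_sample(pages: List[Page], question: str, answer: str) -> TrainingSample:
    """QA objective: the target is a short answer that requires locating one passage."""
    return TrainingSample(
        images=[p.image_path for p in pages],
        prompt=question,
        target=answer,
    )


pages = [Page("p0.png", "Intro."), Page("p1.png", "Revenue was 42.")]
ocr = make_ocr_sample(pages)
qa = make_qa_sample(pages, "What was the revenue?", "42")
```

Both samples present the same page images; only the supervision differs. The transcription target grows with document length, while the QA target stays short and depends on finding the relevant page.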

Using this strategy, MMProLong shows strong long-context stability with a training budget limited to 128,000 tokens, maintaining performance when input length reaches 256,000 or even 512,000 tokens. On the MMLongBench and MM-NIAH (Needle-in-a-Haystack) benchmarks, it significantly outperforms larger open-source models such as InternVL3-38B and Gemma3-27B. MMProLong's capabilities also transfer to long-video understanding tasks it was not specifically trained on, and the strategy was further validated on the Qwen3-VL-8B model.
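For readers unfamiliar with needle-in-a-haystack testing, the general idea can be sketched in a few lines of Python: a target item is placed at varying depths inside contexts of growing length, and accuracy is recorded per length. This is a generic illustration, not the MM-NIAH protocol itself; `insert_needle`, `accuracy_by_length`, and the toy `truncated_reader` are hypothetical, and context length is counted in list items rather than tokens.

```python
from typing import Callable, Dict, List


def insert_needle(haystack: List[str], needle: str, depth: float) -> List[str]:
    """Place the needle at a relative depth in [0, 1] within the haystack."""
    position = round(depth * len(haystack))
    return haystack[:position] + [needle] + haystack[position:]


def accuracy_by_length(
    answer_fn: Callable[[List[str], str], str],
    filler: str,
    needle: str,
    question: str,
    expected: str,
    lengths: List[int],
    depths: List[float],
) -> Dict[int, float]:
    """Return, for each context length, the fraction of depths answered correctly."""
    results: Dict[int, float] = {}
    for n in lengths:
        hits = 0
        for d in depths:
            context = insert_needle([filler] * n, needle, d)
            if expected in answer_fn(context, question):
                hits += 1
        results[n] = hits / len(depths)
    return results


def truncated_reader(context: List[str], question: str) -> str:
    """Toy stand-in for a model that only attends to the first four items."""
    return " ".join(context[:4])


results = accuracy_by_length(
    truncated_reader,
    filler="filler page",
    needle="The code is 7391.",
    question="What is the code?",
    expected="7391",
    lengths=[2, 8],
    depths=[0.0, 0.5, 1.0],
)
```

With the toy reader, accuracy is perfect on the short context and drops on the long one, which is the kind of degradation a model with stable long-context behavior avoids.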

The study offers the large-model industry a development path different from that of DeepSeek, which upgrades the architecture by heavily compressing and reordering visual information. It shows that long-context capability can be significantly improved by optimizing the structure of the training data rather than modifying the underlying architecture, which opens a more economical and efficient route toward longer multimodal inputs and multi-step agents.

ByteDance Launches Groundbreaking AI Model Vidi2: 120 Billion Parameters, Revolutionizing Video Editing

ByteDance has launched the 120 billion parameter video understanding model Vidi2, which can process hours of raw footage, understand the narrative flow, and generate TikTok short videos or movie clips based on prompts. The core breakthrough is the fine-grained spatiotemporal grounding (STG) feature, which can identify spatiotemporal details in videos, and has the potential to revolutionize the video editing industry.

NVIDIA Unveils Multimodal LLM Describe Anything: Generating Detailed Descriptions of Specific Regions

The NVIDIA AI team has released a revolutionary multimodal large language model—Describe Anything 3B (DAM-3B)—designed for detailed, region-specific descriptions of images and videos. This model, with its innovative technology and superior performance, has generated significant discussion in the multimodal learning field, marking another milestone in AI development. Below, AIBase outlines the model's core highlights and industry impact. A breakthrough in region-specific descriptions, DAM-3B stands out for its unique ability to...

Alibaba International Open-Sources Ovis2 Multimodal Large Language Model Series in Six Versions

Ovis2 is the latest version of the Ovis model series from Alibaba's international team. Compared with the previous version, 1.6, Ovis2 brings significant improvements in data construction and training methods. It raises the capability density of small models and greatly improves chain-of-thought (CoT) reasoning through instruction fine-tuning and preference learning. Ovis2 also adds video and multi-image processing and strengthens multilingual and OCR capabilities in complex scenarios, making the model considerably more practical.

NVIDIA Launches New Visual Language Model NVEagle, Capable of Chatting with Images

NVIDIA has collaborated with several universities to introduce NVEagle, a large visual language model capable of chatting using images. NVEagle can analyze image content and provide accurate answers, such as identifying individuals in images, like Jensen Huang. The model significantly enhances the understanding of visual information by transforming images into visual tokens and combining them with text embeddings. In addressing the challenges of high-resolution image processing, the research team has constructed models like Eagle-X5-7B and Eagle-X by exploring various visual encoders and fusion strategies.
