On May 24, the ByteDance Seed team and the Hong Kong University of Science and Technology released new research on long-document training for large multimodal models (LMMs). The researchers built a model called MMProLong on top of Alibaba's open-source Qwen2.5-VL and report substantial gains in the efficiency of long-document processing. The study departs from the conventional approach to long-context training in multimodal models and shows that the way training data is organized has a key impact on a model's long-context capability.

The core finding addresses a pain point in current LMM training: for multimodal long documents, targeted question-answering (QA) training is significantly more effective than traditional optical character recognition (OCR) transcription. Experiments show that using pure text transcription as the training task not only fails to improve the model's ability to locate content in long contexts but also degrades performance. By contrast, training on long-context QA pairs generated by an independent model (such as ByteDance Seed2.0) guides the model to accurately retrieve the target passage from lengthy, distracting input.
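To make the contrast concrete, the two kinds of training sample can be sketched roughly as follows. This is an illustrative sketch only, not the researchers' actual data pipeline: the names `TrainingSample`, `transcription_sample`, and `qa_sample` are hypothetical, plain strings stand in for page images, and padding the evidence page with distractor pages is an assumption made here for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    pages: List[str]  # text stand-ins for document page images (assumption)
    prompt: str
    target: str


def transcription_sample(pages: List[str]) -> TrainingSample:
    # OCR-style task: the supervision target is the full text of every page.
    return TrainingSample(
        pages=list(pages),
        prompt="Transcribe the document.",
        target="\n".join(pages),
    )


def qa_sample(evidence_page: str, distractor_pages: List[str],
              question: str, answer: str, seed: int = 0) -> TrainingSample:
    # QA task: place the evidence page at a random position among distractors,
    # so the short answer can only be produced by locating that page.
    rng = random.Random(seed)
    pages = list(distractor_pages)
    pages.insert(rng.randint(0, len(pages)), evidence_page)
    return TrainingSample(pages=pages, prompt=question, target=answer)


if __name__ == "__main__":
    distractors = [f"Filler page {i}." for i in range(5)]
    evidence = "The warranty period is 24 months."
    ocr = transcription_sample(distractors + [evidence])
    qa = qa_sample(evidence, distractors,
                   "How long is the warranty period?", "24 months")
    # The transcription target grows with the document; the QA target stays short.
    print(len(ocr.target), len(qa.target))
```

In this toy setup the transcription target grows with document length, while the QA target stays short and depends on finding a single page, which is the retrieval behavior the study credits for the improvement.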

With this strategy, MMProLong shows strong long-context stability on a limited training budget of only 128,000 tokens, maintaining performance even when the input length reaches 256,000 or 512,000 tokens. It significantly outperforms larger open-source models such as InternVL3-38B and Gemma3-27B on the MMLongBench and MM-NIAH (Needle-in-a-Haystack) benchmarks. In addition, MMProLong's multimodal capabilities transfer to long-video understanding tasks it was not specifically trained on, and the strategy was also validated on the Qwen3-VL-8B model.

The study offers the large-model industry an alternative development path to that of DeepSeek, which upgrades the architecture by highly compressing and re-ordering visual information. It shows that long-context capability can be significantly improved by optimizing the structure of training data rather than modifying the underlying architecture, opening a more economical and efficient route for future work on longer modalities and multi-step agents.