Baidu ERNIE launches PaddleOCR-VL-1.6: 96.33% accuracy sets a new SOTA for document parsing

Published in Latest AI News

Time: Jun 2, 2026

Read: 3 minutes

Baidu has officially released PaddleOCR-VL-1.6, a model derived from the ERNIE large model. On the OmniDocBench v1.6 benchmark it achieved an accuracy of 96.33%, outperforming mainstream models such as Gemini-3-Pro, GPT-5.2, and GLM-OCR, setting a new industry SOTA and ranking first overall. The release marks a significant advance in the ability of multimodal large models to parse complex documents in real-world scenarios.

As a core component of the ERNIE large model's multimodal capabilities, PaddleOCR is built on ERNIE, currently supports more than 100 languages, and has users in more than 170 countries and regions. PaddleOCR-VL-1.6 keeps its lightweight 0.9B-parameter architecture while substantially improving recognition in difficult cases such as tables, ancient texts, rare characters, seals, and charts. Baidu attributes these gains to a model-driven data construction pipeline combined with progressive training optimization.

On Real5-OmniDocBench, which evaluates real-world complex scenarios, the model also ranked first with an overall score of 93.19%, handling well-known parsing challenges such as scanned documents, curved pages, photos of screens, uneven lighting, and skewed documents.

Because the architecture is unchanged from the previous version, enterprises and developers can migrate without additional adaptation. PaddleOCR has now passed 79.2K stars on GitHub, overtaking Google's Tesseract OCR to become the most popular open-source OCR project worldwide. The new model is available on the official website, and its code and weights are open source. As large models move toward deeper multimodal capabilities, PaddleOCR-VL-1.6 offers a more efficient industrial-grade solution for document digitization and is expected to further accelerate the deployment of AI in complex multimodal scenarios.
