Recently, ByteDance (ByteDance Research) officially open-sourced Lance, its natively unified multimodal large model.

In an AI industry where models often run to hundreds of billions or even trillions of parameters, or are simply "assembled" from separate components, Lance stands out on two counts: it covers the full range of functions with only 3B (3 billion) activated parameters, and it breaks down the long-standing technical barrier between "understanding models" (VLMs) and "generative models" (DiT/diffusion).

Key Highlights:

  • Natively Unified: No "stitching". The model is trained from scratch and integrates image/video understanding, generation, and cross-modal editing into a single system.

  • All-Round Performance: A single model covers the three core output tasks: $X \rightarrow T$ (text output, for image and video understanding), $X \rightarrow I$ (image generation/editing), and $X \rightarrow V$ (video generation/editing).

  • Open Source & Free: Released under the permissive Apache 2.0 license. The weights are fully available on Hugging Face, and the entire training process fits within a budget of 128 A100 GPUs.

Technical Breakdown: How Does It Reconcile Two Conflicting Demands?

In traditional AI architectures, the "understanding" and "generation" capabilities of large models pull in opposite directions: understanding tasks need to discard noise and extract high-level semantic features, while generation tasks need to attend to low-level textures, geometric structure, and temporal dynamics.

To tackle this industry-wide challenge, Lance adopts a clever design of "shared context + decoupled parallel capabilities":

1. Unified Interleaved Sequence and Dual-Stream Expert Architecture

Before entering the model, all text, image, and video inputs are tokenized and merged into a unified "interleaved sequence". This sequence is then fed into a Dual-Stream MoE (Mixture-of-Experts) architecture in which the experts responsible for "understanding" and "generation" work separately, which resolves the conflict between the two capabilities.

  • Understanding Side: Text tokens go through the embedding layer of Qwen2.5-VL and visual inputs go through its ViT encoder, which extracts high-level semantic visual tokens.

  • Generation Side: Visual input is compressed by the 3D causal VAE from Wan2.2, with $16\times$ spatial down-sampling and $4\times$ temporal down-sampling, yielding a continuous latent representation that preserves fine detail and motion.
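To make the down-sampling factors concrete, here is a minimal arithmetic sketch of the latent grid size on the generation side. It assumes the common causal-VAE convention in which the first frame is encoded on its own, so $T$ input frames become $1 + (T-1)/4$ latent frames; the article does not state this convention, and the function name, resolution, and frame count are hypothetical.

```python
def latent_shape(frames, height, width, spatial=16, temporal=4):
    """Return (latent_frames, latent_height, latent_width) for one clip.

    Assumption (not from the article): causal convention, i.e. the first
    frame is encoded alone and every further `temporal` frames add one
    latent frame.
    """
    if (frames - 1) % temporal != 0:
        raise ValueError("frames must equal 1 + k * temporal")
    if height % spatial or width % spatial:
        raise ValueError("height and width must be multiples of spatial")
    latent_frames = 1 + (frames - 1) // temporal
    return latent_frames, height // spatial, width // spatial


# Hypothetical inputs: an 81-frame 480x832 clip and a single 512x512 image.
clip = latent_shape(frames=81, height=480, width=832)
image = latent_shape(frames=1, height=512, width=512)
print(clip)   # (21, 30, 52)
print(image)  # (1, 32, 32)
```

Under these assumptions a single image is simply the one-frame case, which is one reason a causal VAE lets images and videos share the same latent space.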

2. MaPE (Modality-Aware Rotary Position Encoding)

When a long sequence mixes tokens from text, images, and videos, the model can easily suffer from "boundary confusion" hallucinations. Lance introduces the MaPE mechanism, which adds a fixed temporal offset to each modality group. This lets the model identify spatial and temporal boundaries reliably without disrupting the internal spatial structure and temporal order of images and videos.
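The idea of a fixed temporal offset per modality group can be illustrated with a small sketch of (t, h, w) position IDs. This is only an illustration under assumptions: the offset values, the `MODALITY_OFFSET` table, and the `position_ids` helper are hypothetical, and the article does not say how Lance chooses offsets or feeds them into the rotary encoding.

```python
# Hypothetical offsets; the article gives no concrete values.
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def position_ids(modality, frames=1, height=1, width=1, length=1):
    """Return a list of (t, h, w) position IDs for one segment.

    Only the t axis is shifted, so the relative layout inside a segment
    (rows, columns, frame order) is unchanged by the offset.
    """
    base = MODALITY_OFFSET[modality]
    if modality == "text":
        # Text is 1-D: it advances along t only.
        return [(base + i, 0, 0) for i in range(length)]
    return [
        (base + t, h, w)
        for t in range(frames)
        for h in range(height)
        for w in range(width)
    ]


text = position_ids("text", length=3)
image = position_ids("image", height=2, width=2)
video = position_ids("video", frames=2, height=1, width=2)
print(text)   # [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
print(image)  # [(1000, 0, 0), (1000, 0, 1), (1000, 1, 0), (1000, 1, 1)]
print(video)  # [(2000, 0, 0), (2000, 0, 1), (2001, 0, 0), (2001, 0, 1)]
```

Because the t ranges of the three groups do not overlap, tokens from different modalities can be told apart by position alone, while the differences between positions inside one image or video stay exactly as they would be without the offset.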

[Unified Interleaved Sequence] ──► [MaPE Modal Boundary Isolation] ──► [Dual-Stream Expert Architecture (MoE)]
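The flow in the diagram can be sketched as a toy routing loop. This is a simplified illustration, not Lance's implementation: it assumes tokens are routed by which encoder produced them (text and ViT tokens to the understanding expert, VAE latents to the generation expert), it omits the attention over the shared context entirely, and every name in it is hypothetical.

```python
def understanding_expert(token):
    # Stand-in for the understanding-side feed-forward expert.
    return f"U({token})"


def generation_expert(token):
    # Stand-in for the generation-side feed-forward expert.
    return f"G({token})"


EXPERTS = {
    "understanding": understanding_expert,
    "generation": generation_expert,
}


def build_interleaved(segments):
    """Flatten (stream, tokens) segments into one sequence, keeping order."""
    return [(stream, tok) for stream, tokens in segments for tok in tokens]


def route(sequence):
    """Send each token to the expert of its stream; order is preserved."""
    return [EXPERTS[stream](tok) for stream, tok in sequence]


segments = [
    ("understanding", ["txt0", "txt1"]),  # text embeddings
    ("understanding", ["vit0"]),          # ViT visual tokens
    ("generation", ["vae0", "vae1"]),     # VAE latents
]
out = route(build_interleaved(segments))
print(out)  # ['U(txt0)', 'U(txt1)', 'U(vit0)', 'G(vae0)', 'G(vae1)']
```

The point of the sketch is that both experts see one sequence in one order, while their parameters stay separate.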

Four-Phase Extreme Training: A "Lean Battle" with 128 GPUs

Compared with the "brute-force aesthetics" of major companies burning through thousands of GPUs, Lance's training process is notably frugal. The entire lifecycle stays within a maximum budget of 128 GPUs and advances through four tightly connected phases:

  • Phase 1: Pre-training (1.5T Tokens): Consume 1B image-text pairs and 140M video-text pairs to lay a solid foundation for multimodal understanding.

  • Phase 2: Continued Training (300B Tokens): Introduce editing, subject-driven generation, and multimodal understanding data to activate multi-task synergy.

  • Phase 3: Supervised Fine-tuning (72B Tokens): Train extensively on human instruction data, focusing on instruction following and visual identity consistency.

  • Phase 4: Reinforcement Learning (GRPO): Use Group Relative Policy Optimization and, notably, adopt PaddleOCR as a reward model to target common failures such as inaccurate text rendering and misalignment between text and images in generated pictures.
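The group-relative step in Phase 4 can be sketched as follows. This is a minimal illustration under assumptions, not Lance's training code: the OCR step is represented by plain strings rather than a real PaddleOCR call, the string-similarity reward is a hypothetical stand-in for whatever scoring the authors use, and only the advantage normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def text_reward(ocr_text, target_text):
    """Similarity in [0, 1] between OCR output and the requested text.

    Hypothetical reward: the article only says PaddleOCR serves as the
    reward model, not how its output is scored.
    """
    return SequenceMatcher(None, ocr_text.lower(), target_text.lower()).ratio()


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Strings an OCR system might read from four images generated for one prompt.
target = "OPEN 24 HOURS"
ocr_outputs = ["OPEN 24 HOURS", "OPEN 24 HOUR5", "OPFN Z4 HQURS", ""]

rewards = [text_reward(o, target) for o in ocr_outputs]
advantages = group_relative_advantages(rewards)
for o, r, a in zip(ocr_outputs, rewards, advantages):
    print(f"{o!r:18} reward={r:.2f} advantage={a:+.2f}")
```

Samples whose rendered text matches the prompt get a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.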

Outstanding Results: A 3B Model Beats 7B Giants Across Multiple Domains

Thanks to cross-task data synergy (the model deepens its understanding while learning to generate, and improves its generation while learning to understand), the 3B Lance posts strong results on several demanding benchmarks:

  • Video Generation (VBench): 85.11 points. This surpasses the comparable all-in-one model TUNA (84.06) and also outperforms specialized video generation models such as HunyuanVideo (83.33) and Wan2.1-T2V (83.69).

  • Image Generation (GenEval): 0.90, placing it among the top open-source models worldwide.

  • Video Understanding (MVBench): 62.0 points, well ahead of Show-o2 (7B, 55.7 points), a model more than twice the size of Lance.
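For a quick sense of scale, the margins implied by the scores quoted above can be computed directly. The numbers are copied from the list; the `scores` table and `margins` helper are just illustrative names.

```python
# Scores as quoted in the article.
scores = {
    "VBench": {
        "Lance": 85.11,
        "TUNA": 84.06,
        "HunyuanVideo": 83.33,
        "Wan2.1-T2V": 83.69,
    },
    "MVBench": {"Lance": 62.0, "Show-o2": 55.7},
}


def margins(benchmark, ours="Lance"):
    """Return how many points `ours` leads each other model by."""
    table = scores[benchmark]
    return {
        name: round(table[ours] - value, 2)
        for name, value in table.items()
        if name != ours
    }


print(margins("VBench"))   # {'TUNA': 1.05, 'HunyuanVideo': 1.78, 'Wan2.1-T2V': 1.42}
print(margins("MVBench"))  # {'Show-o2': 6.3}
print(round(7 / 3, 2))     # 2.33, the 7B-to-3B parameter ratio
```

The VBench leads are between one and two points, while the MVBench gap is about six points against a model with roughly 2.3 times the parameters.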

Industry Shock: Deployment Costs for Multimodal Applications Will Plunge Dramatically

The open-sourcing of Lance is a significant disruption for the industry, especially for the booming fields of AI short films, multi-agent collaboration, and interactive media.

Previously, building an AI tool that could understand scripts, generate storyboards, and modify visuals in real time while keeping characters consistent meant deploying, scheduling, and stitching together several large models in the background (a VLM for semantics, a diffusion model for image generation, and another model for temporal video). This made the system sluggish, and simply aligning the pipelines between the models was enough to drive developers crazy.

Now, Lance3B achieves "left eye sees, right eye edits, both hands create" with a single model