Embodied AI took a notable step forward today: Xiaomi has officially open-sourced Xiaomi-Robotics-0, its first-generation robot model. With 4.7 billion parameters, the model targets the sluggish robot motion caused by inference latency in existing VLA (Vision-Language-Action) models, aiming for real-time inference and efficient generalization on consumer-grade GPUs.

Core Architecture: Collaboration between Brain and Cerebellum

To balance general understanding with high-frequency control, Xiaomi-Robotics-0 adopts a hybrid Mixture-of-Transformers (MoT) architecture:

  • Vision-Language Brain (VLM): Serving as the backbone, it interprets ambiguous human instructions and captures spatial relationships from high-resolution visual input.

  • Action Execution Cerebellum (Action Expert): Built from stacked Diffusion Transformer (DiT) layers, it generates precise "action chunks" via flow matching, ensuring smooth, flexible physical execution (see the sketch after this list).
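
To make the brain-cerebellum split concrete, here is a minimal PyTorch sketch of the idea: a DiT-style action expert that cross-attends to features from the frozen vision-language brain and turns Gaussian noise into an action chunk by integrating a learned flow-matching velocity field. All dimensions (`D_MODEL`, `CHUNK_LEN`, `ACTION_DIM`) and layer counts are illustrative assumptions, not the published Xiaomi-Robotics-0 configuration.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only -- the real Xiaomi-Robotics-0 configuration is
# not reproduced here. ACTION_DIM=14 assumes a dual-arm setup (7 DoF per arm).
D_MODEL, CHUNK_LEN, ACTION_DIM = 512, 16, 14

class ActionExpert(nn.Module):
    """DiT-style 'cerebellum': predicts a flow-matching velocity field for an
    action chunk, conditioned on features from the frozen VLM 'brain'."""
    def __init__(self):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, D_MODEL), nn.SiLU(), nn.Linear(D_MODEL, D_MODEL))
        self.proj_in = nn.Linear(ACTION_DIM, D_MODEL)
        layer = nn.TransformerDecoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=6)
        self.proj_out = nn.Linear(D_MODEL, ACTION_DIM)

    def forward(self, noisy_actions, t, vlm_features):
        h = self.proj_in(noisy_actions) + self.time_embed(t[:, None])[:, None, :]
        h = self.blocks(h, memory=vlm_features)  # cross-attend to the brain's features
        return self.proj_out(h)                  # predicted velocity, same shape as actions

@torch.no_grad()
def sample_action_chunk(expert, vlm_features, steps=10):
    """Flow-matching sampling: Euler-integrate the velocity field from
    Gaussian noise (t=0) to a clean action chunk (t=1)."""
    x = torch.randn(vlm_features.size(0), CHUNK_LEN, ACTION_DIM)
    for i in range(steps):
        t = torch.full((x.size(0),), i / steps)
        x = x + expert(x, t, vlm_features) / steps
    return x
```

Generating a whole chunk per call, rather than one action per forward pass, is what lets the fast control loop run ahead of the slower vision-language backbone.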

Training Recipe: A Two-Stage Evolution

The Xiaomi R&D team balances the model's common-sense understanding against its physical manipulation skill through a carefully designed two-stage recipe:

  1. Cross-modal Pre-training: An Action Proposal mechanism lets the VLM retain its logical reasoning ability while aligning the feature space with the action space. The VLM is then frozen, and the DiT alone is trained to generate smooth action sequences (a training-step sketch follows this list).

  2. Post-training: To eliminate "action discontinuity" during real-robot operation, the model runs in an asynchronous inference mode. Combining a Clean Action Prefix (which keeps the trajectory continuous across chunks) with a Λ-shape Attention Mask (which forces attention onto the current visual feedback) gives the robot agile responses to sudden environmental changes (see the mask sketch below).
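
For stage 1, here is a minimal sketch of the flow-matching objective that the frozen-VLM / trainable-DiT setup would optimize, reusing the hypothetical `ActionExpert` from the architecture sketch above. The linear noise-to-action path and constant-velocity target follow standard rectified-flow practice and are an assumption about the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(expert, vlm_features, actions):
    """One training step for the action expert while the VLM stays frozen.
    `actions` is a ground-truth chunk of shape (B, CHUNK_LEN, ACTION_DIM);
    the target is the constant velocity along a linear noise-to-action path."""
    noise = torch.randn_like(actions)
    t = torch.rand(actions.size(0))                                    # t ~ U[0, 1]
    x_t = (1 - t)[:, None, None] * noise + t[:, None, None] * actions  # interpolate
    target_velocity = actions - noise
    return F.mse_loss(expert(x_t, t, vlm_features), target_velocity)
```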
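
For stage 2, here is a sketch of what a Λ-shape attention mask could look like: every token attends to the current observation tokens (the Λ's wide base), while action tokens, including the clean prefix carried over from the previous chunk, attend causally among themselves. The token layout and mask construction are assumptions for illustration, not the paper's exact design.

```python
import torch

def lambda_attention_mask(n_obs, n_prefix, n_new):
    """Boolean attention mask (True = may attend) over a token sequence laid out
    as [current observation | clean action prefix | newly generated actions].
    Every token sees the fresh visual feedback; action tokens attend causally,
    so new actions stay consistent with the prefix already being executed."""
    n_act = n_prefix + n_new
    mask = torch.zeros(n_obs + n_act, n_obs + n_act, dtype=torch.bool)
    mask[:, :n_obs] = True                                   # the wide base of the mask
    mask[n_obs:, n_obs:] = torch.tril(
        torch.ones(n_act, n_act, dtype=torch.bool))          # causal over actions
    return mask
```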

Practical Performance: Breaking Multiple SOTA Records

In evaluation, Xiaomi-Robotics-0 posted strong results both in simulation and on real hardware:

  • Simulation Benchmarks: Across three major simulation suites, LIBERO, CALVIN, and SimplerEnv, it outperformed 30 comparison models and set new state-of-the-art (SOTA) results.

  • Real-Robot Generalization: On a dual-arm robot platform, whether disassembling blocks or folding deformable towels, the model showed strong hand-eye coordination and physical generalization.

Open Source Ecosystem

Xiaomi has opened up the full set of technical resources, including a technical homepage, the source code, and model weights on Hugging Face, aiming to push the boundaries of embodied intelligence together with the community:

  • Technical Homepage: https://xiaomi-robotics-0.github.io
  • Open Source Code: https://github.com/XiaomiRobotics/Xiaomi-Robotics-0
  • Model Weights: https://huggingface.co/XiaomiRobotics
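
As a quick-start sketch, the released weights can presumably be fetched with the standard `huggingface_hub` client. The exact repository name under the XiaomiRobotics organization is an assumption here (inferred from the GitHub URL), so check the org page first.

```python
from huggingface_hub import snapshot_download

# Fetch the released weights locally. The repository name is an assumption --
# verify it on the XiaomiRobotics organization page before running.
local_dir = snapshot_download(repo_id="XiaomiRobotics/Xiaomi-Robotics-0")
print(f"Weights downloaded to: {local_dir}")
```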