In May 2026, X Square Robot officially announced the open-source release of its latest VLA (Vision-Language-Action) model, Wall-OSS-0.5, a notable development for embodied intelligence in China. The model departs from the long-standing industry practice of fine-tuning on a task shortly before being tested on it, much like cramming before an exam. Instead, it is deployed "zero-shot" on real robots, without any task-specific fine-tuning.


Breaking the Industry Stalemate: From "Custom Scripts" to "General Intelligence"

The field of embodied intelligence has long faced an unspoken dilemma: most models need large-scale fine-tuning on a specific task before they are evaluated on it. This makes it hard to tell whether a model truly has the generalization ability of a "general brain" or has simply memorized a task-specific "operational script."

X Square Robot offers a new answer with Wall-OSS-0.5. The model was pre-trained on more than 20 robot embodiments, millions of trajectories, and a multimodal corpus of 90 million entries. Without any task-specific fine-tuning, the team deployed it directly on real robots and tested it on 17 challenging tasks spanning semantic understanding, rigid and flexible object manipulation, and precision operations.

Key Highlights: A Leap in Pre-training Model Performance

Test data shows that Wall-OSS-0.5 exceeded expectations:

  • Zero-Shot Deployment Capability: Without fine-tuning, a checkpoint pre-trained for 400k steps scored above 80 out of 100 on four of the 17 zero-shot tasks, including 82 on "tightening the rope," a flexible-object task never seen during pre-training.

  • Significantly Higher Fine-tuning Ceiling: In scenarios that do call for targeted fine-tuning, Wall-OSS-0.5 learned efficiently. Against the industry baseline π0.5 and under the same data budget, Wall-OSS-0.5 led by an average of 17.5 points and showed almost an order-of-magnitude improvement in success rate on precision tasks such as precise insertion.

  • "Capability Reshaping" Rather Than Degradation: Experiments show that after intensive action training, the model's multimodal perception capabilities not only remained intact but were also "reshaped" in the areas of visual grounding and reasoning.

Four Key Technologies Build a Moat

The performance of Wall-OSS-0.5 rests on four core technical innovations from the team:

  1. Gradient Bridging: Directly injecting action supervision signals into the pre-training backbone, enabling the model to unify "seeing, speaking, and acting" at the underlying representation level.

  2. Visual Alignment Tokenizer: Ensuring each action token carries clear visual semantics, so that the model can reason about what an action physically means.

  3. Action Space Supervision: Focusing the training on the overall structure of the trajectory rather than trivial high-frequency details, significantly improving convergence efficiency.

  4. DMuon Distributed Optimization: Through low-level system optimization, the research team reduced heterogeneous computing costs by a factor of 100, making this complex training recipe practical on large-scale clusters.
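To make item 3 (Action Space Supervision) more concrete, here is a minimal sketch of one way a loss could emphasize a trajectory's overall structure over high-frequency detail. It is an illustration only, not the team's method: the announcement does not describe the actual loss, and the moving-average filter, the function names, and the toy data are all assumptions.

```python
from typing import List


def moving_average(traj: List[float], window: int = 5) -> List[float]:
    """Low-pass filter a 1-D trajectory with a centered moving average.

    Near the edges the window is truncated to the available samples.
    """
    half = window // 2
    out = []
    for i in range(len(traj)):
        lo = max(0, i - half)
        hi = min(len(traj), i + half + 1)
        out.append(sum(traj[lo:hi]) / (hi - lo))
    return out


def mse(a: List[float], b: List[float]) -> float:
    """Plain mean squared error between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def structure_loss(pred: List[float], target: List[float], window: int = 5) -> float:
    """Hypothetical 'structure' loss: MSE computed after low-pass filtering,
    so fast jitter contributes little while slow drift still counts."""
    return mse(moving_average(pred, window), moving_average(target, window))


# Toy data (made up for illustration): a straight-line reach as the target,
# one prediction with alternating jitter, one with a constant offset.
target = [0.1 * t for t in range(20)]
jittery = [x + (0.05 if t % 2 == 0 else -0.05) for t, x in enumerate(target)]
offset = [x + 0.05 for x in target]

raw_jitter = mse(jittery, target)
raw_offset = mse(offset, target)
smooth_jitter = structure_loss(jittery, target)
smooth_offset = structure_loss(offset, target)
```

In this toy setup the raw MSE treats the jittery and the offset predictions as equally wrong, while the filtered loss largely ignores the jitter and still penalizes the offset in full. That is the general intuition behind supervising trajectory structure rather than per-step noise.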

A Milestone in Embodied Intelligence

X Square Robot has now fully open-sourced the model weights, training code, and dataset interfaces for Wall-OSS-0.5.

Industry analysts point out that Wall-OSS-0.5 is more than a routine model update: it redefines the development paradigm of embodied intelligence, shifting the goal from "single-task success rate" alone to "general physical intuition transfer." For researchers and developers, this marks the start of a new phase for embodied intelligence foundation models, one in which results are reproducible, verifiable, and open to challenge, and which is expected to greatly accelerate the deployment of general-purpose robots in complex real-world environments.