On January 29, following its successive releases of spatial-perception and VLA foundation models, Ant Lingbo Technology again moved ahead of industry expectations by open-sourcing the world model LingBot-World. The model matches Google's Genie 3 on key metrics such as video quality, dynamism, long-term consistency, and interactivity, and aims to provide a high-fidelity, highly dynamic, real-time controllable "digital training ground" for embodied intelligence, autonomous driving, and game development.

(Figure description: LingBot-World ranks at the top tier of the industry in application scenarios, generation duration, dynamism, and resolution)
LingBot-World targets the most common failure mode in video generation, long-horizon drift, in which objects deform, details collapse, subjects disappear, or scene structures break down as generation runs on. Through multi-stage training and parallelized acceleration, it sustains nearly 10 minutes of continuous, stable, degradation-free generation, supporting complex long-sequence, multi-step tasks.
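The release does not detail the anti-drift mechanism, but the behavior it describes maps naturally onto chunked autoregressive rollout over a persistent context. The following is a minimal sketch under that assumption; `MemoryBank` and the `generate_chunk` call are hypothetical names, not LingBot-World's actual interfaces.

```python
# A minimal sketch of long-horizon rollout with a persistent memory buffer.
# All names here are hypothetical illustrations, not LingBot-World's API.
from collections import deque

class MemoryBank:
    """Bounded store of past latent frames used as long-term context."""
    def __init__(self, capacity=256):
        self.frames = deque(maxlen=capacity)

    def add(self, latents):
        self.frames.extend(latents)

    def context(self):
        return list(self.frames)

def rollout(model, first_frame_latent, num_chunks, chunk_len=16):
    """Generate video chunk by chunk, re-conditioning each chunk on the
    memory bank so distant content stays consistent instead of drifting."""
    memory = MemoryBank()
    memory.add([first_frame_latent])
    video = [first_frame_latent]
    for _ in range(num_chunks):
        # Each chunk is conditioned on the accumulated context, not just the
        # immediately preceding frame, which is what suppresses drift.
        chunk = model.generate_chunk(context=memory.context(), length=chunk_len)
        memory.add(chunk)
        video.extend(chunk)
    return video
```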
In terms of interactive performance, LingBot-World reaches a generation throughput of roughly 16 FPS while keeping end-to-end interaction latency under 1 second. Users can steer characters and camera viewpoints in real time with keyboard or mouse and get immediate visual feedback. They can also trigger environmental changes and world events through text, such as adjusting the weather, switching the visual style, or spawning specific events, while the scene's geometric relationships remain largely consistent.
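To make those numbers concrete, here is a minimal sketch of an interaction loop paced to the reported ~16 FPS with a sub-second latency check; `session.step` and `session.apply_event` are assumed stand-ins for the model's inference endpoint, not a documented API.

```python
import time

TARGET_FPS = 16
FRAME_BUDGET = 1.0 / TARGET_FPS  # about 62.5 ms per frame

def interact(session, read_input, show, max_steps=1000):
    """Drive the world model with live input, paced to ~16 FPS."""
    for _ in range(max_steps):
        t0 = time.monotonic()
        action = read_input()         # e.g. {"keys": ["W"], "mouse_dx": 4}
        frame = session.step(action)  # model renders the next frame (assumed API)
        show(frame)                   # immediate visual feedback
        elapsed = time.monotonic() - t0
        if elapsed > 1.0:
            print("warning: end-to-end latency exceeded the 1 s budget")
        time.sleep(max(0.0, FRAME_BUDGET - elapsed))

# Text-driven world events (weather, style, scripted events) would go through
# a separate channel, e.g. session.apply_event("heavy rain") -- hypothetical.
```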

(Figure description: Consistency stress test: the camera looks away for up to 60 seconds and then returns; the target object still exists and retains its structure)

(Figure description: In a highly dynamic environment, the camera moves away for an extended time and returns; the vehicle's shape and appearance remain consistent)

(Figure description: After the camera moves away for a long time and returns, the house still exists and maintains structural consistency)
The model also generalizes zero-shot: given just one real photo (such as a city street view) or a game screenshot, it can generate an interactive video stream without additional training or per-scene data collection, cutting the cost of deploying it across different scenarios.
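As a usage illustration, a zero-shot session might be seeded roughly as below. The loader and `start_stream` call are assumptions made for readability; consult the released inference code for the real entry points.

```python
from PIL import Image

def start_session(model_cls, image_path, prompt=None):
    """Seed an interactive stream from one photo or game screenshot."""
    model = model_cls.from_pretrained("lingbot-world")  # assumed loader name
    seed = Image.open(image_path).convert("RGB")        # the single seed image
    # No fine-tuning or per-scene data collection: the image alone defines the world.
    return model.start_stream(image=seed, text=prompt)  # assumed entry point

# e.g. stream = start_session(LingBotWorld, "street_view.jpg")
```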
To address the scarcity of high-quality interactive data for world-model training, LingBot-World uses a hybrid data strategy. On one hand, it cleans large-scale web videos to cover diverse scenarios; on the other, it combines in-game capture with an Unreal Engine (UE) synthesis pipeline, extracting clean, UI-free frames directly from the rendering layer while simultaneously recording control inputs and camera poses, which gives the model precisely aligned supervision for learning how actions change the environment.
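The aligned signal described here, a frame plus its simultaneous controls and camera pose, can be pictured as a record like the following; the field names are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class AlignedSample:
    frame_path: str                          # UI-free RGB frame from the render layer
    action: dict                             # e.g. {"keys": ["W"], "mouse_dx": 3, "mouse_dy": 0}
    camera_pos: Tuple[float, float, float]   # world-space camera position
    camera_rot: Tuple[float, float, float]   # camera rotation (pitch, yaw, roll)
    timestamp: float                         # matches each frame to its inputs exactly

# Training pairs then take the form (frame_t, action_t) -> frame_{t+1}:
# the precise supervision for learning "how actions change the environment".
```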
Large-scale deployment of embodied intelligence faces a core challenge: real-robot training data for complex long-horizon tasks is extremely scarce. With its long-horizon consistency (in effect, memory), real-time interactive response, and grasp of the causal link between actions and environmental change, LingBot-World can "imagine" the physical world inside the digital one, giving agents a low-cost, high-fidelity space in which to experiment and learn. Its support for diverse scene generation (such as varied lighting and object placement) also helps embodied-intelligence algorithms generalize to real-world settings.
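Conceptually, an embodied policy would practice inside the imagined world roughly as sketched below, with per-episode randomization standing in for the diverse scene generation just described; `world` and `policy` are hypothetical interfaces, not part of the release.

```python
import random

def train_in_imagination(world, policy, episodes=100, horizon=200):
    """Let a policy practice inside the world model's imagined rollouts."""
    for _ in range(episodes):
        # Per-episode randomization of lighting and object placement mirrors
        # the diverse scene generation described above.
        obs = world.reset(lighting=random.choice(["day", "dusk", "night"]),
                          layout_seed=random.randint(0, 10**6))
        for _ in range(horizon):
            action = policy.act(obs)
            obs, reward, done = world.step(action)  # the model "imagines" the outcome
            policy.learn(obs, action, reward)
            if done:
                break
```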
With the three "Lingbo" models for the embodied domain now released, Ant's AGI strategy has made a critical extension from the digital world to physical perception, and the full-stack path of "foundation models - general applications - physical interaction" has come into focus. Ant is opening all of these models through the InclusionAI community and working with the industry to explore the boundaries of AGI; an AGI ecosystem built on open source and open collaboration, and aimed at real-world scenarios, is rapidly taking shape.
Currently, the model weights and inference code of LingBot-World are available to the community.
