Following its spatial-perception and embodied large models, the Ant Lingbo team has now officially open-sourced its interactive world model, LingBot-World.

LingBot-World addresses a core pain point of embodied-intelligence training: real-world data is scarce and costly to collect. By simulating physical laws in a virtual environment, it lets agents perform low-cost trial and error and transfer the learned causal relationships to the real world.
The model demonstrates several breakthrough technical features:
Long-term temporal consistency: Achieves nearly 10 minutes of continuous, stable generation. Even when the camera pans away for 60 seconds and then returns, object structure and appearance in the scene stay consistent, effectively solving the "detail collapse" problem in long video generation.
High-fidelity real-time interaction: Supports action-conditioned generation with a throughput of roughly 16 FPS and end-to-end interaction latency under 1 second. Users can alter the environment in real time via keyboard, mouse, or text commands, such as adjusting the weather or viewpoint (see the interaction-loop sketch after this list).
Zero-shot generalization: Trained with a hybrid data strategy that combines web videos with an Unreal Engine (UE) synthetic pipeline. From a single real city photo or game screenshot, the model can generate an interactive video stream without any scenario-specific fine-tuning.
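To make the interaction model concrete, here is a minimal sketch of an action-conditioned rollout loop paced at the reported ~16 FPS, seeded zero-shot from a single image. The WorldModel class, the Action fields, and method names such as reset_from_image and step are stand-ins invented for illustration; the actual inference API lives in the open-sourced repository and may differ.

```python
import time
from dataclasses import dataclass

import numpy as np

TARGET_FPS = 16                    # reported generation throughput
FRAME_BUDGET_S = 1.0 / TARGET_FPS  # ~62.5 ms per frame at 16 FPS

@dataclass
class Action:
    """One tick of user input: camera motion plus an optional text command."""
    move: tuple[float, float] = (0.0, 0.0)  # (forward, strafe)
    yaw: float = 0.0                        # camera rotation, radians
    text: str | None = None                 # e.g. "make it rain"

class WorldModel:
    """Local stand-in mimicking an interactive world-model interface."""

    def reset_from_image(self, image: np.ndarray) -> np.ndarray:
        # Zero-shot initialization: a single photo seeds the world state.
        self._frame = image.astype(np.float32)
        return self._frame

    def step(self, action: Action) -> np.ndarray:
        # A real model would run one action-conditioned decoding step here;
        # the stub just nudges the frame so the loop is runnable end to end.
        self._frame = np.clip(self._frame + 10.0 * action.yaw, 0.0, 255.0)
        return self._frame

def run(model: WorldModel, seed_image: np.ndarray, n_frames: int = 64) -> None:
    model.reset_from_image(seed_image)
    start = time.perf_counter()
    for _ in range(n_frames):
        t0 = time.perf_counter()
        model.step(Action(yaw=0.01))
        step_time = time.perf_counter() - t0
        # End-to-end interaction latency should stay well under 1 s;
        # sleep off any leftover budget to pace the stream at ~16 FPS.
        time.sleep(max(0.0, FRAME_BUDGET_S - step_time))
    fps = n_frames / (time.perf_counter() - start)
    print(f"paced throughput: {fps:.1f} FPS")

if __name__ == "__main__":
    photo = np.random.randint(0, 256, (480, 854, 3), dtype=np.uint8)
    run(WorldModel(), photo)
```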
Currently, the Ant Lingbo team has fully open-sourced the model weights and inference code of LingBot-World:
Website:
https://technology.robbyant.com/lingbot-world
Model:
https://www.modelscope.cn/collections/Robbyant/LingBot-world
https://huggingface.co/collections/robbyant/lingbot-world
Code:
https://github.com/Robbyant/lingbot-world
Key points:
🌍 Digital Training Ground: LingBot-World can simulate real physical causality, providing a low-cost trial-and-error space for AI robots.
⏱️ Super Long Memory: Supports logically consistent generation for up to 10 minutes, eliminating the "object deformation" common in long videos (a minimal probe sketch follows this list).
🎮 Real-Time Interaction: Generates at 16 FPS, with sub-second action response and immediate environmental feedback.
🖼️ Minimal Deployment: Zero-shot capability allows a single photo to be "transformed" into an interactive 3D simulation world.
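One way to quantify the look-away consistency claim is a revisit probe: capture a frame at a camera pose, generate roughly 60 seconds of frames while the camera is elsewhere, return to the same pose, and compare the revisited frame against the original. The sketch below uses PSNR as the similarity metric; both the metric and the surrounding harness are assumptions, not the team's published evaluation protocol.

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    """Peak signal-to-noise ratio between two uint8 frames, in dB."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0**2 / mse)

def revisit_consistency(before: np.ndarray, after: np.ndarray) -> float:
    """Similarity of two frames rendered at the same camera pose,
    one before and one after the 60-second look-away."""
    return psnr(before, after)

if __name__ == "__main__":
    # Stand-in frames; in practice both come from the model rollout.
    base = np.random.randint(0, 256, (480, 854, 3), dtype=np.uint8)
    drift = np.random.randint(-2, 3, base.shape)  # mild pixel drift
    revisited = np.clip(base.astype(int) + drift, 0, 255).astype(np.uint8)
    print(f"revisit PSNR: {revisit_consistency(base, revisited):.1f} dB")
```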
