Robotics technology is undergoing a fundamental transformation. Google DeepMind's recently released Gemini Robotics project showcases two new models that work together, enabling what DeepMind describes as a robotic system that can "think" before taking action. This approach could help overcome a central limitation of today's robots: they can only perform narrowly defined tasks.

Generative AI has become commonplace in text, image, audio, and video creation, and the same technology is now being applied to generating robot action instructions. The DeepMind team believes generative AI is uniquely important for robotics because it can unlock general-purpose capability.

The core issue facing current robots is over-specialization. Each robot requires intensive training for specific tasks and performs poorly when handling others. Carolina Parada, head of Google DeepMind's robotics division, said: "Today's robots are highly customized and difficult to deploy, often requiring months to install a robot unit that can only perform a single task."


The fundamental characteristics of generative systems make AI-driven robots more versatile: they can adapt to new environments and workspaces without being reprogrammed. DeepMind's current robotics approach relies on two models working in concert, one responsible for thinking and the other for execution.

The two new models are named Gemini Robotics 1.5 and Gemini Robotics-ER 1.5. The former is a vision-language-action model that generates robot action instructions using visual and text data. The "ER" in the latter stands for embodied reasoning, a vision-language model that receives visual and text input and generates the steps needed to complete complex tasks.
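
To make the division of inputs and outputs concrete, here is a minimal Python sketch. It is illustrative only: the class names, method signatures, and return types are assumptions made for this article, not DeepMind's published interface.

```python
class EmbodiedReasoner:
    """Stand-in for Gemini Robotics-ER 1.5: image and text in, plan out."""

    def plan(self, image: bytes, instruction: str) -> list[str]:
        # A real system would query the model here; the fixed plan below
        # exists only to illustrate the shape of the output.
        return [
            "locate the target object",
            "grasp it",
            "place it at the requested location",
        ]


class VisionLanguageActionModel:
    """Stand-in for Gemini Robotics 1.5: one plan step in, motor commands out."""

    def act(self, image: bytes, step: str) -> list[str]:
        # Placeholder: a real VLA model would emit low-level action tokens
        # or joint targets rather than strings.
        return [f"<motor command for: {step}>"]
```

The key point the sketch captures is that the ER model's output is natural language, while the VLA model's output is something a robot body can act on.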

DeepMind describes Gemini Robotics-ER 1.5 as the first robotic AI system able to simulate reasoning, similar to the reasoning process of modern text chatbots, though "thinking" may not be an entirely accurate term in the field of generative AI. According to DeepMind, the ER model achieved top results on academic and internal benchmarks, indicating that it can make accurate decisions about how to interact with physical space. It does not perform any actions itself, however; that requires the cooperation of Gemini Robotics 1.5.

For example, when a robot is asked to sort a pile of laundry into white and colored piles, Gemini Robotics-ER 1.5 processes the request and analyzes an image of the physical environment. It can also use tools such as Google Search to gather additional information. The ER model then generates natural-language instructions, giving the robot the specific steps needed to complete the task.
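
Based on that description, the overall flow might be wired together as in the following sketch. Everything here is hypothetical: `reasoner`, `executor`, `camera`, and `robot` are assumed interfaces standing in for Gemini Robotics-ER 1.5, Gemini Robotics 1.5, and the hardware; the real stack is not exposed in this form.

```python
def run_sorting_task(reasoner, executor, camera, robot) -> None:
    """Hypothetical control loop for the laundry-sorting example."""
    request = "sort the laundry into white and colored piles"
    frame = camera.capture()

    # Stage 1: the ER model turns the request plus a view of the scene
    # into ordered natural-language steps. (Per DeepMind, it may also
    # consult tools such as Google Search at this stage.)
    steps = reasoner.plan(image=frame, instruction=request)

    # Stage 2: each step goes to the VLA model, which grounds it in a
    # fresh view of the scene and emits low-level commands.
    for step in steps:
        frame = camera.capture()
        for command in executor.act(image=frame, step=step):
            robot.execute(command)
```

Note that the two stages communicate only through natural-language steps, which is what allows each model to be developed and evaluated on its own.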

The innovation of this dual-model architecture lies in separating reasoning from execution. The reasoning model focuses on understanding task requirements and environmental conditions and developing a detailed action plan; the execution model converts that plan into specific robot actions. This division of labor gives the robot system complex planning ability without sacrificing precise, efficient execution.
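
One practical consequence of this separation, sketched below under the same assumptions as the examples above, is that the intermediate plan is plain language and can therefore be logged or checked before any motor command is issued. The `approve` callback is an invented placeholder for whatever review step a deployment might use.

```python
def execute_with_review(reasoner, executor, camera, robot, approve) -> None:
    """Same hypothetical interfaces as above, plus an external check."""
    frame = camera.capture()
    steps = reasoner.plan(image=frame, instruction="clear the table")
    print("Proposed plan:", steps)  # the intermediate plan is human-readable

    # Gate execution on an external check: a human reviewer, a rule-based
    # safety filter, or another model could all stand behind `approve`.
    if approve(steps):
        for step in steps:
            for command in executor.act(image=camera.capture(), step=step):
                robot.execute(command)
```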

From a technology-development perspective, this work may mark an important turning point for robotics, from specialization toward generalization. Traditional robots require extensive training and tuning for each new task, whereas robots with generative AI capabilities can, in theory, adapt quickly to new work environments through natural-language instructions.

Of course, the technology is still in its early stages, and real deployments will face challenges: robot performance in complex real-world environments, safety guarantees, and cost control all need further work. Still, DeepMind's attempt points to a promising direction for the future development of robotics.

As AI technology continues to advance, we may soon witness a historic moment where robots transition from simple task executors to true intelligent assistants.