Google has introduced a new capability called "Agentic Vision" for its lightweight Gemini 3 Flash model. The upgrade moves past the old limitation of vision models that could only "glance and guess," letting the AI analyze images the way a human expert would: by actively exploring them and reasoning in depth.


Previously, when faced with information-dense images (a distant road sign, a complex circuit diagram, small print), AI models often lost detail because they processed the whole image in a single pass. Agentic Vision introduces a "think, act, observe" loop: given a complex visual question, Gemini 3 first drafts an analysis plan, then writes and executes Python code to crop, rotate, or annotate parts of the image, and finally answers based on the high-resolution details it has uncovered.
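To make the loop concrete, here is a hypothetical sketch of the kind of Python the model might generate and run during the "act" step. The file names, coordinates, and scenario (reading a distant road sign) are illustrative assumptions, not the model's actual output.

```python
# Hypothetical sketch of code Gemini might generate inside its
# "think, act, observe" loop; file names and coordinates are assumptions.
from PIL import Image

# Load the original scene, e.g. a street photo with a distant road sign.
img = Image.open("street_scene.png")

# "Act": crop the small region the model has decided to inspect closely.
left, top, right, bottom = 1180, 420, 1360, 560  # assumed sign location
sign = img.crop((left, top, right, bottom))

# Enlarge the crop so fine text becomes legible ("zoom in" for evidence).
zoomed = sign.resize((sign.width * 4, sign.height * 4),
                     Image.Resampling.LANCZOS)
zoomed.save("sign_zoomed.png")

# "Observe": the enlarged crop is fed back into the model's context,
# and the loop repeats until it can answer confidently.
```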

This investigative working style improves Gemini's accuracy by 5% to 10% on difficult visual tasks. The model is no longer just classifying pixels; it has learned to "zoom in" and gather evidence as needed.

Currently, this capability is available first in Google AI Studio and on Vertex AI; developers only need to enable the "code execution" tool to use it. Google says the feature will later reach general users through the Gemini app's "Thinking Mode", bringing this deep visual reasoning to mobile AI assistants.
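For developers, enabling this comes down to turning on the code-execution tool in a request. Below is a minimal sketch using the google-genai Python SDK; the model string "gemini-3-flash-preview", the API key placeholder, the image file, and the prompt are assumptions for illustration, so check AI Studio for the exact model identifier.

```python
# Minimal sketch: enabling the code-execution tool with the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("circuit_diagram.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash-preview",  # assumed model name
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "What is the value of the resistor nearest the voltage regulator?",
    ],
    config=types.GenerateContentConfig(
        # With code execution enabled, the model can write and run Python
        # to crop, rotate, or annotate the image while it reasons.
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

print(response.text)
```

The same pattern works on Vertex AI by constructing the client with project and location settings instead of an API key.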

Key points:

  • 👁️ Google has launched Agentic Vision, combining visual reasoning with Python code execution and breaking away from the traditional static image-recognition mode.

  • 🔍 Introduces a "cyclic analysis" mechanism, allowing AI to independently crop, zoom in, and annotate images, significantly improving the accuracy of identifying complex details.

  • 🛠️ This feature is now available to developers via API, and will be integrated into the "Thinking Mode" of the Gemini app for general users in the future.