Google has introduced a new capability called "Agentic Vision" for its lightweight Gemini 3 Flash model. The upgrade moves past the old limitation of vision models that could only "glance and guess," letting the AI analyze images the way a human expert would: by actively exploring them and reasoning in depth.


Previously, when faced with information-dense images (a distant road sign, a complex circuit diagram, small print), AI models often lost detail because they processed the whole image in a single pass. Agentic Vision introduces a "think, act, observe" loop: given a complex visual question, Gemini 3 first drafts an analysis plan, then writes and executes Python code to crop, rotate, or annotate parts of the image, and finally answers based on the high-resolution details it has uncovered.
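To make the loop concrete, here is a hypothetical sketch of the kind of Python the model might generate and run during the "act" step. The file names, coordinates, and scenario (reading a distant road sign) are illustrative assumptions, not the model's actual output.

```python
# Hypothetical sketch of code Gemini might generate inside its
# "think, act, observe" loop; file names and coordinates are assumptions.
from PIL import Image

# Load the original scene, e.g. a street photo with a distant road sign.
img = Image.open("street_scene.png")

# "Act": crop the small region the model has decided to inspect closely.
left, top, right, bottom = 1180, 420, 1360, 560  # assumed sign location
sign = img.crop((left, top, right, bottom))

# Enlarge the crop so fine text becomes legible ("zoom in" for evidence).
zoomed = sign.resize((sign.width * 4, sign.height * 4),
                     Image.Resampling.LANCZOS)
zoomed.save("sign_zoomed.png")

# "Observe": the enlarged crop is fed back into the model's context,
# and the loop repeats until it can answer confidently.
```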

This investigative working style improves Gemini's accuracy by 5% to 10% on difficult visual tasks. The model is no longer just classifying pixels; it has learned to "zoom in" and gather evidence as needed.

Currently, this capability is available first in Google AI Studio and on Vertex AI; developers only need to enable the "code execution" tool to use it. Google says the feature will later reach general users through the Gemini app's "Thinking Mode", bringing this deep visual reasoning to mobile AI assistants.
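For developers, enabling this comes down to turning on the code-execution tool in a request. Below is a minimal sketch using the google-genai Python SDK; the model string "gemini-3-flash-preview", the API key placeholder, the image file, and the prompt are assumptions for illustration, so check AI Studio for the exact model identifier.

```python
# Minimal sketch: enabling the code-execution tool with the google-genai SDK.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("circuit_diagram.png", "rb") as f:
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-3-flash-preview",  # assumed model name
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "What is the value of the resistor nearest the voltage regulator?",
    ],
    config=types.GenerateContentConfig(
        # With code execution enabled, the model can write and run Python
        # to crop, rotate, or annotate the image while it reasons.
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)

print(response.text)
```

The same pattern works on Vertex AI by constructing the client with project and location settings instead of an API key.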

Key points:

  • 👁️ Google has launched Agentic Vision, combining visual reasoning with Python code execution and breaking away from the traditional static image-recognition mode.

  • 🔍 Introduces a "cyclic analysis" mechanism, allowing AI to independently crop, zoom in, and annotate images, significantly improving the accuracy of identifying complex details.

  • 🛠️ This feature is now available to developers via API, and will be integrated into the "Thinking Mode" of the Gemini app for general users in the future.