Google has recently showcased significant upgrades to the generative image editing capabilities in version 17.10.54.sa.arm64 (beta) of the Gemini Android app. This build introduces a deeply integrated markup interface and an inline text description box, aiming to address two pain points in current AI image editing: imprecise instruction delivery and fragmented workflows. Together, they further enhance Gemini's ability to fine-tune specific parts of generated content (such as Nano Banana images).


The core of this iteration lies in the reconstruction of the interaction logic. The previous basic sketching support required users to exit the editing interface before sending instructions to the chatbot. The new interface instead lets users tap the "pencil" icon, draw high-precision marks directly on specific areas of an image, and simultaneously type their modification intent into the newly added text box at the bottom.

This dual-modal interaction approach of "visual positioning + natural language" significantly improves the model's accuracy in understanding localized modification instructions. In addition, the beta build reserves placeholders for Resizing and Effects options, suggesting that Gemini is evolving from a single text-to-image tool into a comprehensive image workstation that integrates generation, cropping, and filter processing.
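To make the "visual positioning + natural language" idea concrete, the sketch below shows how a localized edit request might pair a user-drawn markup region with a text instruction. Every field and function name here (`build_edit_request`, `edit_region`, and so on) is a hypothetical illustration of the interaction pattern, not Google's actual Gemini API.

```python
# Hypothetical sketch of a "visual positioning + natural language" edit request.
# All names and fields are illustrative assumptions, not Google's real API.

def build_edit_request(image_id: str, box: tuple, instruction: str) -> dict:
    """Pair a user-drawn markup box (pixel coordinates) with a
    natural-language modification intent for a localized edit."""
    x0, y0, x1, y1 = box
    if not (x0 < x1 and y0 < y1):
        raise ValueError("markup box must have positive width and height")
    return {
        "image_id": image_id,
        # The region the user circled with the "pencil" tool.
        "edit_region": {"x0": x0, "y0": y0, "x1": x1, "y1": y1},
        # The intent typed into the text box at the bottom.
        "instruction": instruction,
    }

# Example: the user marks the upper part of the image and types an intent.
request = build_edit_request("nano_banana_001", (0, 0, 1024, 300),
                             "make the sky look like sunset")
print(request["edit_region"])
```

The point of the pattern is that the model receives both signals at once, so the text no longer has to describe *where* to edit, only *what* to change.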

From an industry trend perspective, Google's move reflects that the focus of competition in generative AI is shifting from "creating something out of nothing" to "controlled editing with precision." By integrating complex markup tools into a native mobile application, Google aims to establish a higher interaction barrier in the fields of mobile AI photography and digital creation.

Although these features are currently known only from code analysis of the beta build and have not been officially released to the public, the "mark, then modify immediately" logic they demonstrate marks a key step forward in multimodal models' ability to perceive users' fine-grained aesthetic intent, and should further accelerate the spread of AI image generation from entertainment into professional creative workflows.