Google has recently announced a major upgrade to the file search feature in the Gemini API, aimed at giving developers more comprehensive multimodal retrieval-augmented generation (RAG) capabilities. This update goes beyond the limits of traditional text-only retrieval, extending the AI's understanding to images and deeply integrated complex documents, and marks a notable step forward for retrieval accuracy in enterprise-level AI applications.

On the technical side, the new file search function is built on the Gemini Embedding model. Unlike previous systems that relied solely on text vector search, the upgraded system has unified multimodal embedding capabilities, allowing it to recognize and process visual information in PDFs, documents, and various types of images. This means developers no longer need to spend time building their own vector databases or document segmentation pipelines: the complete RAG workflow, from data upload to information retrieval, can now be handled within the Gemini API itself.
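To make the workflow concrete, here is a minimal local sketch of the pipeline that File Search now handles internally: embed the documents, embed the query, and retrieve the closest match. The bag-of-words "embedding" below is a deliberately trivial stand-in for illustration only; the actual service uses the Gemini Embedding model and its own storage.

```python
# Toy sketch of the embed -> store -> retrieve loop that the upgraded
# File Search encapsulates. The embedding is a bag-of-words stand-in,
# not the real Gemini Embedding model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Stand-in embedding: a term-frequency vector over lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: dict[str, str], top_k: int = 1) -> list[str]:
    """Rank stored documents by similarity to the query, return the best."""
    q = embed(query)
    ranked = sorted(docs, key=lambda name: cosine(q, embed(docs[name])), reverse=True)
    return ranked[:top_k]

docs = {
    "arch.pdf": "technical architecture diagram of the payment service",
    "sales.pdf": "quarterly sales trend chart and revenue figures",
}
print(retrieve("sales trend chart", docs))  # ['sales.pdf']
```

In the hosted version all three steps (chunking, embedding, and ranking) happen server-side, which is precisely the infrastructure burden the upgrade removes.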


In practical terms, this advancement addresses a pain point of traditional RAG systems: their inability to handle non-text content. In the past, charts, design diagrams, or product screenshots in documents often became "blind spots" for AI, leaving key context out of answers. Now the Gemini API can natively understand these visual elements. For example, when a company uploads a PDF containing technical architecture diagrams or sales trend charts, the AI can combine the chart data with the surrounding text to draw accurate inferences, significantly enhancing the practicality of customer-service chatbots and document analysis systems.

To make large knowledge bases easier to manage, Google has also introduced custom metadata filtering. Developers can tag files along dimensions such as department, time, and category, then narrow retrieval to files matching predefined conditions, so that irrelevant material never enters the search and AI-generated answers stay focused.
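The idea can be sketched as a pre-filter that runs before any vector search. The field names below ("department", "year") are illustrative assumptions, not the API's actual metadata schema.

```python
# Hypothetical sketch of metadata filtering ahead of retrieval: tag each
# file, then restrict the candidate set before any similarity search runs.
# Field names are illustrative, not the Gemini API's schema.
from dataclasses import dataclass, field

@dataclass
class TaggedFile:
    name: str
    metadata: dict = field(default_factory=dict)

def filter_files(files, **conditions):
    """Keep only files whose metadata matches every given condition."""
    return [f for f in files
            if all(f.metadata.get(k) == v for k, v in conditions.items())]

files = [
    TaggedFile("q3-sales.pdf", {"department": "sales", "year": 2025}),
    TaggedFile("hr-policy.pdf", {"department": "hr", "year": 2024}),
    TaggedFile("q2-sales.pdf", {"department": "sales", "year": 2024}),
]
print([f.name for f in filter_files(files, department="sales", year=2025)])
# ['q3-sales.pdf']
```

Filtering first shrinks the candidate set, which is why answers become more focused: the model never sees chunks from the wrong department or time period.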

In addition, to address a top user concern, information traceability, the Gemini API now supports page-level citations. When generating answers, the AI indicates the specific page an item of information came from rather than just pointing to the whole file. This added transparency helps users quickly verify content and makes deeper reading easier.

Currently, this enhanced file search feature is available to developers worldwide. Users can access it through Google AI Studio or the Google Cloud platform to experience the development convenience and efficiency improvements brought by multi-modal RAG.