Zhejiang University Alumni Collaborate with Microsoft to Launch Multimodal Model LLaVA, Challenging GPT-4V


The Zhipu team has open-sourced four core multimodal models: GLM-4.6V for visual understanding, AutoGLM for device control, GLM-ASR for speech recognition, and GLM-TTS for speech synthesis. The release showcases the team's latest progress in the multimodal field and lays the foundation for its video generation technology.
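To make the component roles concrete, here is a minimal sketch of one way such models could be chained into a voice round trip: speech recognition transcribes audio, a language model drafts a reply, and speech synthesis speaks it. Every class and method name below is a hypothetical stand-in for illustration, not the released GLM APIs.

```python
class GLMASRStub:
    """Hypothetical stand-in for a speech-recognition model (audio -> text)."""
    def transcribe(self, audio: bytes) -> str:
        return "stubbed transcript"

class GLMChatStub:
    """Hypothetical stand-in for a language model (text -> text)."""
    def generate(self, prompt: str) -> str:
        return f"stubbed reply to: {prompt}"

class GLMTTSStub:
    """Hypothetical stand-in for a speech-synthesis model (text -> audio)."""
    def synthesize(self, text: str) -> bytes:
        return text.encode()

def voice_turn(audio: bytes,
               asr=GLMASRStub(), llm=GLMChatStub(), tts=GLMTTSStub()) -> bytes:
    transcript = asr.transcribe(audio)  # 1. speech -> text (the ASR model's role)
    reply = llm.generate(transcript)    # 2. text -> response (the LLM's role)
    return tts.synthesize(reply)        # 3. response -> speech (the TTS model's role)

print(voice_turn(b"fake-audio-bytes"))
```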
Google has introduced image recognition features for NotebookLM, allowing users to upload whiteboard notes, textbook pages, or images of tables; the system automatically recognizes the text and performs semantic analysis, so users can search image content directly in natural language. The feature is free across all platforms, and local processing options to protect privacy are planned. Under the hood, multimodal technology distinguishes handwritten from printed text, analyzes table structures, and intelligently links the extracted content with existing notes.
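As a rough illustration of the general pattern behind natural-language search over image content (not NotebookLM's actual implementation), the sketch below OCRs each image, embeds the extracted text, and ranks images by cosine similarity to the query embedding. The `ocr_image` and `embed` functions are toy stand-ins for a real multimodal OCR model and a real text-embedding model.

```python
import math

def ocr_image(image_path: str) -> str:
    """Stand-in for a multimodal OCR step (handwritten and printed text)."""
    return "Q3 revenue table: product A 1.2M, product B 0.8M"

def embed(text: str) -> list[float]:
    """Stand-in for a text-embedding model; here a toy bag-of-letters vector."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search_images(query: str, image_paths: list[str]) -> list[tuple[str, float]]:
    q = embed(query)
    scored = [(p, cosine(q, embed(ocr_image(p)))) for p in image_paths]
    return sorted(scored, key=lambda s: s[1], reverse=True)  # best match first

print(search_images("revenue by product", ["whiteboard.png", "textbook.jpg"]))
```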
ByteDance, together with several universities, has launched Sa2VA, which integrates LLaVA for video understanding with SAM-2 for precise object segmentation; the two models' complementary capabilities strengthen video analysis.
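The division of labor described above can be sketched as follows, under the assumption (hypothetical interfaces, not the released Sa2VA code) that the language model grounds a text prompt to a point in the video and the segmenter propagates a mask from that point across frames.

```python
import numpy as np

class VLMStub:
    """Hypothetical stand-in for a LLaVA-style video-language model."""
    def locate(self, frames: list[np.ndarray], prompt: str) -> tuple[int, int]:
        # Returns an (x, y) click point for the referenced object in frame 0.
        return (64, 64)

class TrackerStub:
    """Hypothetical stand-in for a SAM-2-style promptable video segmenter."""
    def segment(self, frames: list[np.ndarray],
                point: tuple[int, int]) -> list[np.ndarray]:
        # Returns one boolean mask per frame, propagated from the click point.
        return [np.zeros(f.shape[:2], dtype=bool) for f in frames]

def refer_and_segment(frames, prompt, vlm=VLMStub(), tracker=TrackerStub()):
    point = vlm.locate(frames, prompt)     # language grounding: text -> point
    return tracker.segment(frames, point)  # segmentation: point -> per-frame masks

frames = [np.zeros((128, 128, 3), dtype=np.uint8) for _ in range(8)]
masks = refer_and_segment(frames, "the red car turning left")
print(len(masks), masks[0].shape)
```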
"Alibaba has established a robotics and embodied AI team", led by executive Lin Junyang, aimed at developing innovative robotics technology and promoting the advancement of embodied AI. Embodied AI refers to intelligent systems that can interact with the environment through physical bodies, marking the company's further expansion in the field of intelligence.
During a technical livestream at 1 AM today, OpenAI officially launched its latest and most powerful multimodal models: o4-mini and the full-power o3. The models process text and images together, act as agents that automatically invoke tools such as web search, image generation, and code execution, and offer a deep thinking mode that reasons over images within the chain of thought.
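The agent behavior described here follows a common tool-calling loop: at each step the model either requests a tool or returns a final answer, and tool results are fed back into the conversation until it finishes. The sketch below shows that generic loop with a stubbed model and a stubbed `web_search` tool; it is a schematic illustration, not OpenAI's API.

```python
def web_search(query: str) -> str:
    """Stubbed tool; a real agent would call an actual search backend."""
    return f"(stub) top results for {query!r}"

TOOLS = {"web_search": web_search}

def call_model(messages: list[dict]) -> dict:
    """Stand-in for a reasoning model; real models decide this themselves."""
    if not any(m["role"] == "tool" for m in messages):
        # No tool results yet: request a search on the user's question.
        return {"tool": "web_search", "args": {"query": messages[-1]["content"]}}
    return {"answer": "final answer synthesized from tool results"}

def agent(user_prompt: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        step = call_model(messages)
        if "answer" in step:                          # model is done reasoning
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])  # execute requested tool
        messages.append({"role": "tool", "content": result})
    return "step limit reached"

print(agent("latest multimodal model releases"))
```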