Today, as artificial intelligence continues to win top-level competitions, we seem to take for granted that these digital brains have completely surpassed humans. However, a new study jointly released by several leading institutions, including UniPat AI, xbench, Alibaba, Moonshot AI, and StepZen, has poured cold water on this optimism. The results are striking: even Gemini 3 Pro Preview, the leader in this field, only slightly outperforms a three-year-old child and still trails the cognitive level of a six-year-old by roughly 20%.
This visual reasoning "closed-book exam," called BabyVision, has exposed the shortcomings of large models in perceiving the physical world. While human infants can easily complete tasks like "spot the difference" or spatial puzzles, the AI giants that once made light work of mathematical challenges now struggle.
The "Language Trap": Why Can't AI See the World?
Why do large models with trillions of parameters get stuck on such basic visual tasks? The researchers found that the core issue is that large models remain "language animals": when processing visual information, they tend to first translate images into text descriptions before performing logical reasoning. This roundabout approach works for macro-level concepts, but for visual features that words cannot precisely capture, such as slight curve deviations, complex geometric intersections, or subtle spatial occlusions, much of the information is lost in translation.
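To make this failure mode concrete, here is a minimal, hypothetical Python sketch of such a caption-then-reason pipeline; the functions and toy "images" are illustrative stand-ins, not the systems evaluated in BabyVision. It shows how two puzzle pieces that differ by a single pixel can collapse into the same text description before any reasoning happens.

```python
# Hypothetical sketch of the "language trap": the image is compressed into a
# caption, and only the caption reaches the reasoning step.

def caption_image(image_pixels: list[list[int]]) -> str:
    """Stand-in vision-to-text stage: reduces pixels to a coarse description."""
    # A real VLM would use a learned captioner; here we keep only coarse
    # statistics, mimicking how fine geometry gets lost in words.
    height, width = len(image_pixels), len(image_pixels[0])
    return f"an image of size {width}x{height} containing some shapes"

def answer_from_caption(caption: str, question: str) -> str:
    """Stand-in language-only reasoner: it never sees the pixels again."""
    return f"Reasoning over '{caption}' to answer: '{question}'"

# Usage: the one-pixel difference between the two pieces vanishes, because
# both images map to the same caption before reasoning begins.
image_a = [[0, 1], [1, 0]]
image_b = [[0, 1], [1, 1]]  # differs from image_a by a single pixel
print(answer_from_caption(caption_image(image_a), "Do the pieces match?"))
print(answer_from_caption(caption_image(image_b), "Do the pieces match?"))
```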
Four "Disasters" in Visual Reasoning
The research team categorized the visual defects of large models into four dimensions through the BabyVision benchmark:
Missing Non-Verbal Fine Details: Large models often fail to distinguish pixel-level geometric differences, and in puzzle-matching tasks they frequently choose the wrong answer because they cannot "imagine" shape rotations and alignments.
Loss of Manifold Consistency: In long-distance connection or trajectory-tracking tasks, large models act like children lost in a maze: they easily go "off track" at path intersections and lose the original perceptual clues.
Lack of Spatial Imagination: Text descriptions cannot faithfully reconstruct three-dimensional space, so large models frequently miscount layers or make projection errors when inferring the side views of block arrangements or hidden volumes.
Impaired Visual Pattern Induction: Models tend to rigidly "count attributes" rather than understand how patterns change, and struggle to abstract deep causal logic from a small number of visual examples.
Pain and Rebirth in Embodied Intelligence
This conclusion undoubtedly puts pressure on the currently hot "embodied intelligence" field. If an AI cannot even perceive the physical environment as accurately as a six-year-old child, how can we expect it to safely assist humans in the real physical world?
To address this bottleneck, the researchers propose two evolutionary paths: one is to introduce reinforcement learning with verifiable rewards (RLVR), using explicit intermediate reasoning to mitigate perceptual uncertainty; the other is to fully embrace native multimodal reasoning, letting models learn to "compute" directly in pixel space, as Sora 2 does, rather than detouring through language.
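As a rough illustration of the first path, the toy Python sketch below (assumed details, not from the study) rewards a model only when its final answer to a visual puzzle can be checked automatically, which is the core idea behind verifiable-reward reinforcement learning.

```python
# Hypothetical RLVR-style toy loop: sample answers, score them with a
# programmatic verifier, and reinforce only the rewarded samples.
import random

def verifiable_reward(predicted_answer: str, ground_truth: str) -> float:
    """Binary, automatically checkable reward: 1.0 only for an exact match."""
    return 1.0 if predicted_answer.strip() == ground_truth.strip() else 0.0

def policy_sample(question: str) -> str:
    """Stand-in for the model's sampled answer; a real policy would generate
    intermediate reasoning steps before committing to a final choice."""
    return random.choice(["A", "B", "C", "D"])

# One toy "training" step: the kept samples stand in for a policy update.
question, ground_truth = "Which piece completes the puzzle?", "C"
samples = [policy_sample(question) for _ in range(8)]
rewards = [verifiable_reward(s, ground_truth) for s in samples]
reinforced = [s for s, r in zip(samples, rewards) if r > 0]
print(f"samples={samples}, rewards={rewards}, reinforced={reinforced}")
```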
This "evolutionary regression" study in AI's history reminds us that the path toward artificial general intelligence (AGI) may not lie in more difficult math problems, but in the puzzle games that a six-year-old can easily master.
