In computer vision, getting AI to observe and describe every corner of an image the way humans do has long been a challenge. Recently, Apple, jointly with the University of Wisconsin-Madison, released a new AI training framework aimed at exactly this problem.
The framework targets "dense image description": teaching AI to precisely capture and explain image details such as "a red apple on the table" or "a pedestrian in the distance," rather than just producing general summaries.

Reinforcement Learning That Pays Off: Qwen2.5 Acts as the "Referee"
Traditional image annotation often relies on expensive human labor or on large models prone to hallucination, which leads to inconsistent data quality. The Apple research team addressed this with an innovative reinforcement learning mechanism: the system first uses GPT-5 and Gemini 2.5 Pro to generate candidate descriptions, Gemini 2.5 Pro then refines the scoring criteria, and a Qwen2.5 model acts as the referee, providing scores and feedback.
This structured, precise feedback lets the model detect and correct its errors during training, achieving higher descriptive accuracy at a smaller parameter scale.
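To make the referee idea concrete, here is a minimal, hypothetical sketch of how rubric-based scoring can rank candidate captions. The real pipeline uses large judge models such as Qwen2.5; a toy keyword-matching scorer stands in here so the example is self-contained, and the function names and rubric are invented for illustration.

```python
# Toy stand-in for the "referee": scores captions against a rubric.
# Not Apple's implementation; a sketch of the general technique.

def rubric_score(caption: str, rubric: list[str]) -> float:
    """Return the fraction of rubric items the caption mentions."""
    text = caption.lower()
    hits = sum(1 for item in rubric if item.lower() in text)
    return hits / len(rubric)

def rank_candidates(candidates: list[str], rubric: list[str]) -> list[tuple[str, float]]:
    """Score every candidate caption and sort best-first, the way an
    RL trainer would turn referee feedback into per-sample rewards."""
    scored = [(c, rubric_score(c, rubric)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

rubric = ["red apple", "table", "pedestrian"]
candidates = [
    "A red apple sits on the table while a pedestrian passes by.",
    "A fruit on some furniture.",
]
ranked = rank_candidates(candidates, rubric)
print(ranked[0][1])  # → 1.0: the first caption covers every rubric item
```

In the actual framework, the score would come from a judge model's structured feedback rather than keyword matching, but the shape of the loop is the same: generate candidates, score them against criteria, and use the scores as the training signal.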
A Victory for Compact Models: Lower Hallucination Rate Than a 720-Billion-Parameter Model
The RubiCap series of models (ranging from 2 billion to 7 billion parameters) trained with this framework demonstrated remarkable efficiency in testing. Experimental data show that the 7-billion-parameter RubiCap model ranked highest in blind tests, with a hallucination error rate even lower than that of a cutting-edge 720-billion-parameter model. More surprisingly, the 3-billion-parameter mini version outperformed the 7-billion-parameter version on some metrics.
