The Allen Institute for AI (AI2) recently released MolmoWeb, a breakthrough fully open-source web agent.
Core Technology: Seeing Web Pages Like Humans
MolmoWeb's operating logic is intuitive: it captures a screenshot of the current browser window, uses visual analysis to decide the next action (such as clicking, scrolling, or paginating), executes it, and repeats. This "what you see is what you get" approach makes it more robust than traditional agents, because a page's visual layout is usually more stable than its underlying code, and its decision-making process is fully transparent and explainable to human users.
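The screenshot-decide-execute loop can be sketched in a few lines. This is a minimal illustration, not MolmoWeb's actual API: every name here (`capture_screenshot`, `choose_action`, `execute`, the `Action` type) is a hypothetical placeholder, and the "model" is a stub standing in for the vision-language model call.

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "scroll", "type", "done"
    target: str = ""   # element description or text to type

def capture_screenshot() -> bytes:
    # Hypothetical browser hook: would return the rendered window's pixels.
    return b"...rendered page pixels..."

def execute(action: Action) -> None:
    # Hypothetical browser hook: would click, scroll, or type.
    pass

def choose_action(screenshot: bytes, goal: str) -> Action:
    # Stub for the vision-language model: the key point is that it looks
    # only at the rendered pixels, never at the page's underlying HTML.
    if b"checkout" in screenshot:
        return Action("click", "checkout button")
    return Action("done")

def run_agent(goal: str, max_steps: int = 20) -> list[Action]:
    """Repeat screenshot -> decide -> execute until done or out of steps."""
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture_screenshot()
        action = choose_action(screenshot, goal)
        history.append(action)
        if action.kind == "done":
            break
        execute(action)
    return history
```

The `max_steps` cap reflects a common safeguard in agent loops: without it, a model that never emits a terminal action would browse forever.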

Performance Leap: Small Models Beat Big Ones
Although MolmoWeb comes in only 4B and 8B parameter sizes, its performance shows it is "small but powerful":
Leading the Rankings: On the WebVoyager benchmark, the 8B version scored 78.2%, not only ranking first among open-source models but also approaching OpenAI's proprietary o3 model (79.3%).
Great Potential: Research found that running a task multiple times and selecting the best result raises the success rate further, to 94.7%.
Precise Positioning: In UI element grounding benchmarks, it even surpassed Anthropic's Claude 3.7.
Data Support: The Largest Open Dataset in History
Beyond the model weights, AI2 also released a large dataset named MolmoWebMix. This dataset includes:
36,000 real browsing tasks completed by human volunteers.
More than 2.2 million screenshot-question pairs.
Automated synthetic data verified by GPT-4o. Experiments showed that synthetic data can even outperform human trajectories in guiding the agent toward the optimal path.

Open Source Spirit and Future Challenges
Currently, MolmoWeb is fully open-sourced on
