AI2 Releases Fully Open-Source Web Agent MolmoWeb: Control Web Pages With "Vision" Alone


The Allen Institute for Artificial Intelligence (AI2) recently released MolmoWeb, a fully open-source web agent. Unlike traditional agents that rely on a page's underlying code (the DOM), MolmoWeb makes decisions based solely on screenshots, marking a notable step forward for "vision-driven" web navigation.

Core Technology: “Seeing” Webpages Like Humans

MolmoWeb's operation is straightforward: it captures a screenshot of the current browser window, analyzes it visually to decide the next action (such as clicking, scrolling, or moving between pages), executes that action, and then repeats. This "what you see is what you get" approach makes it more robust than traditional agents, because a page's visual layout is generally more stable than its underlying code, and its decision process is transparent and explainable to human users.
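The screenshot, decide, act, repeat cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not MolmoWeb's actual code: the `Action` type, the action names, the `Browser` interface, and `run_agent` are all hypothetical names introduced here, and the model is stood in for by a plain `policy` callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Action:
    # Hypothetical action record; the real agent's action format may differ.
    kind: str            # e.g. "click", "scroll", "done"
    argument: str = ""   # e.g. screen coordinates or a scroll direction


class Browser(Protocol):
    # Assumed minimal browser interface, for illustration only.
    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...


# The policy sees only the task text, the current screenshot, and past actions.
Policy = Callable[[str, bytes, List[Action]], Action]


def run_agent(browser: Browser, policy: Policy, task: str,
              max_steps: int = 20) -> List[Action]:
    """Run the screenshot -> decide -> act loop until "done" or max_steps."""
    history: List[Action] = []
    for _ in range(max_steps):
        image = browser.screenshot()           # 1. look at the page
        action = policy(task, image, history)  # 2. decide from pixels alone
        history.append(action)
        if action.kind == "done":              # the policy signals completion
            break
        browser.execute(action)                # 3. act, then loop again
    return history
```

Because the policy never receives the DOM, the returned list of actions, paired with the screenshots that prompted them, is a complete record of what the agent saw and did.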

Performance Breakthrough: Small Models Outperform Giants

Although MolmoWeb comes in only 4B- and 8B-parameter versions, it delivers impressive performance:

Leading in Benchmarks: On the WebVoyager benchmark, the 8B version scored 78.2%, the top result among open-source models and close to OpenAI's proprietary o3 model (79.3%).

Huge Potential: Running each task multiple times and selecting the best result raises the success rate further, to a reported 94.7%.

Precise Targeting: On UI element localization benchmarks, it even surpasses Anthropic's Claude 3.7.
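The multiple-run figure above is a "best of k attempts" measurement: a task counts as solved if any of its attempts succeeds. A minimal sketch of that bookkeeping follows. The function name `pass_at_k` and the sample data are illustrative assumptions, not AI2's evaluation code or results.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    outcomes[i][j] is True if attempt j on task i succeeded.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("need at least one task")
    solved = sum(1 for attempts in outcomes if any(attempts[:k]))
    return solved / len(outcomes)


# Made-up outcomes for four tasks with two attempts each (illustration only).
runs = [
    [True, False],
    [False, True],
    [False, False],
    [True, True],
]
single = pass_at_k(runs, 1)  # 0.5: only first attempts count
best_of_two = pass_at_k(runs, 2)  # 0.75: a second attempt rescues one task
```

A score of this kind assumes there is some way to tell which attempt succeeded, so it is better read as an upper bound on what repeated sampling can deliver than as a single-run success rate.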

Data Support: The Largest Open Dataset of Its Kind

AI2 not only open-sourced the model weights but also contributed a massive dataset called MolmoWebMix. This dataset includes:

  • 36,000 real browsing tasks completed by human volunteers.

  • Over 2.2 million screenshot question-answer pairs.

  • Automated synthetic data verified by GPT-4o. Experiments show that synthetic data can even outperform human trajectories in guiding agents to find the “optimal path.”

Open-Source Spirit and Future Challenges

MolmoWeb is currently available in full on Hugging Face and GitHub under the Apache 2.0 license. Challenges remain in handling complex instructions, login authentication, and legal compliance (such as websites' terms of service), but AI2 firmly believes that only complete transparency and community collaboration can truly counteract the data monopolies of large tech companies.
