This repository is a reference implementation of the paper Coding Agents with Multimodal Browsing are Generalist Problem Solvers and contains scripts for reproducing the experiments in the paper.
🏆 OpenHands-Versa currently ranks #1 on the SWE-Bench Multimodal and The Agent Company leaderboards.
Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. Similarly, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points, respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.
OpenHands-Versa is built on top of OpenHands, a popular framework for open-source AI agents, and its installation instructions are similar to those of OpenHands. We require sudo access to the machine, since experiments on The Agent Company need root privileges. All our experiments are run on Ubuntu (>= 22.04), and we provide installation instructions for it below:
- Docker
- Conda
- OS-specific dependencies:
  - Ubuntu: build-essential => `sudo apt-get install build-essential`
Make sure you have all of these dependencies installed before moving on to the next steps.
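If you are unsure whether these prerequisites are already present, a quick check such as the one below (assuming the tools are on your `PATH`) will print a version string for each installed dependency:

```bash
# Quick sanity check for the prerequisites listed above;
# each command prints a version string if the tool is installed.
docker --version          # Docker CLI
conda --version           # Conda package manager
gcc --version | head -n1  # provided by build-essential on Ubuntu
```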
We recommend creating a conda environment for installing the dependencies, as shown below:

```bash
# Install Python 3.12, nodejs >= 22.x, and poetry
conda create -n oh_versa python=3.12
conda activate oh_versa
conda install -c conda-forge "nodejs>=22"
conda install conda-forge::poetry
```

From the root directory of the project, run the command below to ensure OpenHands-Versa is ready to run on your system:

```bash
make build
```

OpenHands-Versa supports a diverse array of Language Models (LMs) through the powerful litellm library. We use claude-3-7-sonnet-20250219 and claude-sonnet-4-20250514 for our experiments. You can configure the LLMs by creating a config.toml file in the project root directory, similar to config_example.toml.
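As a rough illustration, assuming the usual OpenHands-style `[llm]` section, a minimal config.toml could be created as sketched below; the model name and API key are placeholders, and config_example.toml remains the authoritative reference for the available options:

```bash
# Minimal sketch of a config.toml, written from the project root.
# The [llm] section layout is assumed to match config_example.toml;
# replace the model name and API key placeholders with your own values.
cat > config.toml <<'EOF'
[llm]
model = "anthropic/claude-3-7-sonnet-20250219"  # or anthropic/claude-sonnet-4-20250514
api_key = "YOUR_ANTHROPIC_API_KEY"
EOF
```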
For details on support for other operating systems and LLMs, as well as debugging tips, please refer to Development.md.
We benchmark OpenHands-Versa on three popular and challenging agent benchmarks: GAIA, The Agent Company, and SWE-Bench Multimodal. For instructions on reproducing our results, please refer to the respective README.md files for GAIA, The Agent Company, and SWE-Bench Multimodal. Note that we use the Tavily API for our search tool, and running the experiments requires a search API key.
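For example, assuming the key is exposed through the environment variable conventionally used by Tavily's SDK (the exact mechanism this repository expects is documented in the benchmark-specific README.md files), the setup might look like:

```bash
# Hypothetical setup: obtain a search API key from tavily.com and export it
# before launching the benchmark scripts. Check the benchmark README.md files
# for the exact variable or config entry this repository reads.
export TAVILY_API_KEY="YOUR_TAVILY_API_KEY"
```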
The methodology in OpenHands-Versa has also been implemented in upstream OpenHands, and we recommend using the upstream repository if you want to use OpenHands-Versa in your own work.
Distributed under the MIT License. See LICENSE for more information.
```bibtex
@misc{soni2025codingagentsmultimodalbrowsing,
  title={Coding Agents with Multimodal Browsing are Generalist Problem Solvers},
  author={Aditya Bharat Soni and Boxuan Li and Xingyao Wang and Valerie Chen and Graham Neubig},
  year={2025},
  eprint={2506.03011},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.03011},
}
```
