OpenHands-Versa: Coding Agents with Multimodal Browsing are Generalist Problem Solvers

This repository is a reference implementation of the paper Coding Agents with Multimodal Browsing are Generalist Problem Solvers containing scripts for reproducing the experiments in the paper.

🏆 OpenHands-Versa currently ranks #1 on SWE-Bench Multimodal and The Agent Company leaderboards.

Overview

Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.

Installation and LLM Configuration

OpenHands-Versa is built on top the OpenHands - a popular framework for open-source AI Agents and the installation instructions are similar as that of OpenHands. We require sudo access to the machine since experiments on The Agent Company need root privileges. All our experiments are run using Ubuntu OS (>=22.04) and we provide installation instructions for the same below:

1. Pre-requisites:

Docker
Conda
OS-specific dependencies:
- Ubuntu: build-essential => sudo apt-get install build-essential

Make sure you have all these dependencies installed before moving on to next steps.

2. Build and Setup The Environment

We recommend creation of a conda environment for installing dependencies as shown below:

# Install Python=3.12, nodejs>=22.x, and poetry
conda create -n oh_versa python=3.12
conda activate oh_versa
conda install -c conda-forge "nodejs>=22"
conda install conda-forge::poetry

From the root directory of the project, run the below command to ensure OpenHands-Versa is ready to run on your system:

make build

3. Configuring the Language Model

OpenHands-Versa supports a diverse array of Language Models (LMs) through the powerful litellm library. We use claude-3-7-sonnet-20250219 and claude-sonnet-4-20250514 for our experiments. You can configure the LLMs by creating a config.toml file in the project root directory similar to config_example.toml.

For details regarding support for other operating systems, support for other LLMs and debugging tips please refer to Development.md.

Reproducing Results

We benchmark OpenHands-Versa on three popular and challenging agent benchmarks: GAIA, The Agent Company, and SWE-Bench Multimodal. For instructions about reproducing our results, please refer to the respective README.md files for GAIA, The Agent Company and SWE-Bench Multimodal. Note that we use Tavily API for our search tool and running the experiments requires a search API key.

Note

The methodology in OpenHands-Versa has also been implemented in upstream OpenHands, and we recommend using the upstream repository if you want to use OpenHands Versa in your own work.

📜 License

Distributed under the MIT License. See LICENSE for more information.

📚 Cite

@misc{soni2025codingagentsmultimodalbrowsing,
      title={Coding Agents with Multimodal Browsing are Generalist Problem Solvers},
      author={Aditya Bharat Soni and Boxuan Li and Xingyao Wang and Valerie Chen and Graham Neubig},
      year={2025},
      eprint={2506.03011},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.03011},
}

Name		Name	Last commit message	Last commit date
Latest commit History 3,854 Commits
.github		.github
.openhands/microagents		.openhands/microagents
containers		containers
dev_config/python		dev_config/python
docs		docs
evaluation		evaluation
frontend		frontend
microagents		microagents
openhands		openhands
tests		tests
.dockerignore		.dockerignore
.gitattributes		.gitattributes
.gitignore		.gitignore
.nvmrc		.nvmrc
CITATION.cff		CITATION.cff
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
COMMUNITY.md		COMMUNITY.md
CONTRIBUTING.md		CONTRIBUTING.md
CREDITS.md		CREDITS.md
Development.md		Development.md
ISSUE_TRIAGE.md		ISSUE_TRIAGE.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
build.sh		build.sh
config.template.toml		config.template.toml
config_example.toml		config_example.toml
docker-compose.yml		docker-compose.yml
poetry.lock		poetry.lock
pydoc-markdown.yml		pydoc-markdown.yml
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

Repository files navigation

OpenHands-Versa: Coding Agents with Multimodal Browsing are Generalist Problem Solvers

Overview

Installation and LLM Configuration

1. Pre-requisites:

2. Build and Setup The Environment

3. Configuring the Language Model

Reproducing Results

Note

📜 License

📚 Cite

About

Uh oh!

Releases

Packages

Contributors 268

Uh oh!

Languages

Uh oh!

License

Uh oh!

adityasoni9998/OpenHands-Versa

Folders and files

Latest commit

History

Repository files navigation

OpenHands-Versa: Coding Agents with Multimodal Browsing are Generalist Problem Solvers

Overview

Installation and LLM Configuration

1. Pre-requisites:

2. Build and Setup The Environment

3. Configuring the Language Model

Reproducing Results

Note

📜 License

📚 Cite

About

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 268

Uh oh!

Languages

Packages