Skip to content

Code for the paper "Coding Agents with Multimodal Browsing are Generalist Problem Solvers"

License

adityasoni9998/OpenHands-Versa

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

OpenHands-Versa: Coding Agents with Multimodal Browsing are Generalist Problem Solvers

This repository is a reference implementation of the paper Coding Agents with Multimodal Browsing are Generalist Problem Solvers containing scripts for reproducing the experiments in the paper.

🏆 OpenHands-Versa currently ranks #1 on SWE-Bench Multimodal and The Agent Company leaderboards.

Overview

Modern human labor is characterized by specialization; we train for years and develop particular tools that allow us to perform well across a variety of tasks. In addition, AI agents have been specialized for domains such as software engineering, web navigation, and workflow automation. However, this results in agents that are good for one thing but fail to generalize beyond their intended scope. One reason for this is that agent developers provide a highly specialized set of tools or make architectural decisions optimized for a specific use case or benchmark. In this work, we ask the question: what is the minimal set of general tools that can be used to achieve high performance across a diverse set of tasks? Our answer is OpenHands-Versa, a generalist agent built with a modest number of general tools: code editing and execution, web search, as well as multimodal web browsing and file access. Importantly, OpenHands-Versa demonstrates superior or competitive performance over leading specialized agents across three diverse and challenging benchmarks: SWE-Bench Multimodal, GAIA, and The Agent Company, outperforming the best-performing previously published results with absolute improvements in success rate of 9.1, 1.3, and 9.1 points respectively. Further, we show how existing state-of-the-art multi-agent systems fail to generalize beyond their target domains. These results demonstrate the feasibility of developing a generalist agent to solve diverse tasks and establish OpenHands-Versa as a strong baseline for future research.

Installation and LLM Configuration

OpenHands-Versa is built on top the OpenHands - a popular framework for open-source AI Agents and the installation instructions are similar as that of OpenHands. We require sudo access to the machine since experiments on The Agent Company need root privileges. All our experiments are run using Ubuntu OS (>=22.04) and we provide installation instructions for the same below:

1. Pre-requisites:

  • Docker
  • Conda
  • OS-specific dependencies:
    • Ubuntu: build-essential => sudo apt-get install build-essential

Make sure you have all these dependencies installed before moving on to next steps.

2. Build and Setup The Environment

We recommend creation of a conda environment for installing dependencies as shown below:

# Install Python=3.12, nodejs>=22.x, and poetry
conda create -n oh_versa python=3.12
conda activate oh_versa
conda install -c conda-forge "nodejs>=22"
conda install conda-forge::poetry

From the root directory of the project, run the below command to ensure OpenHands-Versa is ready to run on your system:

make build

3. Configuring the Language Model

OpenHands-Versa supports a diverse array of Language Models (LMs) through the powerful litellm library. We use claude-3-7-sonnet-20250219 and claude-sonnet-4-20250514 for our experiments. You can configure the LLMs by creating a config.toml file in the project root directory similar to config_example.toml.

For details regarding support for other operating systems, support for other LLMs and debugging tips please refer to Development.md.

Reproducing Results

We benchmark OpenHands-Versa on three popular and challenging agent benchmarks: GAIA, The Agent Company, and SWE-Bench Multimodal. For instructions about reproducing our results, please refer to the respective README.md files for GAIA, The Agent Company and SWE-Bench Multimodal. Note that we use Tavily API for our search tool and running the experiments requires a search API key.

Note

The methodology in OpenHands-Versa has also been implemented in upstream OpenHands, and we recommend using the upstream repository if you want to use OpenHands Versa in your own work.

📜 License

Distributed under the MIT License. See LICENSE for more information.

📚 Cite

@misc{soni2025codingagentsmultimodalbrowsing,
      title={Coding Agents with Multimodal Browsing are Generalist Problem Solvers},
      author={Aditya Bharat Soni and Boxuan Li and Xingyao Wang and Valerie Chen and Graham Neubig},
      year={2025},
      eprint={2506.03011},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.03011},
}

About

Code for the paper "Coding Agents with Multimodal Browsing are Generalist Problem Solvers"

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 268