Skip to content

Browsing AI Agent built on Chrome/Selenium + OpenAI general-purpose/reasoning LLMs: Full control via OpenAI API (no LangChain), only screenshot inputs sent to LLM (no DOM/XPath), just 1000 lines of code, easily extensible

License

vlad-perederiy/simple_browsing_agent

Repository files navigation

Simple Browsing AI Agent

Intro

This repository provides a lightweight – yet fully functional – AI-driven browsing agent for Chrome.
It opens a given website in an isolated/dedicated Chrome window and automatically navigates it, step by step, to accomplish a specified task.
At each turn (step), the AI LLM:

  • checks whether its previous action was successful
  • decides on the next atomic action (see below) to take and explains its decision with a rationale
  • optionally, explains previous failures and summarizes past or plans future actions

The agent can be applied to a wide range of tasks, including:

  • Booking preparations
  • Form filling
  • Autonomous website testing
  • Information retrieval beyond standard search engines etc

Technical Features:

  • Uses Selenium for browser visualization and action implementation
  • Uses JSON with schema for action definitions with structured OpenAI output
  • Operates with general-purpose LLMs (not with specialized “computer-use” models)
  • Employs no Langchain or similar frameworks — the code maintains full control over LLM inputs / history via direct OpenAI-API calls
  • Sends to the LLM only what a human user/surfer would see: browsing-window screenshots and window titles.
    No DOM, XPath, or similar data is sent.
  • Uses, for each turn:
    • suitable prompts (mostly static)
    • plus a fine-tuned history of the agent’s past actions
    • plus 2 screenshots per active browser tab: taken before and after the preceding action
  • Overlays the screenshots with grids / labels / mouse pointers, for better LLM interpretation
  • Compact implementation (~1000 lines), easily extensible, optimized/tested for Windows

Supported atomic actions

  • Mouse click
  • Mouse click & drag
  • Mouse wheel (vertical and horizontal scrolls)
  • Text input (typing) and, experimentally, keyboard shortcuts
  • Browser navigation (tab selection, back/forward in history)

Configurable Settings

(see the beginning of the main code file)

  • Initial URL to start with, and Task to accomplish
  • OpenAI models to use
  • Browser screen size and zoom level
  • Screenshot overlay options (grid, labels, mouse pointer)
  • Action history length/depth submitted to LLM
  • Allowed key shortcuts
  • Stopping rules (e.g., max. number of failed actions)

Remarks

  • OpenAI o3 or gpt-5 reasoning models, with low reasoning effort and activated reasoning summary, typically perform best.
  • Each turn consumes ~2000-5000 input tokens (mostly cached from turn to turn), and ~200-500 output tokens.
  • The general-purpose LLMs excel at intent understanding (e.g. "to do X, I need to click Y"),
    but may occasionally struggle with pinpointing small UI elements (e.g., calendar date pickers).
    → Be forgiving! The agent usually resolves such issues by the second or third automatic attempt.

Example

Settings:

SITE    = "https://perederiy-consulting.de/"  
TASK    = """You are on the website of a freelancer.  
Navigate to the research section.
Look there for any papers related to the SABR model. 
Identify the most recent such paper.  
Output the first paragraph of its abstract.
"""

Output:

output example

About

Browsing AI Agent built on Chrome/Selenium + OpenAI general-purpose/reasoning LLMs: Full control via OpenAI API (no LangChain), only screenshot inputs sent to LLM (no DOM/XPath), just 1000 lines of code, easily extensible

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages