This repository provides a lightweight yet fully functional AI-driven browsing agent for Chrome.
It opens a given website in an isolated/dedicated Chrome window and automatically navigates it,
step by step, to accomplish a specified task.
At each turn (step), the LLM:
- checks whether its previous action was successful
- decides on the next atomic action (see below) to take and explains its decision with a rationale
- optionally, explains previous failures, summarizes past actions, or plans future ones
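The turn loop above can be sketched as follows. This is a minimal illustration under assumed names (`Decision`, `run_agent`, the callback signatures); the repository's actual functions and fields differ:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """One turn's structured LLM output (hypothetical field names)."""
    previous_action_succeeded: bool
    next_action: dict      # atomic action, e.g. {"action": "click", "x": 120, "y": 340}
    rationale: str         # why this action was chosen
    notes: str = ""        # optional: failure analysis, summary, or plan

def run_agent(llm, browser, task, max_failures=3):
    """Drive the browser one atomic action per turn until done or too many failures."""
    history, failures = [], 0
    while failures < max_failures:
        decision = llm(task, history, browser.screenshots())
        if not decision.previous_action_succeeded:
            failures += 1
        if decision.next_action.get("action") == "done":
            break
        browser.perform(decision.next_action)
        history.append(decision)
    return history
```

Here `llm` is any callable returning a `Decision` and `browser` any object exposing `screenshots()` and `perform()`; the stopping rule mirrors the "max. number of failed actions" setting described below.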
The agent can be applied to a wide range of tasks, including:
- Booking preparations
- Form filling
- Autonomous website testing
- Information retrieval beyond standard search engines, etc.
- Uses Selenium to display the browser window and execute actions
- Uses JSON with schema for action definitions with structured OpenAI output
- Operates with general-purpose LLMs (not with specialized “computer-use” models)
- Employs no Langchain or similar frameworks — the code maintains full control over LLM inputs / history via direct OpenAI-API calls
- Sends to the LLM only what a human user/surfer would see: browsing-window screenshots and window titles. No DOM, XPath, or similar data is sent.
- Uses, for each turn:
- suitable prompts (mostly static)
- plus a fine-tuned history of the agent’s past actions
- plus 2 screenshots per active browser tab: taken before and after the preceding action
- Overlays the screenshots with grids / labels / mouse pointers, for better LLM interpretation
- Compact implementation (~1000 lines), easily extensible, optimized/tested for Windows
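A JSON schema for structured action output, as mentioned above, might look roughly like this. The field names and action vocabulary here are assumptions for illustration, not the repo's actual schema:

```python
import json

# Hypothetical JSON schema for one atomic action. OpenAI strict structured
# output requires every property to be listed in "required"; optional fields
# are therefore made nullable instead of omitted.
ACTION_SCHEMA = {
    "name": "browser_action",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "previous_action_succeeded": {"type": "boolean"},
            "rationale": {"type": "string"},
            "action": {
                "type": "string",
                "enum": ["click", "drag", "scroll", "type_text", "key_shortcut",
                         "select_tab", "history_back", "history_forward", "done"],
            },
            "x": {"type": ["integer", "null"]},
            "y": {"type": ["integer", "null"]},
            "text": {"type": ["string", "null"]},
        },
        "required": ["previous_action_succeeded", "rationale", "action",
                     "x", "y", "text"],
        "additionalProperties": False,
    },
}

# Passed to the Chat Completions API roughly as:
# client.chat.completions.create(
#     model=MODEL,
#     messages=messages,
#     response_format={"type": "json_schema", "json_schema": ACTION_SCHEMA},
# )
```

Constraining the output this way is what lets the agent parse every turn's decision without free-text scraping.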
Supported atomic actions:
- Mouse click
- Mouse click & drag
- Mouse wheel (vertical and horizontal scrolls)
- Text input (typing) and, experimentally, keyboard shortcuts
- Browser navigation (tab selection, back/forward in history)
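One way to organize the mapping from these actions to browser calls is a small dispatch table. This sketch uses a generic browser wrapper (all names hypothetical) so the structure is visible without a live Selenium session; in the real agent the wrapper methods would issue the corresponding Selenium calls:

```python
def make_dispatcher(browser):
    """Map each atomic action name to a callable on a browser wrapper.

    `browser` is any object exposing click/drag/scroll/... methods; in the
    actual agent these would wrap Selenium mouse, keyboard, and navigation calls.
    """
    return {
        "click":           lambda a: browser.click(a["x"], a["y"]),
        "drag":            lambda a: browser.drag(a["x"], a["y"], a["x2"], a["y2"]),
        "scroll":          lambda a: browser.scroll(a.get("dx", 0), a.get("dy", 0)),
        "type_text":       lambda a: browser.type_text(a["text"]),
        "key_shortcut":    lambda a: browser.keys(a["keys"]),
        "select_tab":      lambda a: browser.select_tab(a["index"]),
        "history_back":    lambda a: browser.back(),
        "history_forward": lambda a: browser.forward(),
    }

def perform(dispatcher, action):
    """Execute one action dict produced by the LLM."""
    dispatcher[action["action"]](action)
```

Keeping the vocabulary in one table makes it easy to extend the agent with a new atomic action: add a schema enum entry and one dispatcher line.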
Configurable parameters (see the beginning of the main code file):
- Initial URL to start with, and Task to accomplish
- OpenAI models to use
- Browser screen size and zoom level
- Screenshot overlay options (grid, labels, mouse pointer)
- Action history length/depth submitted to LLM
- Allowed key shortcuts
- Stopping rules (e.g., max. number of failed actions)
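Taken together, the settings above might look like this in code. Names and values are illustrative assumptions only; the actual file defines its own:

```python
# Illustrative configuration constants mirroring the list above.
MODEL = "gpt-5"                 # OpenAI reasoning model
REASONING_EFFORT = "low"
WINDOW_SIZE = (1280, 960)       # browser screen size in pixels
ZOOM_LEVEL = 1.0
OVERLAY_GRID = True             # screenshot overlay options
OVERLAY_LABELS = True
OVERLAY_POINTER = True
HISTORY_DEPTH = 6               # past turns submitted to the LLM
ALLOWED_SHORTCUTS = ["CTRL+F", "CTRL+A", "ENTER", "ESC"]
MAX_FAILED_ACTIONS = 5          # stopping rule
```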
- OpenAI o3 or gpt-5 reasoning models, with low reasoning effort and reasoning summaries enabled, typically perform best.
- Each turn consumes ~2,000-5,000 input tokens (mostly cached from turn to turn) and ~200-500 output tokens.
- General-purpose LLMs excel at intent understanding (e.g., "to do X, I need to click Y"),
  but may occasionally struggle to pinpoint small UI elements (e.g., calendar date pickers).
  → Be forgiving! The agent usually resolves such issues by the second or third automatic attempt.
SITE = "https://perederiy-consulting.de/"
TASK = """You are on the website of a freelancer.
Navigate to the research section.
Look there for any papers related to the SABR model.
Identify the most recent such paper.
Output the first paragraph of its abstract.
"""