This repository provides a lightweight yet fully functional AI-driven browsing agent for Chrome.
It opens a given website in an isolated/dedicated Chrome window and automatically navigates it,
step by step, to accomplish a specified task.
At each turn (step), the LLM:
- checks whether its previous action was successful
- decides on the next atomic action (see below) to take and explains its decision with a rationale
- optionally, explains previous failures, summarizes past actions, or plans future ones
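The turn loop above can be sketched as follows. This is a minimal illustration under assumed names (`Decision`, `run_agent`, the callback signatures); the repository's actual functions and fields differ:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """One turn's structured LLM output (hypothetical field names)."""
    previous_action_succeeded: bool
    next_action: dict      # atomic action, e.g. {"action": "click", "x": 120, "y": 340}
    rationale: str         # why this action was chosen
    notes: str = ""        # optional: failure analysis, summary, or plan

def run_agent(llm, browser, task, max_failures=3):
    """Drive the browser one atomic action per turn until done or too many failures."""
    history, failures = [], 0
    while failures < max_failures:
        decision = llm(task, history, browser.screenshots())
        if not decision.previous_action_succeeded:
            failures += 1
        if decision.next_action.get("action") == "done":
            break
        browser.perform(decision.next_action)
        history.append(decision)
    return history
```

Here `llm` is any callable returning a `Decision` and `browser` any object exposing `screenshots()` and `perform()`; the stopping rule mirrors the "max. number of failed actions" setting described below.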
The agent can be applied to a wide range of tasks, including:
- Booking preparations
- Form filling
- Autonomous website testing
- Information retrieval beyond standard search engines, etc.
- Uses Selenium to display the browser window and execute actions
- Uses JSON with schema for action definitions with structured OpenAI output
- Operates with general-purpose LLMs (not with specialized “computer-use” models)
- Employs no Langchain or similar frameworks — the code maintains full control over LLM inputs / history via direct OpenAI-API calls
- Sends to the LLM only what a human user/surfer would see: browsing-window screenshots and window titles. No DOM, XPath, or similar data is sent.
- Uses, for each turn:
- suitable prompts (mostly static)
- plus a fine-tuned history of the agent’s past actions
- plus 2 screenshots per active browser tab: taken before and after the preceding action
- Overlays the screenshots with grids / labels / mouse pointers, for better LLM interpretation
- Compact implementation (~1000 lines), easily extensible, optimized/tested for Windows
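A JSON schema for structured action output, as mentioned above, might look roughly like this. The field names and action vocabulary here are assumptions for illustration, not the repo's actual schema:

```python
import json

# Hypothetical JSON schema for one atomic action. OpenAI strict structured
# output requires every property to be listed in "required"; optional fields
# are therefore made nullable instead of omitted.
ACTION_SCHEMA = {
    "name": "browser_action",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "previous_action_succeeded": {"type": "boolean"},
            "rationale": {"type": "string"},
            "action": {
                "type": "string",
                "enum": ["click", "drag", "scroll", "type_text", "key_shortcut",
                         "select_tab", "history_back", "history_forward", "done"],
            },
            "x": {"type": ["integer", "null"]},
            "y": {"type": ["integer", "null"]},
            "text": {"type": ["string", "null"]},
        },
        "required": ["previous_action_succeeded", "rationale", "action",
                     "x", "y", "text"],
        "additionalProperties": False,
    },
}

# Passed to the Chat Completions API roughly as:
# client.chat.completions.create(
#     model=MODEL,
#     messages=messages,
#     response_format={"type": "json_schema", "json_schema": ACTION_SCHEMA},
# )
```

Constraining the output this way is what lets the agent parse every turn's decision without free-text scraping.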
Supported atomic actions:
- Mouse click
- Mouse click & drag
- Mouse wheel (vertical and horizontal scrolls)
- Text input (typing) and, experimentally, keyboard shortcuts
- Browser navigation (tab selection, back/forward in history)
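One way to organize the mapping from these actions to browser calls is a small dispatch table. This sketch uses a generic browser wrapper (all names hypothetical) so the structure is visible without a live Selenium session; in the real agent the wrapper methods would issue the corresponding Selenium calls:

```python
def make_dispatcher(browser):
    """Map each atomic action name to a callable on a browser wrapper.

    `browser` is any object exposing click/drag/scroll/... methods; in the
    actual agent these would wrap Selenium mouse, keyboard, and navigation calls.
    """
    return {
        "click":           lambda a: browser.click(a["x"], a["y"]),
        "drag":            lambda a: browser.drag(a["x"], a["y"], a["x2"], a["y2"]),
        "scroll":          lambda a: browser.scroll(a.get("dx", 0), a.get("dy", 0)),
        "type_text":       lambda a: browser.type_text(a["text"]),
        "key_shortcut":    lambda a: browser.keys(a["keys"]),
        "select_tab":      lambda a: browser.select_tab(a["index"]),
        "history_back":    lambda a: browser.back(),
        "history_forward": lambda a: browser.forward(),
    }

def perform(dispatcher, action):
    """Execute one action dict produced by the LLM."""
    dispatcher[action["action"]](action)
```

Keeping the vocabulary in one table makes it easy to extend the agent with a new atomic action: add a schema enum entry and one dispatcher line.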
Configurable parameters (see the beginning of the main code file):
- Initial URL to start with, and Task to accomplish
- OpenAI models to use
- Browser screen size and zoom level
- Screenshot overlay options (grid, labels, mouse pointer)
- Action history length/depth submitted to LLM
- Allowed key shortcuts
- Stopping rules (e.g., max. number of failed actions)
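Taken together, the settings above might look like this in code. Names and values are illustrative assumptions only; the actual file defines its own:

```python
# Illustrative configuration constants mirroring the list above.
MODEL = "gpt-5"                 # OpenAI reasoning model
REASONING_EFFORT = "low"
WINDOW_SIZE = (1280, 960)       # browser screen size in pixels
ZOOM_LEVEL = 1.0
OVERLAY_GRID = True             # screenshot overlay options
OVERLAY_LABELS = True
OVERLAY_POINTER = True
HISTORY_DEPTH = 6               # past turns submitted to the LLM
ALLOWED_SHORTCUTS = ["CTRL+F", "CTRL+A", "ENTER", "ESC"]
MAX_FAILED_ACTIONS = 5          # stopping rule
```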
- OpenAI o3 or gpt-5 reasoning models, with low reasoning effort and reasoning summaries enabled, typically perform best.
- Each turn consumes ~2,000-5,000 input tokens (mostly cached from turn to turn) and ~200-500 output tokens.
- General-purpose LLMs excel at intent understanding (e.g., "to do X, I need to click Y"),
  but may occasionally struggle to pinpoint small UI elements (e.g., calendar date pickers).
  → Be forgiving! The agent usually resolves such issues by the second or third automatic attempt.
SITE = "https://perederiy-consulting.de/"
TASK = """You are on the website of a freelancer.
Navigate to the research section.
Look there for any papers related to the SABR model.
Identify the most recent such paper.
Output the first paragraph of its abstract.
"""