
Web Crawler

A minimal, real-time CLI that searches the internet for you. Enter a query and get search results as JSON (title, url, published_date), sorted by recency.
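
For example, a query like "AAPL latest earnings transcript" might print results shaped like the following (illustrative placeholder values, not real output):

[
  {
    "title": "Apple reports quarterly earnings",
    "url": "https://example.com/apple-earnings",
    "published_date": "2025-08-01"
  }
]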


Setup

Prerequisites: Python 3.12+ and uv

# Clone the repository
git clone https://github.com/financial-datasets/web-crawler.git

# Navigate into the project root
cd web-crawler

How to Run

# From the repo root, run:
uv run web-crawler
  • When prompted, enter your search (e.g., "AAPL latest earnings transcript").
  • Results print as JSON. Enter another query to continue.
  • Quit with q, quit, exit, or press Ctrl+C.

Features

We currently have two features:

  1. Search (see /src/search)
  2. Parse (see /src/parse)

Search

Given a query (e.g. "Apple latest earnings"), Search finds pages on the internet related to it.
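
As a rough sketch of the result shaping (an assumption, not the actual code in /src/search), each result can be a dict with title, url, and published_date, sorted newest-first before being printed as JSON:

import json

# Placeholder results standing in for whatever the search backend returns;
# the real /src/search code may use different sources and field names.
results = [
    {"title": "Older article", "url": "https://example.com/old", "published_date": "2025-07-01"},
    {"title": "Newer article", "url": "https://example.com/new", "published_date": "2025-08-20"},
]

# Sort newest-first; ISO 8601 date strings sort correctly as plain strings.
results.sort(key=lambda r: r["published_date"], reverse=True)

print(json.dumps(results, indent=2))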

Parse

Given a URL, Parse fetches the page and extracts its text content.
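
A minimal sketch of one common approach (assuming requests and beautifulsoup4 are available; the actual implementation in /src/parse may differ):

import requests
from bs4 import BeautifulSoup

def parse_text(url: str) -> str:
    # Fetch the page and reduce it to readable text.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Drop script and style tags so only visible content remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

print(parse_text("https://example.com"))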

Roadmap

We'd love to get help on:

  • Parsing content from JavaScript-heavy pages (e.g. MSN, Bloomberg)
  • Summarizing parsed content with LLMs
  • Adding more sources (Reddit, Bloomberg, etc.)
  • Parallelization for faster queries (one possible approach is sketched below)
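
For the parallelization item, one possible approach (an assumption, not how the project currently works) is to fetch pages concurrently with a thread pool, since the work is I/O-bound:

from concurrent.futures import ThreadPoolExecutor

import requests

def fetch(url: str) -> str:
    # Hypothetical worker: download one page's HTML; parsing happens downstream.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text

# Placeholder URLs; in practice these would come from the search results.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() runs fetch() for each URL in its own thread and preserves input order.
    pages = list(pool.map(fetch, urls))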
