A minimal, real-time web search CLI. Enter a query and get search results as JSON (title, url, published_date), sorted by recency.

Prerequisites: Python 3.12+ and uv

    # Clone the repository
    git clone https://github.com/financial-datasets/web-crawler.git
    # Navigate into the project root
    cd web-crawler
    # From the repo root, run
    uv run web-crawler

- When prompted, enter your search (e.g., "AAPL latest earnings transcript").
- Results print as JSON. Enter another query to continue.
- Quit with `q`, `quit`, `exit`, or press Ctrl+C.
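
The prompt loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `fake_search`, `is_quit`, `format_results`, and the `Search: ` prompt are hypothetical names, the canned results are placeholders, and it assumes `published_date` is an ISO 8601 string (or missing) and that quit commands are matched case-insensitively.

```python
import json

# Commands that end the session, as listed in the usage notes.
QUIT_COMMANDS = {"q", "quit", "exit"}


def is_quit(command: str) -> bool:
    """Return True when the input is one of the quit commands.

    Case-insensitive matching is an assumption of this sketch.
    """
    return command.strip().lower() in QUIT_COMMANDS


def format_results(results: list[dict]) -> str:
    """Sort results newest first and serialize them as JSON.

    Assumes published_date is an ISO 8601 string (YYYY-MM-DD) or None.
    ISO dates sort correctly as plain strings; undated results go last.
    """
    def sort_key(result: dict) -> tuple[bool, str]:
        published = result.get("published_date")
        return (published is not None, published or "")

    ordered = sorted(results, key=sort_key, reverse=True)
    return json.dumps(ordered, indent=2)


def fake_search(query: str) -> list[dict]:
    """Hypothetical stand-in for the real search; returns canned data."""
    return [
        {"title": "Older result", "url": "https://example.com/a",
         "published_date": "2024-01-01"},
        {"title": "Newer result", "url": "https://example.com/b",
         "published_date": "2024-05-01"},
    ]


def main() -> None:
    """Prompt for queries until a quit command, Ctrl+C, or end of input."""
    try:
        while True:
            query = input("Search: ")
            if is_quit(query):
                break
            if not query.strip():
                continue  # ignore empty input and prompt again
            print(format_results(fake_search(query)))
    except (KeyboardInterrupt, EOFError):
        print()  # finish the line cleanly after Ctrl+C or Ctrl+D
```

Calling `main()` starts the loop. Keeping the quit check and the sorting in small pure functions makes them easy to test without touching the terminal.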
We currently have two features:
- Search (see `/src/search`)
- Parse (see `/src/parse`)
Search: given a query (e.g. "Apple latest earnings"), search the internet for pages related to the query.
Parse: given a URL, extract the text content of the page at that URL.
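
To illustrate the parse step, here is a minimal sketch that uses only the Python standard library. It is not the code in `/src/parse`; `TextExtractor`, `extract_text`, and `fetch_html` are hypothetical names, and the approach (collect text nodes, skip `<script>` and `<style>`) is just one simple way to do it. It only sees static HTML, so it would not handle JavaScript-heavy pages.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0          # > 0 while inside a skipped tag
        self._chunks: list[str] = []  # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        if self._skip_depth == 0 and stripped:
            self._chunks.append(stripped)

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the text content of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it using the declared charset (or UTF-8)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

With these helpers, `extract_text(fetch_html(url))` would return the page text for a given URL.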
We'd love to get help on:
- Parsing content from JavaScript-heavy pages (e.g. MSN, Bloomberg)
- Summarizing parsed content with LLMs
- Adding more sources (Reddit, Bloomberg, etc.)
- Parallelization for faster queries