Scrag

Adaptive, multi‑strategy web scraper that extracts clean text and metadata for RAG pipelines and local LLM workflows.



About

This project proposes a robust web scraper designed for maximum adaptability. It automatically adjusts to various website structures by employing a multi‑strategy extraction approach. The scraper attempts to extract clean, structured text and metadata (including title, author, and date) using methods such as newspaper3k, readability‑lxml, and BeautifulSoup heuristics. In cases where these methods are insufficient, it can optionally fall back to headless browser rendering to capture content from more complex, dynamically loaded websites. The output is specifically formatted for integration into Retrieval‑Augmented Generation (RAG) pipelines or for use with local Large Language Models (LLMs).

An ambitious, optional extension to this project is the Universal RAG Builder. This layer would automatically identify and scrape top‑ranked websites relevant to a user's query, then build a RAG index from the collected data. It addresses a key limitation of local LLMs (their inability to browse the internet) by providing automated knowledge aggregation and up‑to‑date information retrieval without manual data collection. The project's user interface will initially be a Command Line Interface (CLI), with a web‑based version planned for users who prefer a visual workflow.


Features

  • Multi‑strategy extraction: newspaper3k, readability‑lxml, and BeautifulSoup‑based heuristics
  • JavaScript support: Optional Selenium and Playwright extractors for dynamic content ⚡
  • Metadata capture: title, author, and date when available
  • RAG pipeline integration: Complete embedding, indexing, and retrieval system 🧠
  • Flexible configuration: YAML-based configuration with environment support
  • CLI first: Simple commands to extract, process, and query content
  • Graceful fallbacks: Automatic fallback between extraction strategies (see the sketch after this list)
  • Performance optimized: Async extraction and configurable timeouts
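A minimal sketch of how such a fallback cascade can work is shown below. It is illustrative only and does not reflect scrag's internal API; it assumes requests, newspaper3k, readability-lxml, and beautifulsoup4 are installed, and MIN_LENGTH simply mirrors the CLI's --min-length option.

# Illustrative strategy cascade -- not scrag's internal API
import requests
from bs4 import BeautifulSoup
from newspaper import Article
from readability import Document

MIN_LENGTH = 200  # assumed threshold, mirrors the CLI's --min-length option

def extract(url: str) -> str | None:
    html = requests.get(url, timeout=10).text

    # Strategy 1: newspaper3k
    try:
        article = Article(url)
        article.download(input_html=html)
        article.parse()
        if len(article.text) >= MIN_LENGTH:
            return article.text
    except Exception:
        pass

    # Strategy 2: readability-lxml
    try:
        summary_html = Document(html).summary()
        text = BeautifulSoup(summary_html, "html.parser").get_text(" ", strip=True)
        if len(text) >= MIN_LENGTH:
            return text
    except Exception:
        pass

    # Strategy 3: BeautifulSoup heuristics (plain paragraph harvesting)
    soup = BeautifulSoup(html, "html.parser")
    text = "\n".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return text if len(text) >= MIN_LENGTH else None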

Web Rendering Capabilities

For JavaScript-heavy pages that require browser automation:

# Install with web rendering support
pip install 'scrag[web-render]'

# Extract from single-page applications
scrag extract https://spa-example.com --selenium --browser chrome

# Use Playwright for modern web apps
scrag extract https://dynamic-site.com --playwright --browser chromium

# List available extractors
scrag extractors

Supported browsers:

  • Selenium: Chrome, Firefox
  • Playwright: Chromium, Firefox, WebKit

⚠️ Note: Web rendering extractors have a heavy footprint and are not recommended for CI environments. See Web Rendering Guide for details.
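For programmatic rendering outside the CLI, the headless step can be sketched with Playwright's public sync API as below; the render_page helper is illustrative and not part of scrag.

# Minimal headless rendering sketch using Playwright's sync API
from playwright.sync_api import sync_playwright

def render_page(url: str, timeout_ms: int = 30_000) -> str:
    """Return fully rendered HTML for a JavaScript-heavy page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle", timeout=timeout_ms)
        html = page.content()
        browser.close()
    return html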


Project Structure

scrag/
├── src/scrag/                 # Main source code
│   ├── extractors/           # Content extraction strategies
│   ├── processors/           # Text processing and cleaning
│   ├── storage/              # Storage backends and adapters
│   ├── rag/                  # RAG pipeline components
│   ├── cli/                  # Command-line interface
│   ├── web/                  # Web interface (planned)
│   └── utils/                # Utility functions
├── tests/                    # Comprehensive test suite
│   ├── unit/                 # Unit tests
│   ├── integration/          # Integration tests
│   ├── performance/          # Performance benchmarks
│   └── fixtures/             # Test data and mocks
├── docs/                     # Documentation
│   ├── api/                  # API reference
│   ├── guides/               # User guides
│   └── tutorials/            # Step-by-step tutorials
├── research/                 # Research & architectural decisions
│   ├── spikes/               # Discovery and research spikes
│   ├── decisions/            # Architecture Decision Records (ADRs)
│   └── epics/                # Epic planning and documentation
├── config/                   # Configuration files
│   ├── extractors/           # Extractor configurations
│   └── rag/                  # RAG pipeline configurations
├── deployment/               # Deployment configurations
│   ├── docker/               # Docker configurations
│   ├── kubernetes/           # Kubernetes manifests
│   └── aws/                  # AWS deployment files
├── scripts/                  # Development and build scripts
└── ARCHITECTURE.md           # Detailed architecture documentation

For detailed architecture information, see ARCHITECTURE.md.


Roadmap

  • Universal RAG Builder: auto‑discover top results for a query, scrape them, and build a ready‑to‑use RAG index.
  • Web UI: a lightweight interface for users who prefer a visual workflow.
  • Export adapters: convenient formats for popular vector DBs and RAG frameworks.

Current Release (v1.0)

  • Multi-strategy extraction: newspaper3k, readability-lxml, and BeautifulSoup-based heuristics
  • Metadata capture: title, author, and date when available
  • CLI interface with configuration management
  • Extensible pipeline architecture

Universal RAG Builder (v2.0) - In Planning

The work is split into two focused EPICs to keep implementation manageable:

EPIC 1: Web Crawling & Discovery System

  • Discovery Query: Convert user intent to search strategies
  • Search Integration: Web search APIs, RSS feeds, sitemap discovery
  • Intelligent Fetching: Rate limiting, retry logic, content deduplication (see the sketch after this list)
  • CrawlManager: Robust error handling and job orchestration
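Since this EPIC is still in planning, the sketch below only illustrates one possible shape for the fetching behaviour listed above (rate limiting, retries, and content deduplication); none of the names are final.

# Hypothetical fetcher sketch -- rate limiting, retries, deduplication
import hashlib
import time
import requests

def fetch_all(urls, delay=1.0, retries=3):
    seen_hashes = set()
    results = {}
    for url in urls:
        for attempt in range(retries):
            try:
                resp = requests.get(url, timeout=10)
                resp.raise_for_status()
                digest = hashlib.sha256(resp.content).hexdigest()
                if digest not in seen_hashes:  # skip exact-duplicate content
                    seen_hashes.add(digest)
                    results[url] = resp.text
                break
            except requests.RequestException:
                time.sleep(2 ** attempt)  # exponential backoff before retrying
        time.sleep(delay)  # simple rate limiting between requests
    return results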

EPIC 2: RAG Index Construction

  • Multi-model Embeddings: Support for various embedding models
  • Flexible Storage: Abstract IndexStore interface for different backends
  • Optimized Retrieval: Fast semantic search and query capabilities
  • Content Chunking: RAG-optimized text segmentation
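As a rough illustration of RAG-optimized segmentation, a word-based chunker with overlap might look like the following; chunk size, overlap, and the function name are assumptions rather than the final design.

# Hypothetical chunker: fixed-size word windows with overlap
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks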

Documentation & Research

  • /research/ directory: Maintains architectural decisions, spikes, and institutional knowledge
  • Spike-driven development: Thorough research before implementation
  • Comprehensive issue templates: Clear Definition of Done for all tasks

Quick Start

# 1) Fork and clone
# Click Fork on GitHub, then:
git clone https://github.com/ACM-VIT/scrag.git
cd scrag

# 2) Create a branch
git checkout -b feat/your-feature

# 3) Install dependencies
uv sync
uv pip install -e src/scrag

> **Note:** This project uses `uv` as the canonical dependency manager. Dependencies are defined in `src/scrag/pyproject.toml` and managed via `uv.lock`. Do not use `pip install -r requirements.txt`, as the root `requirements.txt` has been removed to avoid conflicts.

# 4) Verify the CLI
uv run scrag info

Usage

Run the Typer-powered CLI after syncing dependencies (as shown in Quick Start).

# Extract a single page using the default strategy cascade
uv run scrag extract https://example.com/article

# Choose a custom output location and persist as plain text
uv run scrag extract https://example.com/article --output data/custom --format txt

# Relax the minimum content length requirement for sparse pages
uv run scrag extract https://example.com/article --min-length 50

Contributing

We welcome contributions of all kinds! Please read our Contributing Guidelines to get started quickly and make your PRs count.


Hacktoberfest


Join us for Hacktoberfest! Quality > quantity.

  • Aim for meaningful, well‑scoped PR/MRs that solve real issues.
  • Non‑code contributions (docs, design, tutorials) are welcome via PR.
  • Full participation details: https://hacktoberfest.com/participation

Submitting a Pull Request

  1. Fork the repository (top‑right on GitHub)
  2. Clone your fork locally:
    git clone <HTTPS-ADDRESS>
    cd <NAME-OF-REPO>
  3. Create a new branch:
    git checkout -b <your-branch-name>
  4. Make your changes and stage them:
    git add .
  5. Commit your changes:
    git commit -m "feat: your message"
  6. Push to your fork:
    git push origin <your-branch-name>
  7. Open a Pull Request and clearly describe what you changed and why. Link related issues (e.g., “Fixes #123”).

Guidelines for Pull Requests

  • Avoid PRs that are automated/scripted or plagiarized from someone else’s work.
  • Don’t spam; keep each PR focused and meaningful.
  • The project maintainer’s decision on PR validity is final.
  • For more, see our Contributing Guidelines and the Hacktoberfest participation rules.

Authors



Community & Conduct

By participating in this project, you agree to abide by our Code of Conduct.


Made with ❤️ by ACM‑VIT
