
Important

This project has moved to the new repository privateai-com/docviz and is no longer maintained. This repository is kept for historical purposes only.

πŸ“„ Page Parser

Intelligent PDF document analysis with AI-powered chart understanding


Overview β€’ Installation β€’ Quick Start β€’ Output Structure

Overview

Page Parser is a sophisticated Python pipeline designed to extract meaningful content from PDF documents. It intelligently separates and processes different document elementsβ€”from complex charts and diagrams to clean text passagesβ€”delivering structured, actionable data.

Key Capabilities

  • PDF Conversion - Configurable high-resolution PDF-to-PNG conversion
  • Smart Detection - YOLO-powered layout analysis for charts, figures, and text regions
  • AI Summarization - Intelligent chart interpretation using vision models
  • Clean Text Extraction - OCR with automatic chart masking for pristine text output
  • Multi-Provider - Flexible AI backend integration
  • Structured Output - Rich JSON export with bounding boxes and metadata

Showcase

You can see how it works step-by-step in showcase.ipynb.

Showcase images: the original page containing a chart, the chart region extracted by Page Parser, and the resulting summary generated by Gemma3.

Installation

git clone https://github.com/fresh-milkshake/page-parser/
cd page-parser
uv sync
uv run main.py

Optionally, you can install development or showcase dependencies:

uv sync --dev
uv sync --group showcase # for showcase.ipynb

Prerequisites

  • Python 3.13+
  • Tesseract OCR (for text extraction)
  • YOLO model (download from project releases)
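
Tesseract can be installed with a system package manager; for example:

# Debian/Ubuntu
sudo apt-get install tesseract-ocr

# macOS (Homebrew)
brew install tesseract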

Quick Start

Basic Usage

python main.py <pdf_path> <model_path> <output_dir> <output_json>

Command Line Options

$ python main.py --help

Usage: main.py [OPTIONS] DOCUMENT_PATH MODEL_PATH OUTPUT_DIR OUTPUT_JSON

  CLI for running the document analysis pipeline and saving results to a JSON
  file.

Options:
  --log-level [DEBUG|INFO|WARNING|ERROR]
                                  Logging level.  [default: INFO]
  --log-file FILE                 Path to log file (optional).
  --settings-file FILE            Path to settings file.
  --help                          Show this message and exit.

Example:

uv run python main.py data/2507.21509v1.pdf models/yolov12l-doclaynet.pt output output.json

Configuration

Create or modify settings.toml to customize behavior:

[vision]
provider = "openai"       # Any provider defined under vision.providers
retries = 3               # Number of retries on a failed request
timeout = 10              # Request timeout

[vision.providers.openai]
model = "gpt-4o"
base_url = "https://api.openai.com/v1"
api_key = { type = "env", name = "OPENAI_API_KEY" }

[vision.providers.ollama]
model = "llama3.2-vision"
base_url = "http://localhost:11434/v1"
api_key = { type = "inline", key = "dummy-key" }

[processing]
ocr_lang = "eng"           # Tesseract language
zoom_factor = 2            # PDF scaling factor (higher values give more detail, but are slower)

[filtration]
chart_labels = ["picture", "figure", "chart", "diagram"]

You can freely modify the settings to your needs. The main thing to know is that api_key supports three types:

  • inline - the key written directly in the settings file, for example:

    api_key = { type = "inline", key = "dummy-key" }
  • env - the key read from an environment variable, for example:

    api_key = { type = "env", name = "OPENAI_API_KEY" }
  • file - the key read from a file, for example:

    api_key = { type = "file", path = "path/to/key.txt" }
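
For instance, with the env type you export the variable before running the pipeline; the --settings-file option from the CLI help lets you point at a custom configuration file (the key value below is a placeholder):

export OPENAI_API_KEY="your-key-here"
uv run python main.py data/2507.21509v1.pdf models/yolov12l-doclaynet.pt output output.json --settings-file settings.toml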

Project Structure

src/
β”œβ”€β”€ pipeline/
β”‚   β”œβ”€β”€ document/                 # PDF processing & layout detection
β”‚   β”‚   β”œβ”€β”€ convert.py            # PDF to image conversion
β”‚   β”‚   β”œβ”€β”€ detector.py           # YOLO-based element detection
β”‚   β”‚   └── text_extraction.py    # OCR processing
β”‚   β”œβ”€β”€ image/                    # Visual processing & AI analysis
β”‚   β”‚   β”œβ”€β”€ preprocessing.py      # Image preparation
β”‚   β”‚   β”œβ”€β”€ summarizer.py         # AI-powered chart analysis
β”‚   β”‚   └── annotate.py           # Visualization utilities
β”‚   └── pipeline.py               # Main orchestration
β”œβ”€β”€ config/                       # Configuration management
└── common/                       # Shared utilities & logging

Output Format

Each processed page generates structured JSON:

{
  "page_number": 1,
  "elements": [
    {
      "type": "chart",
      "label": "figure", 
      "summary": "Bar chart showing quarterly revenue growth of 15% across Q1-Q4 2024...",
      "bbox": [100, 200, 400, 500]
    },
    {
      "type": "text",
      "text": "Clean extracted text content without visual interference...",
      "bbox": [0, 0, 1224, 1584]
    }
  ]
}

Output Structure

  • page_number: Sequential page identifier
  • elements: Array of detected document components
  • type: Element classification (chart, text, figure, etc.)
  • bbox: Precise bounding box coordinates [x, y, width, height]
  • summary: AI-generated content description (for visual elements)
  • text: Extracted textual content (for text elements)
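
A minimal sketch of consuming this output in Python, assuming the top-level JSON is a list of such per-page objects:

import json

# Load the pipeline output; assumed to be a list of per-page objects as shown above.
with open("output.json", encoding="utf-8") as f:
    pages = json.load(f)

for page in pages:
    for element in page["elements"]:
        bbox = element["bbox"]
        if "summary" in element:
            # Visual elements (charts, figures) carry an AI-generated summary.
            print(f"page {page['page_number']} {element['type']} at {bbox}: {element['summary']}")
        elif "text" in element:
            # Text elements carry the OCR-extracted content.
            print(f"page {page['page_number']} text at {bbox}: {element['text'][:60]}")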

Dependencies

Core Technologies

  • Computer Vision: OpenCV, Ultralytics YOLO
  • OCR Engine: Tesseract (via pytesseract)
  • PDF Processing: PyMuPDF
  • AI Integration: OpenAI API
  • CLI Framework: Click
  • Logging: Loguru

Model Requirements

  • YOLO v12 (Large/Medium) trained on the DocLayNet dataset
  • Download pre-trained models from project releases

Example Results

# Process a research paper
python main.py research_paper.pdf models/yolov12l-doclaynet.pt ./analysis results.json

# View processing logs
tail -f logs/$(date +%Y-%m-%d).log

# Examine results
cat results.json | jq '.[] | select(any(.elements[]; .type == "chart"))'

License

This project is licensed under the Apache License 2.0. See LICENSE for details.

Attribution Requirements

When using or redistributing this software, you must:

  • Preserve copyright notices in all copies
  • Mark modified files with prominent change notices
  • Retain attribution notices in derivative works
  • Link to original repository where possible

Further plans

  • Find a more efficient OCR model
  • Enhance preprocessing before OCR
    • Remove noise from images (page numbers, titles, etc.)
    • Add slight quality degradation for better speed
  • Unify label naming conventions
  • Enrich vision model context with table and chart captions (detection quality and consistency are currently too low)

Built by fresh-milkshake
Give this project a ⭐ if you found it useful!

⬆️ Back to top