Important
This project has moved to the new repository privateai-com/docviz and is no longer maintained. This repository is kept for historical purposes only.
Intelligent PDF document analysis with AI-powered chart understanding
Overview • Installation • Quick Start • Output Structure
Page Parser is a Python pipeline designed to extract meaningful content from PDF documents. It intelligently separates and processes different document elements, from complex charts and diagrams to clean text passages, delivering structured, actionable data.
- PDF Conversion - configurable high-resolution PDF-to-PNG conversion
- Smart Detection - YOLO-powered layout analysis for charts, figures, and text regions
- AI Summarization - Intelligent chart interpretation using vision models
- Clean Text Extraction - OCR with automatic chart masking for pristine text output
- Multi-Provider - Flexible AI backend integration
- Structured Output - Rich JSON export with bounding boxes and metadata
You can see how it works step-by-step in showcase.ipynb.
git clone https://github.com/fresh-milkshake/page-parser/
cd page-parser
uv sync
uv run main.py
Optionally, you can install development or showcase dependencies:
uv sync --dev
uv sync --showcase # for showcase.ipynb
- Python 3.13+
- Tesseract OCR (for text extraction)
- YOLO model (download from project releases)
python main.py <pdf_path> <model_path> <output_dir> <output_json>
$ python main.py --help
Usage: main.py [OPTIONS] DOCUMENT_PATH MODEL_PATH OUTPUT_DIR OUTPUT_JSON
CLI for running the document analysis pipeline and saving results to a JSON
file.
Options:
--log-level [DEBUG|INFO|WARNING|ERROR]
Logging level. [default: INFO]
--log-file FILE Path to log file (optional).
--settings-file FILE Path to settings file.
--help Show this message and exit.
Example:
uv run python main.py data/2507.21509v1.pdf models/yolov12l-doclaynet.pt output output.json
Create or modify settings.toml to customize behavior:
[vision]
provider = "openai" # Any provider defined under [vision.providers]
retries = 3 # Number of retries on failure
timeout = 10 # Request timeout in seconds
[vision.providers.openai]
model = "gpt-4o"
base_url = "https://api.openai.com/v1"
api_key = { type = "env", name = "OPENAI_API_KEY" }
[vision.providers.ollama]
model = "llama3.2-vision"
base_url = "http://localhost:11434/v1"
api_key = { type = "inline", key = "dummy-key" }
[processing]
ocr_lang = "eng" # Tesseract language
zoom_factor = 2 # PDF scaling factor (higher values give more detail, but are slower)
[filtration]
chart_labels = ["picture", "figure", "chart", "diagram"]
You can freely modify the settings to your needs. The main thing to know is that `api_key` supports three types:

- `inline` - the key is given directly, e.g. `api_key = { type = "inline", key = "dummy-key" }`
- `env` - the key is read from an environment variable, e.g. `api_key = { type = "env", name = "OPENAI_API_KEY" }`
- `file` - the key is read from a file, e.g. `api_key = { type = "file", path = "path/to/key.txt" }`
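The three `api_key` types can be resolved with a small dispatch function. This is a minimal sketch, not the project's internal implementation; the function name `resolve_api_key` is an assumption for illustration.

```python
import os


def resolve_api_key(cfg: dict) -> str:
    """Resolve an api_key table from settings.toml into a key string."""
    kind = cfg["type"]
    if kind == "inline":
        return cfg["key"]  # key stored directly in the config
    if kind == "env":
        return os.environ[cfg["name"]]  # key read from the environment
    if kind == "file":
        with open(cfg["path"], encoding="utf-8") as f:
            return f.read().strip()  # key read from a file on disk
    raise ValueError(f"unknown api_key type: {kind!r}")
```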
src/
├── pipeline/
│   ├── document/              # PDF processing & layout detection
│   │   ├── convert.py         # PDF to image conversion
│   │   ├── detector.py        # YOLO-based element detection
│   │   └── text_extraction.py # OCR processing
│   ├── image/                 # Visual processing & AI analysis
│   │   ├── preprocessing.py   # Image preparation
│   │   ├── summarizer.py      # AI-powered chart analysis
│   │   └── annotate.py        # Visualization utilities
│   └── pipeline.py            # Main orchestration
├── config/                    # Configuration management
└── common/                    # Shared utilities & logging
Each processed page generates structured JSON:
{
"page_number": 1,
"elements": [
{
"type": "chart",
"label": "figure",
"summary": "Bar chart showing quarterly revenue growth of 15% across Q1-Q4 2024...",
"bbox": [100, 200, 400, 500]
},
{
"type": "text",
"text": "Clean extracted text content without visual interference...",
"bbox": [0, 0, 1224, 1584]
}
]
}
- `page_number`: Sequential page identifier
- `elements`: Array of detected document components
- `type`: Element classification (`chart`, `text`, `figure`, etc.)
- `bbox`: Bounding box coordinates `[x, y, width, height]`
- `summary`: AI-generated content description (for visual elements)
- `text`: Extracted textual content (for text elements)
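With that schema, post-processing the results file is straightforward. The snippet below is a small example, assuming the output JSON is a list of page objects as shown above; the function name `chart_summaries` is hypothetical.

```python
import json


def chart_summaries(path: str) -> list[str]:
    """Load a results file and collect the AI summaries of all chart elements."""
    with open(path, encoding="utf-8") as f:
        pages = json.load(f)
    return [
        el["summary"]
        for page in pages
        for el in page["elements"]
        if el.get("type") == "chart"  # skip text elements, which have no summary
    ]
```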
- Computer Vision: OpenCV, Ultralytics YOLO
- OCR Engine: Tesseract (via pytesseract)
- PDF Processing: PyMuPDF
- AI Integration: OpenAI API
- CLI Framework: Click
- Logging: Loguru
- YOLO v12 (Large/Medium) trained on DocLayNet dataset
- Download pre-trained models from project releases
# Process a research paper
python main.py research_paper.pdf models/yolov12l-doclaynet.pt ./analysis results.json
# View processing logs
tail -f logs/$(date +%Y-%m-%d).log
# Examine results
cat results.json | jq '.[] | select(.elements[].type == "chart")'
This project is licensed under the Apache License 2.0. See LICENSE for details.
When using or redistributing this software, you must:
- Preserve copyright notices in all copies
- Mark modified files with prominent change notices
- Retain attribution notices in derivative works
- Link to original repository where possible
- Find a more efficient OCR model
- Enhance preprocessing before OCR
- Remove noise from images (page numbers, titles, etc.)
- Add slight quality degradation for better speed
- Unify label naming conventions
- Enrich vision model context with table and chart captions (too low detection quality and bad consistency)
Built by fresh-milkshake
Give this project a ⭐ if you found it useful!
⬆️ Back to top