PDF Text Annotation Extractor with Tesseract

Description

This Python script is based on Allen AI's PAWLS project and performs OCR on PDF files to extract the text annotations on each page. It uses pytesseract, a Python wrapper for the Tesseract OCR engine. The output is designed to be consistent across documents, which makes it suitable for machine learning training data, analytics, and other applications that require standardized, high-quality text extraction.
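As a rough illustration of the page-by-page OCR flow described above, here is a minimal sketch. It is not this project's actual implementation: the names `words_from_ocr` and `ocr_pdf` are hypothetical, and it assumes pages are rendered with pdf2image's `convert_from_path` and read with pytesseract's `image_to_data` in `Output.DICT` mode. Coordinates stay in image pixels here.

```python
from typing import Any


def words_from_ocr(data: dict[str, list[Any]], min_conf: float = 0.0) -> list[dict[str, Any]]:
    """Turn Tesseract's column-oriented output into one dict per recognised word.

    `data` is shaped like pytesseract.image_to_data(..., output_type=Output.DICT):
    parallel lists keyed by "text", "conf", "left", "top", "width" and "height".
    """
    words = []
    for i, raw_text in enumerate(data["text"]):
        text = str(raw_text).strip()
        conf = float(data["conf"][i])
        # Tesseract reports conf == -1 for layout rows (blocks, lines) that hold no word.
        if not text or conf < min_conf:
            continue
        words.append(
            {
                "text": text,
                "conf": conf,
                "left": data["left"][i],
                "top": data["top"][i],
                "width": data["width"][i],
                "height": data["height"][i],
            }
        )
    return words


def ocr_pdf(pdf_path: str, dpi: int = 300) -> list[list[dict[str, Any]]]:
    """Hypothetical driver: render every page to an image, then OCR it."""
    # Imported lazily so words_from_ocr can be used without the OCR stack installed.
    import pytesseract
    from pdf2image import convert_from_path

    pages = []
    for image in convert_from_path(pdf_path, dpi=dpi):
        data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
        pages.append(words_from_ocr(data))
    return pages
```

Raising `min_conf` drops low-confidence words, which can help when the output feeds a training set.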

While the PAWLS project offers a powerful preprocessor for handling PDFs, it is not actively maintained. This separate project was started to fill that gap and to keep providing a reliable text extraction solution. Future updates are planned to add preprocessors based on other underlying PDF engines.

Requirements

Python 3.9
pytesseract
pdf2image
pandas

Installation

Install the required Python packages using pip (from repo root):

cd pdfpreprocessor
pip install .

Development

Run the linter (after installing dependencies):

hatch run lint:fmt

Usage

To run the script, simply call the process_tesseract function, passing in the PDF file's path as an argument:

```python
from preprocessors.tesseract import process_tesseract

annotations = process_tesseract("path/to/your/pdf/file.pdf")
```
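The shape of the returned `annotations` is not documented in this README. Assuming it mirrors the PAWLS page-and-token layout (a list of pages, each with a `"page"` dict and a `"tokens"` list whose items carry `"text"` plus position fields), the text of each page could be rebuilt along these lines. The helper name `page_texts` is hypothetical.

```python
from typing import Any


def page_texts(annotations: list[dict[str, Any]]) -> list[str]:
    """Join each page's token texts in order, giving one string per page.

    Assumes a PAWLS-style structure; adjust the keys if the real output differs.
    """
    return [
        " ".join(token["text"] for token in page.get("tokens", []))
        for page in annotations
    ]


# Hand-written sample in the assumed layout, not real OCR output.
sample_annotations = [
    {
        "page": {"width": 612, "height": 792, "index": 0},
        "tokens": [
            {"x": 72, "y": 72, "width": 30, "height": 10, "text": "Hello"},
            {"x": 106, "y": 72, "width": 32, "height": 10, "text": "world"},
        ],
    }
]
print(page_texts(sample_annotations))
```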

Testing

Unit tests should cover each function the script exposes, including process_tesseract. Tests help ensure that the OCR extraction and scaling logic work as expected.
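A test for the scaling logic might look like the sketch below. The functions `scale_factors` and `scale_box` are hypothetical stand-ins, not this package's real API; they assume scaling means mapping pixel boxes from the rendered page image onto the PDF page size. The test functions use plain `assert`, so they should run under pytest if that is the runner in use.

```python
import math


def scale_factors(pdf_size, image_size):
    """Return (x, y) factors that map image pixels to PDF page units."""
    pdf_w, pdf_h = pdf_size
    img_w, img_h = image_size
    return pdf_w / img_w, pdf_h / img_h


def scale_box(box, factors):
    """Scale a (left, top, width, height) pixel box into PDF page units."""
    left, top, width, height = box
    sx, sy = factors
    return (left * sx, top * sy, width * sx, height * sy)


def test_scale_factors_letter_page_at_300_dpi():
    # US Letter is 612 x 792 points; rendered at 300 DPI it is 2550 x 3300 pixels.
    sx, sy = scale_factors((612, 792), (2550, 3300))
    assert math.isclose(sx, 72 / 300)
    assert math.isclose(sy, 72 / 300)


def test_scale_box_maps_full_image_to_full_page():
    factors = scale_factors((612, 792), (2550, 3300))
    box = scale_box((0, 0, 2550, 3300), factors)
    assert all(math.isclose(a, b) for a, b in zip(box, (0, 0, 612, 792)))
```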

Contributing

Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.

License

This project is licensed under the Apache License 2.0. See the LICENSE.txt file for details.
