Closed
Labels
fix developed (release schedule to be determined), upstream bug (bug outside this package)
Description of the bug
Document.select() does not work correctly for certain PDF files.
I want to extract text from PDF files. If a PDF has more than 30 pages, I extract only the first 30.
The attached PDF file has 33 pages, so the code should select the first 30 pages and extract their text.
Instead, it extracts only some bullets and dashes, and I cannot figure out why.
The same code works correctly on other PDF files.
946f8445-6373-4f32-994c-04c495e2e7e9.pdf
Here is my code.
import os
import pathlib

import fitz  # PyMuPDF


def get_all_page_from_pdf(document, last_page=None):
    # Keep only the first `last_page` pages, if a limit was given.
    if last_page:
        document.select(list(range(0, last_page)))
    # Never process more than the first 30 pages.
    if document.page_count > 30:
        document.select(list(range(0, 30)))
    return iter(page for page in document)


path = "path to the pdf file"
filename = os.path.basename(path)
file_type = pathlib.Path(filename).suffix

# Read the file into memory and open it from the byte stream.
with open(path, "rb") as read_file:
    file_data = read_file.read()

doc = fitz.open(filename=filename, stream=file_data, filetype=file_type)
for i, page in enumerate(get_all_page_from_pdf(doc)):
    text = page.get_text()
    print(i, text)
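For comparison, here is a minimal sketch that reads the first pages by index instead of calling Document.select(), so the document itself is not modified. The names `first_pages` and `limit` are hypothetical and only for illustration. The helper assumes nothing beyond a `page_count` attribute and a `load_page()` method, both of which PyMuPDF documents provide. Whether this avoids the problem with the attached file has not been confirmed.

```python
from typing import Any, Iterator


def first_pages(document: Any, limit: int = 30) -> Iterator[Any]:
    """Yield at most `limit` pages by index, leaving the document unmodified."""
    count = min(limit, document.page_count)
    for index in range(count):
        yield document.load_page(index)


# Intended use with PyMuPDF (shown as a comment, not executed here):
#     doc = fitz.open(stream=file_data, filetype="pdf")
#     for i, page in enumerate(first_pages(doc)):
#         print(i, page.get_text())
```

If the text comes out correctly this way, that would suggest the page selection step, rather than text extraction, is where things go wrong.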
How to reproduce the bug
Run the script above against the attached PDF file.
PyMuPDF version
1.24.7
Operating system
Linux
Python version
3.10