Closed
Labels
upstream bug (bug outside this package)
Description of the bug
Hey, thank you so much for this amazing tool!
I am using PyMuPDF to parse many official French documents. Each contains a cover, a table of contents, and pages of scanned content. The vast majority are read without problems, but in a small number of them a line break is inserted between each letter of the content, making the extracted text almost unreadable.
Here are links to a few documents where this happens:
- https://www.loire-atlantique.gouv.fr/contenu/telechargement/57967/423894/file/RAA n°056 du 3 avril 2023.pdf
- https://www.loire-atlantique.gouv.fr/contenu/telechargement/58441/427324/file/RAA n°78 du 28 avril 2023.pdf
- https://www.loire-atlantique.gouv.fr/contenu/telechargement/58439/427314/file/RAA n°77 du 28 avril 2023.pdf
How to reproduce the bug
For instance, here is an example with the third document listed above (RAA n°77):
>>> import pymupdf
>>> f = "2023-04-28-ee04e9ccb016e7806a7cf92a48155834.pdf"
>>> doc = pymupdf.Document(f)
>>> doc[0].get_text("blocks")
[
(164.6999969482422, 377.63739013671875, 436.3139953613281, 394.6753845214844, 'R\nE\nC\nU\nE\nI\nL\n \nD\nE\nS\n \nA\nC\nT\nE\nS\n \nA\nD\nMI\nN\nI\nS\nT\nR\nA\nT\nI\nF\nS\n', 0, 0),
(225.0, 531.0374145507812, 376.00396728515625, 548.0614013671875, 'n\n°\n \n7\n7\n \nd\nu\n \n2\n8\n \na\nv\nr\ni\nl\n \n2\n0\n2\n3\n', 1, 0)
]
>>> pymupdf.version
('1.24.7', '1.24.4', '20240626000001')
And here is its first page as I see it:
Please let me know if I can provide any further information!
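Until the underlying cause is addressed, the fragmented text can be repaired after extraction. The following is only a minimal sketch of such a workaround, not a fix for the bug itself. The function names `collapse_fragmented_text` and `repair_blocks` and the thresholds `max_fragment_len` and `min_lines` are hypothetical, chosen for illustration. It assumes blocks are 7-tuples shaped like the `get_text("blocks")` output shown above, and that a block is "fragmented" when every one of its lines holds at most two characters.

```python
def collapse_fragmented_text(text, max_fragment_len=2, min_lines=4):
    """Join a block's lines when every line is a tiny fragment.

    Heuristic only: a block is treated as fragmented when it has at
    least `min_lines` lines and none is longer than `max_fragment_len`
    characters. Any other text is returned unchanged.
    """
    lines = text.split("\n")
    # Block text ends with a newline, which leaves an empty last item.
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) < min_lines:
        return text
    if all(len(line) <= max_fragment_len for line in lines):
        return "".join(lines)
    return text


def repair_blocks(blocks):
    """Apply the heuristic to (x0, y0, x1, y1, text, block_no, block_type) tuples."""
    return [
        (x0, y0, x1, y1, collapse_fragmented_text(text), block_no, block_type)
        for x0, y0, x1, y1, text, block_no, block_type in blocks
    ]
```

Applied to the two blocks shown above, this heuristic would yield "RECUEIL DES ACTES ADMINISTRATIFS" and "n° 77 du 28 avril 2023". It can misfire on genuinely vertical text or on short lists of one-letter items, so the thresholds may need tuning per document set.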
PS: Is there any "debugging tool" that would allow you to view text and content blocks as they're seen by PyMuPDF for easier analysis?
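One low-tech way to inspect the blocks, sketched here under assumptions rather than as an official PyMuPDF debugging tool, is to turn the block tuples into an SVG overlay that can be opened in a browser. The helper `blocks_to_svg` and its parameters are hypothetical. It assumes the 7-tuple layout shown in the output above, with coordinates measured from the top-left corner of the page, and it uses only the standard library.

```python
from xml.sax.saxutils import escape


def blocks_to_svg(blocks, page_width, page_height):
    """Render block rectangles and a short text label for each as SVG.

    `blocks` is assumed to hold tuples of the form
    (x0, y0, x1, y1, text, block_no, block_type).
    """
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{page_width}" height="{page_height}" '
        f'viewBox="0 0 {page_width} {page_height}">'
    ]
    for x0, y0, x1, y1, text, block_no, block_type in blocks:
        # Type 0 blocks (the text blocks shown above) in red, others in blue.
        color = "red" if block_type == 0 else "blue"
        parts.append(
            f'<rect x="{x0:.2f}" y="{y0:.2f}" '
            f'width="{x1 - x0:.2f}" height="{y1 - y0:.2f}" '
            f'fill="none" stroke="{color}"/>'
        )
        # repr() keeps the embedded newlines visible as \n in the label.
        label = escape(repr(text[:40]))
        parts.append(
            f'<text x="{x0:.2f}" y="{y0 - 2:.2f}" font-size="6" '
            f'fill="{color}">#{block_no}: {label}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

For the session above, something like `blocks_to_svg(doc[0].get_text("blocks"), doc[0].rect.width, doc[0].rect.height)` written to a `.svg` file should show where each block sits on the page and what text was extracted for it.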
PyMuPDF version
1.24.7
Operating system
Linux
Python version
3.11