Description of the bug
In more recent versions of PyMuPDF, the page content stream can contain floating point numbers written in scientific notation (invalid in PDF, as far as I can tell?).
For example, these are generated in 1.24.1 (and 1.23):
$ mutool show /tmp/out-my1.24.1.pdf 13
13 0 obj
<<
/Length 54
/Filter /FlateDecode
>>
stream
q
255.36 0 0 328.8 7.62939e-06 0 cm
/fzImg0 Do
Q
endstream
endobj
This is what the same contents sections look like in 1.21.0:
$ mutool show /tmp/out-my1.21.0.pdf 13
13 0 obj
<<
/Length 58
/Filter /FlateDecode
>>
stream
q
255.35999 0 0 328.8 .0000076293949 0 cm
/fzImg0 Do
Q
endstream
endobj
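To illustrate the difference between the two outputs, here is a minimal Python sketch. It is not how PyMuPDF or MuPDF actually formats numbers (I have not checked that code); `pdf_real` is a hypothetical helper and its `digits` parameter is an assumption. It only shows that a general-purpose `%g`-style format switches to an exponent for small values, while a fixed-point format does not.

```python
def pdf_real(value: float, digits: int = 10) -> str:
    """Format a float in fixed-point notation, never with an exponent.

    Hypothetical helper for illustration only; `digits` is an assumed
    number of decimal places, not something taken from PyMuPDF.
    """
    text = f"{value:.{digits}f}"
    if "." in text:
        # Drop trailing zeros and a dangling decimal point.
        text = text.rstrip("0").rstrip(".")
    if text in ("-0", ""):
        text = "0"
    return text


small = 7.62939e-06

# General format: switches to exponent notation for small magnitudes.
print(f"{small:g}")

# Fixed-point format: stays a plain decimal number.
print(pdf_real(small))
```

The first `print` produces the same kind of token seen in the 1.24.1 stream; the second stays in the plain decimal form that the 1.21.0 stream uses.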
How to reproduce the bug
Apologies up front for not being able to give a simple Python script to reproduce the issue. The issue is 100% reproducible, but reproducing it requires my archive-pdf-tools (https://github.com/internetarchive/archive-pdf-tools / https://pypi.org/project/archive-pdf-tools/) tooling. I spent a bit of time trying to make a simple proof of concept but gave up and decided to just file the issue first.
I hope the description in this issue is enough to make someone go 'aha!'.
After installing archive-pdf-tools, this command can be used to generate an MRC-compressed PDF (input files here: https://wizzup.org/dirlist/pymupdf/):
recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression jbig2 --hocr-file image00008.hocr -I image00008.jpg
Or, to do it without jbig2 installed:
recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression ccitt --hocr-file image00008.hocr -I image00008.jpg
Once the PDF is created, observe that with 1.24 (or 1.23) it's broken:
$ pdfimages -list /tmp/out.pdf
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Error (18381): Unknown operator 'e-06'
Syntax Error (18392): Too few (1) args to 'cm' operator
Or open in mupdf/evince (this will show an empty page).
The PDF will render OK in mupdf/evince when made with 1.21.
Surprisingly, PDF.js (built-in Firefox PDF renderer) renders both OK.
I also added the two generated PDFs here: https://wizzup.org/dirlist/pymupdf/
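In case it helps with triage, here is a rough sketch of how one could scan a decompressed content stream for exponent-style numbers. `find_exponent_numbers` is a hypothetical helper, and the whitespace-based tokenising is a simplifying assumption: it ignores string literals and inline image data, so it is only a quick check and not a real content stream parser.

```python
import re

# A numeric token with an exponent part, e.g. 7.62939e-06 or 1E5.
_EXPONENT_NUMBER = re.compile(rb"[+-]?(?:\d+\.?\d*|\.\d+)[eE][+-]?\d+")


def find_exponent_numbers(stream: bytes) -> list[bytes]:
    """Return whitespace-separated tokens that look like exponent-notation numbers.

    Hypothetical helper; splitting on whitespace is a simplification that
    does not handle string literals or inline images.
    """
    return [tok for tok in stream.split() if _EXPONENT_NUMBER.fullmatch(tok)]


# The two content streams shown in this report.
stream_1_24 = b"q\n255.36 0 0 328.8 7.62939e-06 0 cm\n/fzImg0 Do\nQ\n"
stream_1_21 = b"q\n255.35999 0 0 328.8 .0000076293949 0 cm\n/fzImg0 Do\nQ\n"

print(find_exponent_numbers(stream_1_24))
print(find_exponent_numbers(stream_1_21))
```

With PyMuPDF, the input bytes could presumably come from `page.read_contents()`; I have not verified this against every affected version.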
PyMuPDF version
1.24.1
Operating system
Linux
Python version
3.11