Contents stream contains floats in scientific notation #3381

@MerlijnWajer

Description of the bug

In more recent versions of PyMuPDF, the contents stream can contain floating point numbers in scientific notation (invalid for PDF?).

For example, these are generated in 1.24.1 (and 1.23):

$ mutool show /tmp/out-my1.24.1.pdf 13
13 0 obj
<<
  /Length 54
  /Filter /FlateDecode
>>
stream

q
255.36 0 0 328.8 7.62939e-06 0 cm
/fzImg0 Do
Q
endstream
endobj

This is what the same contents stream looks like in 1.21.0:

$ mutool show /tmp/out-my1.21.0.pdf 13
13 0 obj
<<
  /Length 58
  /Filter /FlateDecode
>>
stream

q
255.35999 0 0 328.8 .0000076293949 0 cm
/fzImg0 Do
Q
endstream
endobj
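
To illustrate the difference between the two streams above, here is a minimal Python sketch. It is not MuPDF's or PyMuPDF's actual serialization code. It only shows that a `%g`-style format reproduces the 1.24.1 numbers, while a fixed-point format avoids the exponent. The helper name `pdf_real` and the choice of 10 decimal places are assumptions made for this example.

```python
def pdf_real(value: float, places: int = 10) -> str:
    """Format a float as a plain decimal string without an exponent.

    Hypothetical helper for illustration; `places` is an arbitrary choice.
    """
    # Fixed-point formatting never emits an exponent for ordinary values.
    text = format(value, f".{places}f")
    # Trim trailing zeros, then a dangling decimal point.
    text = text.rstrip("0").rstrip(".")
    # Tiny negative values can round to "-0"; normalize that to "0".
    return "0" if text in ("", "-0") else text


matrix = [255.36, 0, 0, 328.8, 7.62939e-06, 0]

# %g switches to scientific notation for small magnitudes.
g_style = " ".join("%g" % v for v in matrix) + " cm"

# Fixed-point output stays in plain decimal notation.
fixed_style = " ".join(pdf_real(v) for v in matrix) + " cm"

print(g_style)
print(fixed_style)
```

The first printed line matches the `cm` line from the 1.24.1 stream, which hints (but does not prove) that a `%g`-like conversion is involved somewhere in the newer versions.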

How to reproduce the bug

Apologies up front for not being able to give a simple Python script to reproduce the issue. The issue is 100% reproducible, but reproducing it requires my archive-pdf-tools tooling (https://github.com/internetarchive/archive-pdf-tools / https://pypi.org/project/archive-pdf-tools/). I spent a bit of time trying to make a simple proof of concept but gave up and decided to just file the issue first.

I hope the description in this issue is enough to make someone go 'aha!'.

After installing archive-pdf-tools, this command can be used to generate an MRC-compressed PDF (input files here: https://wizzup.org/dirlist/pymupdf/):

recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression jbig2 --hocr-file image00008.hocr -I image00008.jpg

or to do it without jbig2 installed:

recode_pdf -o /tmp/out.pdf -m 2 --bg-downsample 2 --dpi 600 --mask-compression ccitt --hocr-file image00008.hocr -I image00008.jpg

Once the PDF is created, observe that with 1.24 (or 1.23) it's broken:

$ pdfimages -list /tmp/out.pdf 
page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
Syntax Error (18381): Unknown operator 'e-06'
Syntax Error (18392): Too few (1) args to 'cm' operator

Or open it in mupdf/evince (both will show an empty page).

The PDF will render OK in mupdf/evince when made with 1.21.

Surprisingly, PDF.js (built-in Firefox PDF renderer) renders both OK.
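
The `Unknown operator 'e-06'` error suggests the parser reads `7.62939` as a number and then treats the rest of the token as an operator. As a rough way to check a decoded contents stream for such numbers, here is a small standard-library sketch. The function name `find_scientific_numbers` is made up for this example, and the whitespace tokenization is deliberately naive: it ignores string literals, comments, and inline image data, so treat it as a quick diagnostic rather than a real PDF parser. How the decoded bytes are obtained is left out here; the sample data is copied from the `mutool show` output above.

```python
import re

# A whole token that looks like a number with an exponent, e.g. 7.62939e-06.
SCI_NUMBER = re.compile(rb"^[+-]?(?:\d+\.?\d*|\.\d+)[eE][+-]?\d+$")


def find_scientific_numbers(stream: bytes) -> list[bytes]:
    """Return whitespace-delimited tokens written in scientific notation."""
    return [tok for tok in stream.split() if SCI_NUMBER.match(tok)]


# Decoded contents streams as shown by `mutool show` in this report.
stream_1_24 = b"q\n255.36 0 0 328.8 7.62939e-06 0 cm\n/fzImg0 Do\nQ\n"
stream_1_21 = b"q\n255.35999 0 0 328.8 .0000076293949 0 cm\n/fzImg0 Do\nQ\n"

print(find_scientific_numbers(stream_1_24))
print(find_scientific_numbers(stream_1_21))
```

For the two sample streams, only the 1.24.1 one yields a match.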

I also added the two generated PDFs here: https://wizzup.org/dirlist/pymupdf/

PyMuPDF version

1.24.1

Operating system

Linux

Python version

3.11
