Invalid OCGs not ignored by SVG image creation #3569

@JorjMcKie

Discussed in #3567

Originally posted by serhii-brovarnyk June 11, 2024
Hello!

I have a single-page PDF file produced by another PDF tool, and the document contains some OCGs.
Unfortunately, I cannot provide the actual file.

Getting a pixmap of the page works fine, but when I create an SVG image via the page.get_svg_image(text_as_path=False) method, the page looks completely different.

Investigating the issue, I've concluded that some of the clip-paths affect the appearance of the drawing.
The defs section has no relation to the layers or OCGs, but some of the groups look like this:

<g clip-path="url(#clip_1)">
  <g id="layer_1" data-name="SomeName">
	  <path transform="matrix(0,-.06,-.06,-0,3024,2160)"
	      d="M28564 13431V14031L27914 13431ZM28564 14031 27914 13431V14031Z"
	      fill="#7f7f7f"/>
  </g>

If I delete a certain clip-path in the defs section, more content becomes visible in the SVG image. So I suppose the SVG contains data from OCGs that should be invisible, and since nothing in the PDF manages these OCGs, I see that content whether I am supposed to or not.
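
To see which layer groups sit under which clip-path, the SVG text can be inspected with the standard library. This is only a minimal sketch: the sample SVG is a hypothetical, shortened document modeled on the fragment above (with closing tags added), and it assumes the output declares the standard SVG namespace and nests each `layer_…` group directly inside a clipped group. The function name `clipped_layers` is made up for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical, shortened SVG modeled on the fragment above (closing tags added).
SAMPLE_SVG = f"""<svg xmlns="{SVG_NS}">
  <defs>
    <clipPath id="clip_1"><path d="M0 0H10V10H0Z"/></clipPath>
  </defs>
  <g clip-path="url(#clip_1)">
    <g id="layer_1" data-name="SomeName">
      <path d="M28564 13431V14031L27914 13431Z" fill="#7f7f7f"/>
    </g>
  </g>
</svg>"""


def clipped_layers(svg_text):
    """Return (clip id, layer id, layer name) for layer groups nested in a clipped group."""
    root = ET.fromstring(svg_text)
    found = []
    for group in root.iter(f"{{{SVG_NS}}}g"):
        clip = group.get("clip-path")
        if not clip:
            continue  # group is not clipped
        # Turn "url(#clip_1)" into "clip_1"; leave other forms untouched.
        clip_id = clip[len("url(#"):-1] if clip.startswith("url(#") else clip
        for child in group.findall(f"{{{SVG_NS}}}g"):
            if child.get("id", "").startswith("layer_"):
                found.append((clip_id, child.get("id"), child.get("data-name")))
    return found


print(clipped_layers(SAMPLE_SVG))  # [('clip_1', 'layer_1', 'SomeName')]
```

Running this on the string returned by page.get_svg_image() would list the clip/layer pairs to look at first; it does not say whether a layer is meant to be visible.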

So my question is: how can I detect and delete invisible and unnecessary OCGs from my PDF document, so that the SVG image matches the pixmap I get from the pymupdf Page object?

It is important to note that the pymupdf Document object has no information about layers or OCGs.
I have tried the doc.get_layers(), doc.get_ocgs() and doc.layer_ui_configs() methods, but they all return empty results.
However, page.get_oc_items() returns the following list of OCGs:

> [('oc10', 68, 'ocg'),
>  ('oc1009', 67, 'ocg'),
>  ('oc1010', 66, 'ocg'),
>  ...
>  ('oc945', 7, 'ocg'),
>  ('oc946', 6, 'ocg'),
>  ('oc947', 5, 'ocg')]

I also used this code

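# doc is an already opened pymupdf Document
page_xref = doc.page_xref(0)  # xref of the first page object
xref_keys = doc.xref_get_keys(page_xref)  # all keys of the page dictionary
for key in xref_keys:
    print(f"KEY: {key}")
    print(doc.xref_get_key(page_xref, key))  # (type, value) tuple
    print('---------------')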

to get this information:

> KEY: Contents
> ('xref', '80 0 R')
> ---------------
> KEY: MediaBox
> ('array', '[0 0 2160 3024]')
> ---------------
> KEY: Parent
> ('xref', '82 0 R')
> ---------------
> KEY: Resources
> ('dict', '<</ExtGState<</GT255 79 0 R>>/Font<</F1 74 0 R/F2 69 0 R>>/ProcSet[/PDF/Text/ImageC]/Properties<</oc10 68 0 R/oc1009 67 0 R/oc1010 66 0 R/oc1011 65 0 R/oc1013 64 0 R/oc1014 63 0 R/oc1023 62 0 R/oc1027 61 0 R/oc16 60 0 R/oc17 59 0 R/oc19 58 0 R/oc2 57 0 R/oc3 56 0 R/oc4 55 0 R/oc5 54 0 R/oc507 53 0 R/oc6 52 0 R/oc7 51 0 R/oc8 50 0 R/oc832 49 0 R/oc833 48 0 R/oc834 47 0 R/oc835 46 0 R/oc840 45 0 R/oc842 44 0 R/oc843 43 0 R/oc844 42 0 R/oc848 41 0 R/oc850 40 0 R/oc852 39 0 R/oc853 38 0 R/oc855 37 0 R/oc856 36 0 R/oc858 35 0 R/oc861 34 0 R/oc862 33 0 R/oc863 32 0 R/oc868 31 0 R/oc869 30 0 R/oc870 29 0 R/oc875 28 0 R/oc876 27 0 R/oc877 26 0 R/oc878 25 0 R/oc880 24 0 R/oc883 23 0 R/oc884 22 0 R/oc885 21 0 R/oc898 20 0 R/oc9 19 0 R/oc909 18 0 R/oc925 17 0 R/oc926 16 0 R/oc929 15 0 R/oc931 14 0 R/oc934 13 0 R/oc935 12 0 R/oc936 11 0 R/oc937 10 0 R/oc942 9 0 R/oc943 8 0 R/oc945 7 0 R/oc946 6 0 R/oc947 5 0 R>>>>')
> ---------------
> KEY: Rotate
> ('int', '270')
> ---------------
> KEY: Type
> ('name', '/Page')
> ---------------
> KEY: VP
> ('array', '[]')
> ---------------

In conclusion, this document has some OCGs that are accessible only at the page level. I want to keep only the visible OCGs, so that the resulting SVG image looks right, and delete the rest. Can you give me some advice on how to do this?
I have read two similar discussions about OCGs, but did not find the answer there :(
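
As a cross-check of the two listings above, the `/Properties` entries in the page's `/Resources` string can be parsed and compared with the tuples from page.get_oc_items(). This is a rough sketch using only the standard library: `RESOURCES` is a shortened copy of the printed value (three entries kept), `properties_xrefs` is a hypothetical helper name, and the regular expression assumes that `/Properties` holds only indirect references and no nested dictionaries.

```python
import re

# Shortened copy of the /Resources value printed above (only three entries kept).
RESOURCES = (
    "<</ExtGState<</GT255 79 0 R>>"
    "/Properties<</oc10 68 0 R/oc1009 67 0 R/oc1010 66 0 R>>>>"
)

# First three tuples of the page.get_oc_items() output shown above.
OC_ITEMS = [("oc10", 68, "ocg"), ("oc1009", 67, "ocg"), ("oc1010", 66, "ocg")]


def properties_xrefs(resources):
    """Map each /Properties resource name to the xref it points to."""
    match = re.search(r"/Properties\s*<<(.*?)>>", resources, re.S)
    if match is None:
        return {}
    # Each entry looks like "/name <xref> <generation> R".
    pairs = re.findall(r"/(\w+)\s+(\d+)\s+\d+\s+R", match.group(1))
    return {name: int(xref) for name, xref in pairs}


from_resources = properties_xrefs(RESOURCES)
from_oc_items = {name: xref for name, xref, _kind in OC_ITEMS}
print(from_resources == from_oc_items)  # True
```

With the full string, the same comparison shows whether get_oc_items() reports exactly what the page dictionary declares. It says nothing about visibility, which would normally be governed by document-level optional content settings that appear to be missing from this file.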

Labels: upstream bug (bug outside this package)