
Cannot get Tessdata with Tesseract-OCR 5 #3767

@rezemika


Description of the bug

The pymupdf.get_tessdata() function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).

>>> import pymupdf
>>> pymupdf.get_tessdata()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<...>/venv/lib/python3.11/site-packages/pymupdf/__init__.py", line 18082, in get_tessdata
    for sub_response in response.iterdir():
                        ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'iterdir'

>>> pymupdf.version
('1.24.9', '1.24.8', '20240724000001')

How to reproduce the bug

I haven't looked into the details yet, but I think the problem lies here:

PyMuPDF/src/__init__.py

Lines 18093 to 18099 in eca7066

# determine tessdata via iteration over subfolders
tessdata = None
for sub_response in response.iterdir():
    for sub_sub in sub_response.iterdir():
        if str(sub_sub).endswith("tessdata"):
            tessdata = sub_sub
            break

I have the latest Debian with Tesseract OCR 5.3.0, whose data is installed in /usr/share/tesseract-ocr/5/tessdata/.
The function get_tessdata() expects it in /usr/share/tesseract-ocr/4.00/tessdata; otherwise it searches for it with whereis tesseract-ocr.

However, it then calls iterdir() on the subprocess response, which is a list of bytes objects rather than a path, and that raises the error.

>>> import subprocess
>>> cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
>>> cp
CompletedProcess(args='whereis tesseract-ocr', returncode=0, stdout=b'tesseract-ocr: /usr/share/tesseract-ocr\n', stderr=b'')
>>> response = cp.stdout.strip().split()
>>> response
[b'tesseract-ocr:', b'/usr/share/tesseract-ocr']
>>> type(response), type(response[0])
(<class 'list'>, <class 'bytes'>)
>>> 
>>> response.iterdir()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'list' object has no attribute 'iterdir'
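To make the type mismatch concrete, here is a minimal sketch of turning the raw whereis output into pathlib.Path objects, which do have iterdir(). The helper name parse_whereis is hypothetical (it is not part of PyMuPDF), and the sketch assumes the reported locations contain no spaces.

```python
from pathlib import Path


def parse_whereis(stdout: bytes) -> list[Path]:
    """Turn raw `whereis` output into a list of Path objects.

    Hypothetical helper for illustration; not a PyMuPDF function.
    """
    text = stdout.decode("utf-8", errors="replace").strip()
    # Output looks like "name: /path/one /path/two"; drop the "name:" label.
    _, _, locations = text.partition(":")
    # Assumption: locations are whitespace-separated and contain no spaces.
    return [Path(p) for p in locations.split()]


# Sample taken from the CompletedProcess shown above.
sample = b"tesseract-ocr: /usr/share/tesseract-ocr\n"
print(parse_whereis(sample))
```

An empty list means whereis found nothing, so the caller can fail with a clear message instead of an AttributeError or IndexError.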

I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with tessdata, and should find it under the second element of response. So I guess something like this should work?

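import pathlib
import subprocess

cp = subprocess.run('whereis tesseract-ocr', shell=True, capture_output=True, check=False)
response = cp.stdout.strip().split()
# response[0] is the b'tesseract-ocr:' label; response[1] is the first location
response_dir = pathlib.Path(response[1].decode("utf-8"))
# response_dir == PosixPath('/usr/share/tesseract-ocr')
tessdata = None
for sub_dir in response_dir.iterdir():
    if not sub_dir.is_dir():
        continue  # plain files cannot be iterated with iterdir()
    for sub_sub_dir in sub_dir.iterdir():
        if sub_sub_dir.name.endswith("tessdata"):
            tessdata = str(sub_sub_dir)
            break
    if tessdata:
        break  # the inner break only leaves the inner loop
# tessdata == '/usr/share/tesseract-ocr/5/tessdata'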

Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now, and no longer seems to be in the Debian repos, I guess it wouldn't hurt to handle this case (unless 5.0 is unsupported)?
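For anyone hitting this in the meantime, a possible workaround is sketched below: honor TESSDATA_PREFIX if it is already set, otherwise look for a tessdata folder one level below the Tesseract share directory, whatever the version subfolder is called. The function name resolve_tessdata and the "last name in sorted order wins" rule are my own assumptions, not PyMuPDF behavior.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_tessdata(root: Path, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a tessdata folder, preferring TESSDATA_PREFIX when it is set.

    Hypothetical helper for illustration; not a PyMuPDF function.
    """
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        return prefix
    if not root.is_dir():
        return None
    # Match <root>/<version>/tessdata for any version folder ("4.00", "5", ...).
    candidates = sorted(p for p in root.glob("*/tessdata") if p.is_dir())
    # Assumption: plain string ordering is good enough to pick a folder.
    return str(candidates[-1]) if candidates else None


# The root below is the location reported by whereis in this issue.
tessdata = resolve_tessdata(Path("/usr/share/tesseract-ocr"))
if tessdata:
    os.environ["TESSDATA_PREFIX"] = tessdata
```

I am not sure whether PyMuPDF reads the variable at import time or at call time, so running this before importing pymupdf is probably the safer order. Note that string ordering would rank a folder named "10" below "5", so a real fix would need to compare versions properly.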

Thanks for developing PyMuPDF! :)

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.11


Labels: bugfix developed, release schedule to be determined
