Description
Description of the bug
The pymupdf.get_tessdata() function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).
>>> import pymupdf
>>> pymupdf.get_tessdata()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<...>/venv/lib/python3.11/site-packages/pymupdf/__init__.py", line 18082, in get_tessdata
    for sub_response in response.iterdir():
                        ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'iterdir'
>>> pymupdf.version
('1.24.9', '1.24.8', '20240724000001')
How to reproduce the bug
I haven't looked into the details yet, but I think the problem lies here:
Lines 18093 to 18099 in eca7066
# determine tessdata via iteration over subfolders
tessdata = None
for sub_response in response.iterdir():
    for sub_sub in sub_response.iterdir():
        if str(sub_sub).endswith("tessdata"):
            tessdata = sub_sub
            break
I have the latest Debian with Tesseract OCR 5.3.0, whose data is installed in /usr/share/tesseract-ocr/5/tessdata/.
The function get_tessdata() expects it in /usr/share/tesseract-ocr/4.00/tessdata; otherwise it searches for it with whereis tesseract-ocr.
However, it then calls iterdir() on the subprocess response, even though that response is a list of bytes objects, which raises the error.
>>> import subprocess
>>> cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
>>> cp
CompletedProcess(args='whereis tesseract-ocr', returncode=0, stdout=b'tesseract-ocr: /usr/share/tesseract-ocr\n', stderr=b'')
>>> response = cp.stdout.strip().split()
>>> response
[b'tesseract-ocr:', b'/usr/share/tesseract-ocr']
>>> type(response), type(response[0])
(<class 'list'>, <class 'bytes'>)
>>>
>>> response.iterdir()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'list' object has no attribute 'iterdir'
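To make the type mismatch concrete, here is a minimal sketch of how the raw whereis output could be turned into pathlib.Path objects before any iterdir() call. The helper name parse_whereis_output is hypothetical (it is not part of PyMuPDF), and I'm assuming the output has the "name: path ..." shape shown above.

```python
import pathlib


def parse_whereis_output(stdout: bytes) -> list[pathlib.Path]:
    """Turn raw `whereis` output into Path objects (hypothetical helper)."""
    tokens = stdout.decode("utf-8").strip().split()
    # The first token is the "name:" label; the rest are locations, if any.
    return [pathlib.Path(token) for token in tokens[1:]]


# Sample output copied from the session above.
paths = parse_whereis_output(b"tesseract-ocr: /usr/share/tesseract-ocr\n")
print(paths)
```

An empty list then simply means that whereis found nothing, which seems easier to handle than an AttributeError.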
I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with tessdata, and should find it under the second element of response. So I guess something like this should work?
import pathlib
import subprocess

cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
response = cp.stdout.strip().split()
# response[0] is the b'tesseract-ocr:' label, response[1] the first location
response_dir = pathlib.Path(response[1].decode("utf-8"))
# response_dir == PosixPath('/usr/share/tesseract-ocr')
tessdata = None
for sub_dir in response_dir.iterdir():
    if not sub_dir.is_dir():
        continue  # iterdir() would fail on a plain file
    for sub_sub_dir in sub_dir.iterdir():
        if sub_sub_dir.name.endswith("tessdata"):
            tessdata = str(sub_sub_dir)
            break
# tessdata == '/usr/share/tesseract-ocr/5/tessdata'
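The same two-level search could also be wrapped in a small function that returns as soon as it finds a match and returns None otherwise. This is only a sketch: find_tessdata is a hypothetical name, not PyMuPDF's API, and I'm assuming that the first match in sorted order is acceptable.

```python
import pathlib
from typing import Optional


def find_tessdata(root: pathlib.Path) -> Optional[str]:
    """Return the first sub-sub-folder of root whose name ends with 'tessdata'."""
    if not root.is_dir():
        return None
    for sub_dir in sorted(root.iterdir()):
        if not sub_dir.is_dir():
            continue  # skip plain files; iterdir() would fail on them
        for sub_sub_dir in sorted(sub_dir.iterdir()):
            if sub_sub_dir.is_dir() and sub_sub_dir.name.endswith("tessdata"):
                return str(sub_sub_dir)
    return None
```

Returning from inside the loops avoids the situation where break only leaves the inner loop and the outer loop keeps scanning.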
Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now and no longer seems to be in the Debian repos, I guess it wouldn't hurt to handle this case (unless 5.0 is unsupported)?
Thanks for developing PyMuPDF! :)
PyMuPDF version
1.24.9
Operating system
Linux
Python version
3.11