Cannot get Tessdata with Tesseract-OCR 5

### Description of the bug

The `pymupdf.get_tessdata()` function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).

```python
>>> import pymupdf
>>> pymupdf.get_tessdata()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<...>/venv/lib/python3.11/site-packages/pymupdf/__init__.py", line 18082, in get_tessdata
    for sub_response in response.iterdir():
                        ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'iterdir'

>>> pymupdf.version
('1.24.9', '1.24.8', '20240724000001')
```


### How to reproduce the bug

I haven't looked into the details yet, but I think the problem lies here: https://github.com/pymupdf/PyMuPDF/blob/eca70661ae29a75aa4150a4a77f9b8d4e81979cc/src/__init__.py#L18093-L18099

I have the latest Debian with Tesseract OCR 5.3.0; its language data is installed in `/usr/share/tesseract-ocr/5/tessdata/`.
The function `get_tessdata()` expects it in `/usr/share/tesseract-ocr/4.00/tessdata`; otherwise it searches for it with `whereis tesseract-ocr`.

However, it then calls `iterdir()` on the subprocess response, which is a list of `bytes` objects rather than a `pathlib.Path`, and that raises the error.

```python
>>> import subprocess
>>> cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
>>> cp
CompletedProcess(args='whereis tesseract-ocr', returncode=0, stdout=b'tesseract-ocr: /usr/share/tesseract-ocr\n', stderr=b'')
>>> response = cp.stdout.strip().split()
>>> response
[b'tesseract-ocr:', b'/usr/share/tesseract-ocr']
>>> type(response), type(response[0])
(<class 'list'>, <class 'bytes'>)
>>> 
>>> response.iterdir()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'list' object has no attribute 'iterdir'
```
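To make the shape of that output explicit, here is a minimal sketch of turning raw `whereis` output into `pathlib.Path` objects. The helper name `parse_whereis` is hypothetical and not part of PyMuPDF; it assumes the usual `<name>: <path> <path> ...` layout shown above and that no path contains whitespace.

```python
from pathlib import Path


def parse_whereis(stdout: bytes) -> list[Path]:
    """Turn raw `whereis` output into a list of paths.

    `whereis` prints "<name>: <path> <path> ...", so the first
    whitespace-separated token is a label, not a location.
    """
    tokens = stdout.decode("utf-8").split()
    # Drop the leading "<name>:" label; an empty list means nothing was found.
    return [Path(token) for token in tokens[1:]]


# Output captured in the session above.
paths = parse_whereis(b"tesseract-ocr: /usr/share/tesseract-ocr\n")
print(paths)
```

Returning a list (possibly empty) rather than indexing `response[1]` directly also covers the case where `whereis` finds nothing and prints only the label.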

I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with `tessdata`, and should find it under the second element of `response`. So I guess something like this should work?

```python
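import pathlib
import subprocess

cp = subprocess.run("whereis tesseract-ocr", shell=True, capture_output=True, check=False)
response = cp.stdout.strip().split()
# response[0] is the b"tesseract-ocr:" label; response[1] is the install root
response_dir = pathlib.Path(response[1].decode("utf-8"))
# response_dir == PosixPath('/usr/share/tesseract-ocr')
tessdata = None
for sub_dir in response_dir.iterdir():
    if not sub_dir.is_dir():
        continue  # iterdir() on a regular file would raise NotADirectoryError
    for sub_sub_dir in sub_dir.iterdir():
        if sub_sub_dir.name.endswith("tessdata"):
            tessdata = str(sub_sub_dir)
            break
    if tessdata:
        break  # stop the outer loop too once a match is found
# tessdata == '/usr/share/tesseract-ocr/5/tessdata'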
```

Yeah, I know I should set the `TESSDATA_PREFIX` environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now, and no longer seems to be in the Debian repos, I guess it wouldn't harm to handle this case (unless the 5.0 is unsupported)?
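In the meantime, a user-side workaround could look like the sketch below. It assumes, as mentioned above, that setting `TESSDATA_PREFIX` is the expected way to tell PyMuPDF where the language data lives. The helper `locate_tessdata` is a hypothetical name, and the `<root>/<version>/tessdata` layout is taken from my Debian install, so it may differ on other systems.

```python
import os
from pathlib import Path
from typing import Optional


def locate_tessdata(root: Path) -> Optional[Path]:
    """Return the first `<root>/<version>/tessdata` directory, if any."""
    if not root.is_dir():
        return None
    # sorted() makes the choice deterministic when several versions coexist.
    for candidate in sorted(root.glob("*/tessdata")):
        if candidate.is_dir():
            return candidate
    return None


# Root reported by `whereis tesseract-ocr` on the Debian system above.
tessdata = locate_tessdata(Path("/usr/share/tesseract-ocr"))
if tessdata is not None:
    # Only set the variable when the user has not chosen a value already.
    os.environ.setdefault("TESSDATA_PREFIX", str(tessdata))
```

Running this before calling `pymupdf.get_tessdata()` should sidestep the hard-coded `4.00` path, since the lookup no longer depends on the Tesseract version number.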

Thanks for developing PyMuPDF! :)

### PyMuPDF version

1.24.9

### Operating system

Linux

### Python version

3.11

Code referenced by the permalink above:

```python
# determine tessdata via iteration over subfolders
tessdata = None
for sub_response in response.iterdir():
    for sub_sub in sub_response.iterdir():
        if str(sub_sub).endswith("tessdata"):
            tessdata = sub_sub
            break
```

Cannot get Tessdata with Tesseract-OCR 5 #3767
