Closed
Labels: Fixed in next release, bugfix developed, release schedule to be determined
Description
Description of the bug
When using insert_htmlbox to render Thai text or long pure number sequences (which do not have natural word breaks), the following problems occur:
- No Auto-Scaling:
  - If the input text cannot be split (for example, Thai text without spaces, or a long number like "12345678901234567890"), insert_htmlbox will not auto-scale (shrink) the text to fit the given rectangle. As a result, the content overflows the boundary and is cut off.
- Wrong Hyphen for Thai (&shy; handling):
  - When Thai text is pre-tokenized (e.g., using PyThaiNLP) and then joined with &shy; to represent soft line-break opportunities, insert_htmlbox may insert a "-" (hyphen) at the line break. However, adding a hyphen at word breaks does not follow Thai writing conventions and is visually and semantically incorrect.
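The pre-tokenize-and-join step described above can be sketched roughly as follows. The helper name `join_with_soft_hyphens` is hypothetical, and the tokens are hardcoded from the second Thai example in this report rather than produced by an actual PyThaiNLP call (a tokenizer such as PyThaiNLP's might yield a different split).

```python
def join_with_soft_hyphens(tokens, font_size_pt=60.02, color="rgb(255,255,211)"):
    """Join pre-tokenized words with &shy; and wrap them in a styled span.

    Hypothetical helper for illustration; the style values mirror the
    Thai example in this report.
    """
    body = "&shy;".join(tokens)
    return f'<span style="font-size:{font_size_pt}pt;color:{color};">{body}</span>'


# Tokens as they appear in the report's second Thai example (hardcoded here).
tokens = ["ค่าธรรมเนียม", "ชำระ", "เมื่อ", "มาถึง"]
html = join_with_soft_hyphens(tokens)
```

The resulting `html` string is what would be passed to insert_htmlbox; each &shy; marks a position where a line break is allowed.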
How to reproduce the bug
- Thai case:

      import pymupdf

      # "bbox": [317.98, 201.75, 641.93, 264.3]
      rect = pymupdf.Rect(317.98, 201.75, 641.93, 264.3)
      text = '''<span style="font-size:60.02pt;color:rgb(255,255,211);">ค่าธรรมเนียมชำระเมื่อมาถึง</span>'''
      # or, with &shy; between the tokens:
      # text = '''<span style="font-size:60.02pt;color:rgb(255,255,211);">ค่าธรรมเนียม&shy;ชำระ&shy;เมื่อ&shy;มาถึง</span>'''
      page.insert_htmlbox(rect, text, scale_low=0)  # page: an existing pymupdf.Page
- Number case:
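The reporter's snippet for the number case is not preserved here. As a rough sketch only, and not the original reproduction, the case might look like the following. It assumes the same rectangle and font size as the Thai case and uses the number quoted earlier in the description; the function names `number_case_html` and `reproduce_number_case` are hypothetical.

```python
def number_case_html(digits="12345678901234567890", font_size_pt=60.02):
    """Build a span holding one long, unbreakable digit sequence.

    The font size is an assumption borrowed from the Thai case.
    """
    return f'<span style="font-size:{font_size_pt}pt;">{digits}</span>'


def reproduce_number_case(page, rect=(317.98, 201.75, 641.93, 264.3)):
    """Insert the digit sequence into rect; `page` is expected to be a pymupdf.Page.

    The rect is an assumption reused from the Thai case.
    """
    # scale_low=0 permits unlimited down-scaling, as in the Thai example.
    # PyMuPDF documents the return value as a (spare_height, scale) tuple.
    return page.insert_htmlbox(rect, number_case_html(), scale_low=0)
```

With PyMuPDF installed, `page` could come from `pymupdf.open().new_page()`, using a page large enough to contain the rectangle.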

Expected:
- If the text does not naturally break but overflows the rect, insert_htmlbox should apply auto-scaling so the full content fits.
- When using &shy; as a break point (such as in Thai tokenization), do not insert a hyphen character at the break; in Thai, no such symbol should appear.
Observed:
- No auto-shrinking (scaling) for unbreakable blocks.
- A hyphen "-" is added at Thai &shy; line breaks, which is not appropriate for Thai (and similarly for Chinese, Japanese, Korean, etc.).
PyMuPDF version
1.26.3
Operating system
Windows
Python version
3.10