Your configuration (mandatory)
Operating system: Linux, Ubuntu
Python version: 3.11.4
PyMuPDF version: 1.23.2, wheel
Describe the bug (mandatory)
I have an externally generated PDF where words like "sort" and "office" have ligature-like groups of letters wrapped in ActualText. Here is the "rt" in "sort" in the source PDF:
PyMuPDF's page.get_text() extracts them as the Unicode replacement character �. The original characters do not appear to be recoverable, even with the "xml" or "rawdict" extraction options.
To Reproduce (mandatory)
Try page.get_text() to extract text including the /ActualText span above. Get back "�".
Expected behavior (optional)
It would be great to have an option like TEXT_PRESERVE_ACTUALTEXT that returns the ActualText value "rt" (in the above example) rather than the replacement <92>.
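Until such an option exists, a rough diagnostic may help confirm what is being lost. The sketch below is only an illustration, not PyMuPDF functionality: it counts replacement characters in extracted text and scans a decoded page content stream for /ActualText entries with a regular expression. The helper names and the sample stream are hypothetical, the regex ignores literal strings with nested or escaped parentheses, and it assumes the decoded stream bytes are already available (for example from page.read_contents()).

```python
import re

REPLACEMENT_CHAR = "\ufffd"

# /ActualText followed by a literal string (...) or a hex string <...>.
# Simplification: nested or escaped parentheses are not handled.
_ACTUALTEXT_RE = re.compile(
    rb"/ActualText\s*(?:\(([^()\\]*)\)|<([0-9A-Fa-f\s]*)>)"
)


def count_replacement_chars(text: str) -> int:
    """Count U+FFFD characters in text returned by an extractor."""
    return text.count(REPLACEMENT_CHAR)


def _decode_pdf_string(raw: bytes) -> str:
    """Decode a PDF text string: UTF-16BE if it starts with a BOM,
    otherwise Latin-1 as an approximation of PDFDocEncoding."""
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")


def find_actualtext(content: bytes) -> list[str]:
    """Return the /ActualText values found in a decoded content stream."""
    values = []
    for match in _ACTUALTEXT_RE.finditer(content):
        if match.group(2) is not None:
            # Hex string: drop whitespace, pad an odd digit count with 0.
            digits = b"".join(match.group(2).split())
            if len(digits) % 2:
                digits += b"0"
            raw = bytes.fromhex(digits.decode("ascii"))
        else:
            raw = match.group(1)
        values.append(_decode_pdf_string(raw))
    return values


# Hypothetical content stream fragment, not taken from the reported PDF.
sample_stream = b"BT /F1 12 Tf (so) Tj /Span <</ActualText (rt)>> BDC <0092> Tj EMC ET"
print(find_actualtext(sample_stream))
print(count_replacement_chars("so\ufffd"))
```

This only lists the replacement strings; it does not place them at the right position in the extracted text, which is what the requested option would need to do.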