In some documents, get_text outputs the wrong characters within words. For instance, the text in the PDF reads "Dort machten die Handelsschiffe auf der Überfahrt", but I get "Dort machten die Handelsschiye auf der Überfahrt".
It happens with "ff" and probably other letter combinations. When copying from the document in a PDF reader like SumatraPDF, I also get "Dort machten die Handelsschiye auf der Überfahrt".
PyMuPDF version
1.23.x or earlier
Operating system
Windows
Python version
3.11
You included neither a reproducing file nor a code snippet.
So this post does not yet qualify as a bug, and we are forced to guess:
Your file may use ligatures in the text. "ff" is one of the 6 standard ligatures in Latin text, which means that one Unicode code point (and one glyph) is used to represent multiple characters.
By default, ligatures are passed through in text extraction - however, depending on your output device, they should still look ok.
You can try with a modified text extraction flag bit combination to confirm. E.g. flags=0. This will dissolve ligatures into their components. For details see documentation.
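To check whether ligatures are involved at all, a minimal sketch along the following lines may help. It assumes PyMuPDF is installed and importable as `fitz`; the helper names (`find_ligatures`, `expand_ligatures`, `extract_text_without_ligatures`) are hypothetical and not part of PyMuPDF. The post-hoc expansion uses Unicode NFKC normalization from the standard library, which is a different technique from the extraction flag and is shown only for comparison.

```python
import unicodedata

# Latin ligature code points (U+FB00..U+FB06) in the Unicode
# "Alphabetic Presentation Forms" block.
LIGATURE_RANGE = range(0xFB00, 0xFB07)


def find_ligatures(text):
    """Return the set of Latin ligature characters present in text."""
    return {ch for ch in text if ord(ch) in LIGATURE_RANGE}


def expand_ligatures(text):
    """Replace each Latin ligature with its component letters via NFKC."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in LIGATURE_RANGE else ch
        for ch in text
    )


def extract_text_without_ligatures(pdf_path, page_number=0):
    """Extract plain text from one page with flags=0, as suggested above."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("text", flags=0)
```

If `find_ligatures` returns an empty set for text extracted with the default flags and the output still contains "Handelsschiye", the ligature explanation would not apply to that file, and changing the flags is unlikely to alter the result.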
Closing this for lack of response over an extended time interval.
In a future release, we will change the default text flags for searches so that ligatures are no longer preserved.