There doesn't seem to be any way to get ActualText values #2889

Closed
stevesimmons opened this issue Dec 13, 2023 · 2 comments
Labels
duplicate enhancement-upstream to be implemented by MuPDF

Comments

stevesimmons commented Dec 13, 2023

Describe the bug (mandatory)

I have an externally generated PDF where words like "sort" and "office" have ligature-like groups of letters wrapped in ActualText.

Here is the "rt" in "sort" in the source PDF:

/Span<</ActualText (rt) >> BDC
8.0101776 0 Td <92> Tj
EMC
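
To make the structure of that span concrete, here is a rough sketch that pulls ActualText values out of a raw content stream with a regular expression. It is only an illustration of what the marked-content sequence contains, not how MuPDF or PyMuPDF parse PDFs. The helper name `actual_text_spans` is made up for this example, and the pattern assumes a literal-string ActualText with no escaped or nested parentheses.

```python
import re

# Matches a /Span marked-content sequence that carries a literal-string ActualText.
# Assumption: no escaped/nested parentheses and no hex-string form of ActualText.
SPAN_RE = re.compile(
    rb"/Span\s*<<\s*/ActualText\s*\((?P<actual>[^()]*)\)\s*>>\s*BDC"
    rb"(?P<body>.*?)EMC",
    re.DOTALL,
)


def actual_text_spans(stream: bytes) -> list[tuple[str, bytes]]:
    """Return (ActualText, raw span body) pairs found in a content stream."""
    return [
        (m.group("actual").decode("latin-1"), m.group("body").strip())
        for m in SPAN_RE.finditer(stream)
    ]


# The span quoted above, copied verbatim.
sample = b"""/Span<</ActualText (rt) >> BDC
8.0101776 0 Td <92> Tj
EMC"""

spans = actual_text_spans(sample)
print(spans)
```

The pair it yields shows the mismatch: the intended text is "rt", while the span body only shows the single glyph code <92>.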

PyMuPDF's page.get_text() extracts them as the Unicode replacement character � (U+FFFD). The original characters do not appear to be recoverable, even with the "xml" or "rawdict" extraction options.

To Reproduce (mandatory)

Call page.get_text() on a page that contains the /ActualText span above. The returned text has "�" where "rt" should be.
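
A minimal reproduction sketch, assuming PyMuPDF is installed and importable as `fitz`. The helper names `replacement_positions` and `check_pdf` are made up for this example, and any path passed to `check_pdf` stands in for the affected PDF, which is not attached here.

```python
def replacement_positions(text: str) -> list[int]:
    """Return the indices of U+FFFD replacement characters in extracted text."""
    return [i for i, ch in enumerate(text) if ch == "\ufffd"]


def check_pdf(path: str) -> dict[int, list[int]]:
    """Map page number -> positions of U+FFFD in that page's plain text."""
    import fitz  # PyMuPDF; imported lazily so the helper above works without it

    hits = {}
    with fitz.open(path) as doc:
        for page in doc:
            found = replacement_positions(page.get_text())
            if found:
                hits[page.number] = found
    return hits


# Hand-written string standing in for extracted text with two damaged words.
demo = replacement_positions("so\ufffd and o\ufffdce")
print(demo)
```

Running `check_pdf` on the affected file should list every page where a ligature-like group came back as the replacement character.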

Expected behavior (optional)

It would be great to have an option such as TEXT_PRESERVE_ACTUALTEXT that returns the ActualText value ("rt" in the example above) instead of the replacement character for glyph <92>.

Your configuration (mandatory)

  • Operating system: Linux, Ubuntu
  • Python version: 3.11.4
  • PyMuPDF version: 1.23.2, wheel
@JorjMcKie added duplicate enhancement-upstream to be implemented by MuPDF labels Dec 13, 2023
@JorjMcKie
Collaborator

see #2876

@JorjMcKie
Collaborator

This issue has been resolved with version 1.23.14.
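
For anyone checking whether an environment already includes that release, a small sketch along these lines may help. It assumes the package is installed under the distribution name "PyMuPDF" and that the version string is purely numeric and dotted. The helper names are made up for this example.

```python
from importlib import metadata
from typing import Optional

FIXED_IN = (1, 23, 14)  # version named in the comment above


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '1.23.14' into (1, 23, 14); assumes purely numeric dotted parts."""
    return tuple(int(part) for part in version.split("."))


def has_fix(version: str) -> bool:
    """True if the given version is at least the release named as fixed."""
    return parse_version(version) >= FIXED_IN


def installed_has_fix() -> Optional[bool]:
    """Check the installed PyMuPDF distribution; None if it is not installed."""
    try:
        return has_fix(metadata.version("PyMuPDF"))
    except metadata.PackageNotFoundError:
        return None


# The version from the original report predates the fix.
print(has_fix("1.23.2"), has_fix("1.23.14"))
```

Comparing integer tuples rather than raw strings matters here, because "1.23.2" sorts after "1.23.14" as text.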
