There doesn't seem to be any way to get ActualText values #2889

stevesimmons · 2023-12-13T14:12:27Z

Describe the bug (mandatory)

I have an externally generated PDF where words like "sort" and "office" have ligature-like groups of letters wrapped in ActualText.

Here is the "rt" in "sort" in the source PDF:

/Span<</ActualText (rt) >> BDC
8.0101776 0 Td <92> Tj
EMC

PyMuPDF's page.get_text() extracts them as the Unicode replacement character �. The original characters do not appear to be recoverable, even with the "xml" or "rawdict" extraction options.
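To make the structure of such a span concrete, here is a minimal sketch that pulls ActualText values straight out of a decoded content stream with a regular expression. It is an illustration only, not how PyMuPDF or MuPDF handle marked content: the helper name `actualtext_spans` is made up for this example, and the pattern assumes a `/Span` property list holding a single literal-string `/ActualText` with no escapes or nested parentheses (hex strings and UTF-16BE values are not handled).

```python
import re

# Matches a marked-content sequence of the form
#   /Span << /ActualText (text) >> BDC ... EMC
# Simplifying assumption: the ActualText value is a literal string
# without escapes or nested parentheses.
ACTUALTEXT_SPAN = re.compile(
    rb"/Span\s*<<\s*/ActualText\s*\(([^()\\]*)\)\s*>>\s*BDC(.*?)EMC",
    re.DOTALL,
)


def actualtext_spans(content: bytes):
    """Return (actual_text, span_body) pairs found in a decoded content stream."""
    return [
        (m.group(1).decode("latin-1"), m.group(2).strip().decode("latin-1"))
        for m in ACTUALTEXT_SPAN.finditer(content)
    ]


# The span quoted above, as raw content-stream bytes.
stream = b"/Span<</ActualText (rt) >> BDC\n8.0101776 0 Td <92> Tj\nEMC\n"
print(actualtext_spans(stream))
```

With PyMuPDF, the decoded content stream of a page should be obtainable via page.read_contents(), so the bytes passed in could come from there. A regex like this only lists the ActualText values; mapping them back to positions in the extracted text is a separate problem.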

To Reproduce (mandatory)

Call page.get_text() on a page containing the /ActualText span above. The result contains "�" where "rt" should appear.
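A small check along these lines can confirm the symptom on a given page. It is a sketch under assumptions: `count_replacement_chars` is a hypothetical helper, and it relies on the documented layout of the "rawdict" output (blocks, lines, spans, chars, with each char's text under the key "c").

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character


def count_replacement_chars(page):
    """Count U+FFFD in the plain-text and "rawdict" output of a page.

    `page` is expected to behave like a PyMuPDF Page, i.e. to offer
    get_text() and get_text("rawdict").
    """
    plain = page.get_text()
    raw = page.get_text("rawdict")

    raw_count = 0
    for block in raw.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                for ch in span.get("chars", []):
                    if ch.get("c") == REPLACEMENT:
                        raw_count += 1

    return {"text": plain.count(REPLACEMENT), "rawdict": raw_count}
```

Typical use would be to open the document with fitz.open(...) (the file name is whatever PDF shows the problem), pick the affected page, and pass it to count_replacement_chars. Non-zero counts in both modes match the behavior described here.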

Expected behavior (optional)

It would be great to have an option like TEXT_PRESERVE_ACTUALTEXT that returns the ActualText value ("rt" in the above example) rather than a replacement character for the glyph <92>.

Your configuration (mandatory)

Operating system: Linux, Ubuntu
Python version: 3.11.4
PyMuPDF version: 1.23.2, wheel

JorjMcKie · 2023-12-13T14:15:32Z

see #2876

JorjMcKie · 2024-01-16T08:35:51Z

This issue has been resolved with version 1.23.14.
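To check whether an installed version already includes the fix, the version numbers need to be compared numerically, since "1.23.2" sorts after "1.23.14" as a plain string. The following is a minimal sketch; `version_tuple` and `has_fix` are hypothetical helpers, and the parsing assumes purely numeric dotted versions (no "rc" or similar suffixes).

```python
FIXED_IN = "1.23.14"  # version named above as containing the fix


def version_tuple(version: str):
    """Turn a dotted numeric version such as "1.23.2" into (1, 23, 2)."""
    return tuple(int(part) for part in version.split(".")[:3])


def has_fix(installed: str) -> bool:
    """True if `installed` is at least the version that resolved this issue."""
    return version_tuple(installed) >= version_tuple(FIXED_IN)


print(has_fix("1.23.2"))   # the version from the report
print(has_fix("1.23.14"))
```

The installed version string can be read with importlib.metadata.version("PyMuPDF") from the standard library and passed to has_fix.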

JorjMcKie added duplicate enhancement-upstream to be implemented by MuPDF labels Dec 13, 2023

JorjMcKie closed this as completed Jan 16, 2024
