Information dropped when using horizontal_strategy="text" in Page.find_tables() #3675

Closed
will-0 opened this issue Jul 11, 2024 · 2 comments
Labels: not a bug (not a bug / user error / unable to reproduce)

will-0 commented Jul 11, 2024

Description of the bug

When calling .find_tables() with horizontal_strategy="text", text from various cells is excluded.

This was found on a table with a structure similar to this example. The original is sensitive so can't be attached, but the problem persists with the attached document.

If this is expected behaviour (e.g. the boundaries of the table aren't being identified), then I will refer this onto pymupdf/RAG, as this was initially noticed when experimenting with pymupdf4llm.to_markdown() and finding it was leaving important pieces of the document out of the Markdown.

How to reproduce the bug

Input PDF:

image

Code:

import pymupdf

doc = pymupdf.open("temp.pdf")
# Rows are inferred from text positions, columns from drawn vector lines.
tables = doc[0].find_tables(horizontal_strategy="text", vertical_strategy="lines")
result = tables[0].to_markdown()

Result:

Col1 Col2 Col3 H1B Col5 H1C Col7 H1D
A1 D.A1
H2 H3 D.A2
2.0 3.0 D.A3
H2AV H2B H2BV H2C H2CV H2D
Info1
Info2
Info3

PyMuPDF version

1.24.7

Operating system

Windows

Python version

3.9

@JorjMcKie
Collaborator

This is no bug and works as designed.
The default strategy is "lines" in the standard PyMuPDF context, and "lines_strict" under pymupdf4llm. In both cases, the strategy is applied to the x- and y-directions.
The detection in both cases is based on vector graphics coordinates.

The difference is that "lines_strict" ignores background colors for cell detection and only considers graphics with a stroke color. This is often an advantage, because text is frequently highlighted without any intent to define table cells.
In pymupdf4llm, there is a parameter for setting the strategy (e.g. to "lines", the pymupdf default) for greater flexibility.

Strategy choices are, however, mutually exclusive: if you choose "text", then no other criteria (such as "lines") are considered in that dimension.
Your example is a table that is fully defined by gridlines. In such cases, it is counterproductive to specify any strategy other than the default "lines" / "lines_strict".
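
To see how the strategy choice affects a given page, the following is a minimal sketch that runs find_tables() with several strategy pairs and counts the non-empty cells each one yields. The names STRATEGY_PAIRS, count_filled_cells and compare_strategies are hypothetical helpers introduced for illustration; only find_tables(), its horizontal_strategy / vertical_strategy arguments and Table.extract() come from PyMuPDF.

```python
# Strategy pairs to compare, as (horizontal_strategy, vertical_strategy).
STRATEGY_PAIRS = [
    ("lines", "lines"),                # PyMuPDF default
    ("lines_strict", "lines_strict"),  # ignores fill-only graphics
    ("text", "lines"),                 # the combination used in this issue
]


def count_filled_cells(rows):
    """Count cells containing non-whitespace text.

    `rows` has the shape returned by Table.extract(): a list of rows,
    each a list of cell strings (or None for empty cells).
    """
    return sum(1 for row in rows for cell in row if cell and cell.strip())


def compare_strategies(pdf_path, page_number=0):
    """Return {(horizontal, vertical): [filled cell count per table]}."""
    import pymupdf  # imported lazily so the helper above works without it

    doc = pymupdf.open(pdf_path)
    page = doc[page_number]
    summary = {}
    for horizontal, vertical in STRATEGY_PAIRS:
        finder = page.find_tables(
            horizontal_strategy=horizontal,
            vertical_strategy=vertical,
        )
        summary[(horizontal, vertical)] = [
            count_filled_cells(table.extract()) for table in finder.tables
        ]
    doc.close()
    return summary
```

Calling compare_strategies("temp.pdf") on the attached document would show, per strategy pair, how many tables were found and how many cells in each carry text, which makes it easier to spot a pair that loses content.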

JorjMcKie added the "not a bug" label Jul 11, 2024
will-0 commented Jul 11, 2024

@JorjMcKie thanks for the response.

Apologies, I feel like my question and example weren't clear. I'm aware that "text" is not the best strategy for this particular table: however, it is the strategy we're required to use for the real tables. Here's a more accurate reflection:

image

This example has worse performance, with the following output:

Col1 H1B Col3 H1C Col5 Col6
x y x y
x y x y
x y x y
x y
H2AV H2B H2BV H2C H2CV H2D
X: y
X: y

My question is: is it expected that Page.find_tables() or pymupdf4llm.to_markdown() will remove textual information from the page? In our case, it's removing important portions of the document.

(as a note: these example tables were created in Microsoft Word and then exported to PDF)
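
One way to quantify the dropped text is to compare the words on the page with the words that end up in the extracted table cells. The sketch below is an assumption-laden diagnostic, not part of PyMuPDF: words_missing_from_tables and report_dropped_words are hypothetical names, and the comparison is by exact word match only.

```python
def words_missing_from_tables(page_words, tables_cells):
    """Return page words that appear in no extracted table cell.

    page_words:   list of word strings found on the page.
    tables_cells: list of tables, each in the shape of Table.extract()
                  (rows of cell strings, None for empty cells).
    """
    extracted = set()
    for rows in tables_cells:
        for row in rows:
            for cell in row:
                if cell:
                    extracted.update(cell.split())
    return [word for word in page_words if word not in extracted]


def report_dropped_words(pdf_path, page_number=0, **find_tables_kwargs):
    """List words on a page that are absent from every detected table."""
    import pymupdf  # imported lazily so the helper above works without it

    doc = pymupdf.open(pdf_path)
    page = doc[page_number]
    # Each "words" item is a tuple; index 4 holds the word text.
    words = [item[4] for item in page.get_text("words")]
    finder = page.find_tables(**find_tables_kwargs)
    cells = [table.extract() for table in finder.tables]
    doc.close()
    return words_missing_from_tables(words, cells)
```

For example, report_dropped_words("temp.pdf", horizontal_strategy="text", vertical_strategy="lines") would list the words that find_tables() did not place in any cell. Words that legitimately sit outside a table are reported too, so the result needs a manual check.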
