When calling .find_tables() with horizontal_strategy="text", text from various cells is excluded.
This was found on a table with a structure similar to this example. The original is sensitive so can't be attached, but the problem persists with the attached document.
If this is expected behaviour (e.g. the boundaries of the table aren't being identified), then I will refer this onto pymupdf/RAG, as this was initially noticed when experimenting with pymupdf4llm.to_markdown() and finding it was leaving important pieces of the document out of the Markdown.
How to reproduce the bug
Input PDF:
Code:
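import pymupdf

doc = pymupdf.open("temp.pdf")
# Row boundaries are inferred from text positions, column boundaries from drawn lines.
tables = doc[0].find_tables(horizontal_strategy="text", vertical_strategy="lines")
result = tables[0].to_markdown()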
Result:
Col1
Col2
Col3
H1B
Col5
H1C
Col7
H1D
A1
D.A1
H2
H3
D.A2
2.0
3.0
D.A3
H2AV
H2B
H2BV
H2C
H2CV
H2D
Info1
Info2
Info3
PyMuPDF version
1.24.7
Operating system
Windows
Python version
3.9
This is no bug and works as designed.
The default strategy is "lines" in the standard PyMuPDF context, and "lines_strict" under pymupdf4llm. In both cases, the strategy is applied to the x- and y-directions.
The detection in both cases is based on vector graphics coordinates.
The difference is that "lines_strict" ignores background colors for cell detection and only considers graphics with a stroke color. This is often an advantage, because text is frequently highlighted without any intention of defining table cells.
In pymupdf4llm, there is a parameter for setting the strategy (e.g. to "lines", the pymupdf default) for greater flexibility.
Strategy choices are, however, mutually exclusive: if you choose "text", then no other criteria (like "lines") will be considered in that dimension.
Your example is a table that is fully defined by gridlines. In such cases, it is counter-productive to specify any strategy other than the default "lines" / "lines_strict".
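To see how much the strategy choice changes the outcome for a given page, the settings discussed above can be run side by side. This is only a rough sketch: compare_strategies is a hypothetical helper name, and it assumes `page` is a pymupdf.Page whose find_tables() accepts the strategy, horizontal_strategy and vertical_strategy keywords.

```python
def compare_strategies(page):
    """Run find_tables() on one page with several strategy settings.

    `page` is assumed to be a pymupdf.Page. Returns a dict that maps a
    label to the list of extracted tables (each a list of rows of cells).
    """
    settings = {
        # PyMuPDF default: "lines" in both directions.
        "lines": {},
        # Only graphics with a stroke color count as cell borders.
        "lines_strict": {"strategy": "lines_strict"},
        # Combination from the report: row boundaries from text positions,
        # column boundaries from drawn lines.
        "text rows / line columns": {
            "horizontal_strategy": "text",
            "vertical_strategy": "lines",
        },
    }
    results = {}
    for label, kwargs in settings.items():
        finder = page.find_tables(**kwargs)
        results[label] = [table.extract() for table in finder.tables]
    return results
```

Calling compare_strategies(doc[0]) on the page from the report and printing each entry should make it visible whether the cells that go missing under "text" are present under the line-based strategies.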
Apologies, I feel like my question and example weren't clear. I'm aware that "text" is not the best strategy for this particular table: however, it is the strategy we're required to use for the real tables. Here's a more accurate reflection:
This example has worse performance, with the following output:
Col1
H1B
Col3
H1C
Col5
Col6
x y
x y
x y
x y
x y
x y
x y
H2AV
H2B
H2BV
H2C
H2CV
H2D
X: y
X: y
My question is: is it expected that Page.find_tables() or pymupdf4llm.to_markdown() removes textual information from the page? In our case, important portions of the document are being removed.
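One possible way to narrow down where the text goes is to compare the words on the page with the cell rectangles of a detected table. The sketch below is not PyMuPDF's own logic; split_unassigned_words is a hypothetical helper, and it assumes that `page` is a pymupdf.Page, that `table` is one of the tables returned by find_tables(), that table.bbox is an (x0, y0, x1, y1) tuple, and that table.cells is a list of such tuples.

```python
def split_unassigned_words(page, table):
    """Classify page words that do not land in any cell of `table`.

    Returns (outside_table, uncovered):
      outside_table: words whose center lies outside table.bbox
      uncovered:     words inside table.bbox but inside none of its cells
    """
    tx0, ty0, tx1, ty1 = table.bbox
    cells = [c for c in table.cells if c is not None]
    outside_table, uncovered = [], []
    # Each "words" item starts with x0, y0, x1, y1, text.
    for x0, y0, x1, y1, text, *_ in page.get_text("words"):
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2  # word center
        if not (tx0 <= cx <= tx1 and ty0 <= cy <= ty1):
            outside_table.append(text)
        elif not any(c[0] <= cx <= c[2] and c[1] <= cy <= c[3] for c in cells):
            uncovered.append(text)
    return outside_table, uncovered
```

If the missing text shows up in outside_table, the table boundary was probably detected too tightly; if it shows up in uncovered, the boundary is fine but the inferred cells do not cover that text.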
(as a note: these example tables were created in Microsoft Word and then exported to PDF)