Capture printed error when input pdf file is corrupted #3649

Closed
han-xiao-upright opened this issue Jul 2, 2024 · 2 comments

han-xiao-upright commented Jul 2, 2024

Is your feature request related to a problem? Please describe.

I'd like to implement functionality that processes multiple PDF files one by one.

Some of the PDFs are "valid", while others are corrupted and should be ignored.

In my case, reading a corrupted PDF file makes pymupdf print a list of errors instead of raising them.

For instance, the following code

import pymupdf

doc = pymupdf.open("/path/to/a/corrupted/file.pdf")
p0 = doc[0]
p0.get_pixmap(dpi=100)

gives

MuPDF error: library error: zlib error: incorrect header check

MuPDF error: format error: cmsOpenProfileFromMem failed

MuPDF error: library error: zlib error: incorrect header check

MuPDF error: syntax error: syntax error in content stream

MuPDF error: syntax error: syntax error in content stream

Describe the solution you'd like

Arguably, raising the errors makes handling them easier. So I would like the following:

  • instead of printing the error and continuing to the subsequent code, raise an exception and stop there

Perhaps via a configurable keyword argument, for example:

doc = pymupdf.open("/path/to/a/corrupted/file.pdf")
p0 = doc[0]
p0.get_pixmap(dpi=100, raise_error=True)

JorjMcKie (Collaborator) commented Jul 2, 2024

Not all errors require raising an exception. On the contrary, MuPDF strives to keep processing by falling back to whatever repair mechanisms are available.
You can suppress the display of these messages by setting a global parameter via pymupdf.TOOLS.mupdf_display_errors(False).
In any case, you can extract the error and warning messages (all collected in the same pymupdf string variable) via pymupdf.TOOLS.mupdf_warnings(reset=True). Each call with reset=True empties that variable.
In your case, you could already do this today:

import pymupdf
pymupdf.TOOLS.mupdf_display_errors(False)

# then, at any desired spot (e.g. pixmap creation) do this:
pix = page.get_pixmap()
msg = pymupdf.TOOLS.mupdf_warnings(reset=True)
if "error" in msg:
    raise RuntimeError(msg)

I am very much against an implementation along the lines you suggested. How many dozens or hundreds of methods would we have to change?
A focused implementation like the one shown above serves the same purpose.
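
For the batch use case from the original post, the approach above could be wrapped roughly as follows. This is a minimal sketch, not part of PyMuPDF: `MuPDFReportedError`, `check_mupdf_messages`, `render_first_page`, and `render_valid_pdfs` are hypothetical names introduced here for illustration, and treating any message line that contains "error" as fatal is an assumption that may be too strict for files MuPDF manages to repair.

```python
class MuPDFReportedError(RuntimeError):
    """Hypothetical exception for errors that MuPDF logged but did not raise."""


def check_mupdf_messages(msg):
    """Raise if the collected MuPDF message text contains an error line."""
    errors = [line for line in msg.splitlines() if "error" in line.lower()]
    if errors:
        raise MuPDFReportedError("\n".join(errors))


def render_first_page(path, dpi=100):
    """Render page 0 of one PDF, raising if MuPDF logged any error."""
    import pymupdf  # imported here so the helper above works without it

    pymupdf.TOOLS.mupdf_display_errors(False)  # do not print the messages
    pymupdf.TOOLS.mupdf_warnings(reset=True)   # discard messages from earlier files
    with pymupdf.open(path) as doc:
        pix = doc[0].get_pixmap(dpi=dpi)
    check_mupdf_messages(pymupdf.TOOLS.mupdf_warnings(reset=True))
    return pix


def render_valid_pdfs(paths):
    """Process files one by one; collect pixmaps and the reasons for skips."""
    results, skipped = {}, {}
    for path in paths:
        try:
            results[str(path)] = render_first_page(path)
        except Exception as exc:  # broad on purpose: any failure means "skip this file"
            skipped[str(path)] = str(exc)
    return results, skipped
```

Clearing the message buffer before each file keeps errors from one document from being attributed to the next. Files that fail to open at all raise their own exceptions and end up in `skipped` as well.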

han-xiao-upright (Author) commented

Thanks for the insight! Closing the issue.
