I just spent a day with PyMuPDF and it gave me fitz! #2681

Closed
gnash1 opened this issue Sep 21, 2023 · 16 comments

Comments

@gnash1

gnash1 commented Sep 21, 2023

Both Google Bard and ChatGPT recommended (free marketing!) that I use PyMuPDF (AKA fitz) for my project, so I spent some time trying to make it work and was surprised how difficult the implementation was.

It looks like a great tool with excellent documentation (https://pymupdf.readthedocs.io/en/latest). However, it also looks like the code is native C/C++ that is automatically converted to Python using SWIG and then published, and I think this is what is causing the issues.
After conversion the code needs testing in order to ensure that it is actually working, starting with the examples here: https://pymupdf.readthedocs.io/en/latest/the-basics.html

Also, a side note on the conversion: it seems to create code that, while "working", is not elegant and could be improved with a native Python eye. A simple example is changing loops over indexes to loops over objects.

Environment: PyCharm 2022.2.3 running Python 3.9-64 on Windows 10 Version 22H2
Download PyCharm and test here: https://www.jetbrains.com/pycharm/download

Issues:
Setup in PyCharm has an initial conflict with:
fitz 0.0.1.dev2 from Erik Kastman http://github.com/kastman/fitz
Resolution: Uninstall the beta "fitz" version (which I believe comes installed with PyCharm) and install fitz (PyMuPDF version 1.23.3)
pip install --force-reinstall pymupdf
Documentation: Consider including this here?: https://pymupdf.readthedocs.io/en/latest/installation.html
References:
https://stackoverflow.com/questions/67112724/fitz-open-not-working-when-in-a-for-loop-fitz-python-pymupdf/77143227#77143227
https://stackoverflow.com/questions/76985067/i-am-having-an-import-error-with-the-fitz-library-in-pycharm
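
For anyone hitting the same naming conflict, here is a minimal sketch of a check for which distribution is providing `fitz`. It uses only the standard library's `importlib.metadata`. The helper names `installed_version` and `diagnose_fitz_conflict` are made up for illustration, and it assumes the two distributions are registered under the names `fitz` and `PyMuPDF`.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def diagnose_fitz_conflict(lookup=installed_version):
    """Describe which providers of a 'fitz' module are installed.

    `lookup` is injectable so the logic can be exercised without
    touching the real environment (hypothetical design choice).
    """
    unrelated = lookup("fitz")     # the unrelated 'fitz' distribution
    pymupdf = lookup("PyMuPDF")    # the package this issue is about
    if unrelated and pymupdf:
        return "conflict: uninstall 'fitz', then reinstall PyMuPDF"
    if unrelated:
        return "wrong package: uninstall 'fitz' and install PyMuPDF"
    if pymupdf:
        return f"ok: PyMuPDF {pymupdf} is the only provider"
    return "neither package is installed"


if __name__ == "__main__":
    print(diagnose_fitz_conflict())
```

Run it with the same interpreter the project uses, otherwise it reports on a different environment.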

doc = fitz.open("test.pdf")
Error: Cannot find reference 'open' in '__init__.py | __init__.py'
Workaround: fitz.Document()
Documentation: https://pymupdf.readthedocs.io/en/latest/the-basics.html#opening-a-file
References: https://youtrack.jetbrains.com/issue/PY-48110

This started the pattern of being forced to try and work out a way forward accessing deprecated functions when the new ones do not work.

Examples of errors in the IDE (You don't have to run the program to see them)

page.get_pixmap() returns the error: Unresolved attribute reference 'get_pixmap' for class 'Page'
Workaround: call a deprecated function: pixMap = fitz.utils.get_pixmap(page=page)
Documentation: https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_pixmap
"Create a pixmap from the page. This is probably the most often used method to create a Pixmap." - and yet it doesn't work!
Line 147 in __init__.py: is this alias working as expected? fitz.Page.get_pixmap = fitz.utils.get_pixmap

page.insert_image() returns the error: Unresolved attribute reference 'insert_image' for class 'Page'
Workaround: xref = fitz.utils.insert_image(rect=new_page.bound(), page=new_page, pixmap=pixMap)
Documentation: https://pymupdf.readthedocs.io/en/latest/page.html#Page.insert_image

new_page() returns the error: Unresolved attribute reference 'new_page' for class 'Document'
Workaround: fitz.utils.new_page(doc_new)
Documentation: https://pymupdf.readthedocs.io/en/latest/document.html#Document.new_page
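
The alias line quoted in the report (fitz.Page.get_pixmap = fitz.utils.get_pixmap) suggests one plausible reason an IDE could flag these calls without running anything: a method attached by assignment at import time is not visible to a tool that only reads the class body. The sketch below shows that general Python behaviour with stand-in classes. It is not PyMuPDF's actual code, and `Page` and `get_pixmap` here are toy definitions.

```python
class Page:
    """Stand-in for a wrapped class; its body defines no get_pixmap."""

    def __init__(self, number):
        self.number = number


def get_pixmap(page, dpi=72):
    """Stand-in for a helper kept in a separate utilities module."""
    return f"pixmap(page={page.number}, dpi={dpi})"


# A tool that reads only the class body sees no such attribute.
defined_in_class_body = "get_pixmap" in vars(Page)

# Attaching the helper by assignment makes it a normal method at runtime.
Page.get_pixmap = get_pixmap

page = Page(0)
print(defined_in_class_body)       # False
print(page.get_pixmap(dpi=150))    # pixmap(page=0, dpi=150)
```

If that is what is happening, an "Unresolved attribute reference" warning is a statement about static analysis, not about what happens when the program runs.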

Too many problems with what looks like amazing code that seems to be untested in Python, so I will be moving on to another option, but hope these can be easily remedied in the future so that the quality of the code matches the quality of the documentation.

@julian-smith-artifex-com
Collaborator

julian-smith-artifex-com commented Sep 21, 2023

We automatically run the PyMuPDF test suite every day on multiple machines (Windows, Linux and MacOS), and many thousands of people are using PyMuPDF around the world without the basic problems you're seeing.

If PyCharm comes with fitz pre-installed, then i'm afraid this is a problem with PyCharm, not PyMuPDF. Maybe you could ask the PyCharm people about this.

And if basic things like fitz.open() and page.insert_image() are not working after you've removed fitz, then i'm afraid that your Python and/or PyCharm installation is fundamentally broken.

I'll try to install PyCharm and see whether i can reproduce the problems you're seeing.

In the meantime, it would be good if you could try PyMuPDF in a clean Python environment that is totally separate from PyCharm. This can be done really easily (change "zlib.3.pdf" to a local PDF file):

Edit: First delete any existing pylocal/ directory, so we know we are using a completely fresh venv.

py -m venv pylocal
pylocal\Scripts\activate
python -m pip install --upgrade pip
pip install pymupdf
python
>>> import fitz
>>> document = fitz.open("zlib.3.pdf")
>>> for page in document:
...     print(page)
... 
page 0 of mupdf/thirdparty/zlib/zlib.3.pdf
page 1 of mupdf/thirdparty/zlib/zlib.3.pdf
>>> 

@julian-smith-artifex-com
Collaborator

I've just installed PyCharm 2023.2.1 on a Windows machine and it all works fine for me.

  • It did not have fitz pre-installed.
  • Adding PyMuPDF using Python Packages works.
  • A test programme that uses fitz.open() works.

Installing fitz breaks things, as expected, with ModuleNotFoundError: No module named 'frontend'.

To recover:

  • On Python Package (stack icon on lower left hand side of main window), search for fitz and do Delete Package.
  • This changes the test programme error to AttributeError: module 'fitz' has no attribute 'open'.
  • There is no option to reinstall PyMuPDF.
  • Instead, on PyMuPDF do Delete Package and then Install package.
  • The test programme works again.
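
The same recovery can be approximated from a terminal instead of the PyCharm package UI. This is a sketch only: it builds the pip command lines for the interpreter that runs it, which is assumed to be the project's interpreter. `recovery_commands` is a hypothetical helper name, and nothing is executed unless you pass the commands to `subprocess.run` yourself.

```python
import sys


def recovery_commands(python=sys.executable):
    """Build the pip invocations for the recovery steps, in order."""
    pip = [python, "-m", "pip"]
    return [
        pip + ["uninstall", "-y", "fitz"],      # remove the unrelated package
        pip + ["uninstall", "-y", "PyMuPDF"],   # remove the damaged install
        pip + ["install", "PyMuPDF"],           # install a clean copy
    ]


if __name__ == "__main__":
    for cmd in recovery_commands():
        print(" ".join(cmd))
    # To execute: subprocess.run(cmd, check=True) for each cmd.
```

Calling pip as `python -m pip` ties each command to one specific interpreter, which avoids modifying a different Python installation by accident.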

@gnash1
Author

gnash1 commented Sep 21, 2023 via email

@julian-smith-artifex-com
Collaborator

What you have reported indicates that re-installation of PyMuPDF in PyCharm has not worked, but to figure out what is going on, we need more information.

  • Given that things work fine for me with PyCharm, to make more progress i need to see exactly what is happening on your machine.
  • Please follow the exact instructions i gave and report back how you get on.
  • Include screenshots showing presence/absence of the fitz and mupdfpy packages as you go.
  • Even better would be to make a video of your PyCharm window while you uninstall and re-install PyMuPDF and rerun your test programme.

You have not shown me any details of what happens when you run the venv session i described.

  • Please follow the example commands i gave (starting with py -m venv pylocal) and report back with full information on what happens on your machine.
  • Send the contents of your terminal, showing all the commands you're running and their full output.
  • To ensure it's a clean venv, first delete any existing pylocal/ directory. [I've edited the earlier post to say this.]

More generally, testing in a clean non-PyCharm environment is really important. It will tell us whether the problem is in PyCharm, or elsewhere. Once we know that, we will have more information that can be used to understand what may be going on in PyCharm, and on other people's machines.

Regarding the links you gave:

@gnash1
Author

gnash1 commented Sep 23, 2023

Here are the steps to uninstall and reinstall. TL;DR: no change in outcome, either with a direct uninstall and reinstall or within a local venv.

Here are the suggested steps to recover from 9/21 comment:

• On Python Package (stack icon on lower left hand side of main window), search for fitz and do Delete Package.
• This changes the test programme error to AttributeError: module 'fitz' has no attribute 'open'.
• There is no option to reinstall PyMuPDF.
• Instead, on PyMuPDF do Delete Package and then Install package.
• The test programme works again.

Loading this sample code into a fresh module here is what I have:
https://pymupdf.readthedocs.io/en/latest/the-basics.html#extract-text-from-a-pdf

Fitz is not installed:

image

Initial load shows that open() is not recognized.

image

image

Removing PyMuPDF & PyMuPDFb

image

Removed:
image

In the above screenshot, notice that nothing is out of date (the up arrow is grey, not white). Pip is already up to date, but I am running this again just to be sure.

image

image

image

image

No change in the behavior:

image

Trying the same within a new venv gives the same results. Here are the screenshots showing each step:

Install and select a new pylocal\venv
image

Upgrade all default packages and install PyMuPDF
image

image

fitz.open() is still not recognized.

I just created a new project using the pylocal venv and still experience the same result.
image

image

@julian-smith-artifex-com
Collaborator

Thanks for the screenshots.

For me, when PyCharm runs my project pythonProject, it does so in a venv called C:\Users\jules\PycharmProjects\pythonProject\venv. So i'd expect for you that PyCharm will use a venv called something like C:\Users\...\PycharmProjects\AHT_DataMigration\venv.

So your running of pip in a separate command window will not have affected your PyCharm project. [In fact it looks like it is modifying a separate per-user Python installation.]

If you want to manually add or remove packages in the venv from a separate command window, you need to do so in the PyCharm venv. For example:

C:\Users\...> PycharmProjects\AHT_DataMigration\venv\Scripts\activate
(venv) C:\Users\...> pip install ...
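
One way to confirm which interpreter and venv a script is really using is to have the script report it. Below is a minimal standard-library sketch; `environment_report` is a hypothetical helper name, and it only looks the module up without importing it.

```python
import importlib.util
import sys


def environment_report(module_name="fitz"):
    """Collect facts about the running interpreter and one module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,
        # In a venv, sys.prefix points at the venv, not the base install.
        "in_venv": sys.prefix != sys.base_prefix,
        "module_found": spec is not None,
        "module_origin": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this once from PyCharm and once from the command window shows whether both are using the same interpreter and the same installed copy of the module.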

Separately from that, i don't know why your PyCharm is not installing PyMuPDF correctly. The only thing i can suggest is that you first get into a clean state where import fitz gives an error ModuleNotFoundError: no module named 'fitz'. If you can't do this, then there's a problem with PyCharm, and you'll need to ask the PyCharm people for help.

The other thing is to test completely outside of PyCharm, so i look forward to hearing how you get on with the suggested commands in a pylocal venv. Make sure you delete any existing pylocal directory first, and include a full log from your terminal window.

@gnash1
Author

gnash1 commented Sep 25, 2023

I just retested: I deleted all previous projects, created a new project with a new virtual environment, upgraded all packages, and installed PyMuPDF. I may now have a working setup; however, it uses an import that differs from the one the documentation suggests. Here are some pictures:
image
image
When trying to upgrade pip from the PyCharm console here is what I see, not sure if this is helpful.
image
Base packages updated:
image
PyMuPDF installed
image
Trying out the imports
image
image

image

image

Of the three import options 'from fitz_new import fitz' works.
image

I switched back to the base 3.9-64 Interpreter and the script works.
image

If you don't have any other suggestions I will run with this option, though it is different than the documentation. Possibly consider adding it here if this is the suggested resolution? https://pymupdf.readthedocs.io/en/latest/installation.html

@julian-smith-artifex-com
Collaborator

1 The failure of python -m pip install ... is because you're trying to run the command inside the Python interpreter - see the >>> prompt.

2 I do not trust PyCharm tooltips as seen in your screenshots; i've seen them claim there is an error when the code actually works, specifically with fitz.open(). You need to actually run your programme and look at whatever error messages (if any) are generated by Python itself.

3 I'd still like to see you get PyCharm into a state where running your programme stops at import fitz with error ModuleNotFoundError: no module named 'fitz', and send the screenshot, and then install PyMuPDF and run again.

4 fitz_new is a new implementation of PyMuPDF, see #2680 for details. Doing from fitz_new import fitz will work but:

  • The recommended way of using the new rebased implementation is: import fitz_new as fitz.
  • It's avoiding the original problem with import fitz, instead of understanding and fixing it. I worry that your PyCharm installation is broken, and might cause even more confusing problems later on.

5 As i've said a few times now, we really need to know what happens when you test completely outside of PyCharm:

  • Delete any pylocal/ directory.
  • Run these commands in a Cmd window:
    py -m venv pylocal
    pylocal\Scripts\activate
    python -m pip install --upgrade pip
    pip install pymupdf
    python
    >>> import fitz
    >>> document = fitz.open("zlib.3.pdf")
    >>> for page in document:
    ...     print(page)
    ... 
    page 0 of mupdf/thirdparty/zlib/zlib.3.pdf
    page 1 of mupdf/thirdparty/zlib/zlib.3.pdf
    >>> 
    
  • Send the contents of your terminal, showing all the commands you're running and their full output.

@gnash1
Author

gnash1 commented Sep 25, 2023

Okay weird! Switching to "import fitz" in the virtual environment works even though it looks like it shouldn't with the warning:

image

However, as you said, I hit run (with the error) and it works. I understand that you would like to know how this works outside of PyCharm; would it be possible to jump on a call to explore this? I do not plan to use Python outside PyCharm, so while I am happy to help make this experience better for others, running outside PyCharm isn't really my goal.

@julian-smith-artifex-com
Collaborator

We don't need to do a call. Running outside of PyCharm was never about making the experience better for others - we already know PyMuPDF works fine on many thousands of systems.

Instead, as i've explained, my requests for you to run outside of PyCharm were to better understand what was going on on your machine. It's been quite frustrating to spend so much time trying to help you, but still have these requests repeatedly ignored, and this made the whole process take longer than it should have done.

Anyway, aside from that, i'm glad that PyCharm is finally working for you.

@gnash1
Author

gnash1 commented Sep 25, 2023

Seems like this is working, so not sure how this really helps...

image

@gnash1
Author

gnash1 commented Sep 25, 2023

Rather than having to remember that "import fitz" isn't really an error, I think I will stick with "import fitz_new as fitz". I will also post the solution we come up with on the PyCharm channels, but really that's as much time as I can spend on this. This effort was originally for a PDF conversion that was already accomplished and delivered with Ghostscript when fitz and PyCharm didn't work well together. Without blaming anyone, fitz didn't initially work for me, so I moved on to another option to accomplish the goal. Because I have already delivered, raising the issue with Artifex is in fact so that someone else can see this dialog in the future, and also an FYI from an end user. In my original note, I included others' posts (in different environments) going through the same problematic journey. Leaving fault aside, this is not an isolated issue. Your help has been extremely prompt, which is both impressive and commendable, but at the same time rude to a messenger. Instead of saying things like "we already know PyMuPDF works fine on many thousands of systems", I would hope that you would consider proactively updating the Installation Documentation to mention the following issues you have seen, which the next person down this path would need:

https://pymupdf.readthedocs.io/en/latest/installation.html

Known Issues:

  • Another fitz 0.0.1.dev2 from Erik Kastman package that is a known naming conflict with steps to resolve.
  • PyCharm sometimes incorrectly flags 'fitz.open' as an error; try running anyways.
  • Take ownership and responsibility and let the community know that you have reached out to PyCharm and work with them on this issue. AI may encroach on the technical details of support issues like this one, however I don't see it taking over human connections between vendors, suppliers, and customers that are the backbone to profitable business.

@gnash1
Author

gnash1 commented Sep 26, 2023

Correction on the above. For me "import fitz_new as fitz" didn't work, so I went back to simply "import fitz", which works even though fitz.open() is flagged with a warning in PyCharm. Here is the test function I put together to compress PDFs. It works VERY well and is wicked fast! Thanks for your help getting this working, Julian.

`
import fitz  # Installed via PyMuPDF.

def compress_pdf(input_pdf_path, output_pdf_path, garbage=3, colorspace=fitz.csRGB, dpi=72):
    doc = fitz.open(input_pdf_path)
    doc_new = fitz.open()
    for page in doc:
        pixmap = page.get_pixmap(colorspace=colorspace, dpi=dpi, annots=False)
        new_page = doc_new.new_page(-1)
        xref = new_page.insert_image(rect=new_page.bound(), pixmap=pixmap)
    doc_new.save(output_pdf_path, garbage=garbage, deflate=True, deflate_images=True, deflate_fonts=True, pretty=True)
    doc.close()
    doc_new.close()
`
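
Since compress_pdf re-renders every page as an image, the dpi argument largely determines how much pixel data each page carries. The sketch below is a rough worked calculation, assuming only that PDF page sizes are measured in points (1/72 inch). The helper names are hypothetical, and the exact pixel counts a renderer produces may differ by a pixel because of rounding.

```python
def approx_pixmap_size(width_pt, height_pt, dpi=72):
    """Approximate pixel size of a page rendered at `dpi`.

    Page dimensions are in points (1/72 inch), so the scale is dpi / 72.
    """
    scale = dpi / 72
    return round(width_pt * scale), round(height_pt * scale)


def raw_rgb_bytes(width_px, height_px):
    """Uncompressed size of an 8-bit RGB image without an alpha channel."""
    return width_px * height_px * 3


if __name__ == "__main__":
    for dpi in (72, 150, 300):
        w, h = approx_pixmap_size(612, 792, dpi)  # US Letter, in points
        print(dpi, (w, h), raw_rgb_bytes(w, h))
```

Doubling the dpi roughly quadruples the raw pixel data per page. Whatever the dpi, pages rebuilt this way contain images only, so text in the output is no longer selectable or searchable.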

image

@nattu22

nattu22 commented Jul 9, 2024

import os

import fitz  # Installed via PyMuPDF.

def merge_pdfs(folder_path, files, output_filename, ZipFolderPath):
    print(output_filename, len(files))
    pdf_writer = fitz.open()  # Create an empty PDF to collect the pages

    for file in files:
        pdf_reader = fitz.open(os.path.join(folder_path, file))  # Open each PDF file
        pdf_writer.insert_pdf(pdf_reader)  # Insert pages into the writer
        pdf_reader.close()
    mergedpdf = os.path.join(ZipFolderPath, output_filename)
    pdf_writer.save(mergedpdf, deflate=True, garbage=3)  # Save the merged PDF
    pdf_writer.close()
    print("Pdf merged", mergedpdf)

I am trying to merge 10,000 PDFs into one PDF using this code, but I am getting a "kernel died" error. Can someone suggest how to do this effectively?

@julian-smith-artifex-com
Collaborator

This issue was closed many months ago; please create a new issue for your problem.

Also, we'll need to see the complete output of your programme in order to figure out what is going wrong.

@gnash1
Author

gnash1 commented Jul 9, 2024

I actually went another direction after not getting PyCharm to play along nicely with fitz. Not picking a side here, just going with what works to finish the job.
