Memory leaks when merging PDFs #3201

Closed
cormier opened this issue Feb 23, 2024 · 3 comments
Labels
fix developed (release schedule to be determined), Fixed in next release, upstream bug (bug outside this package)

cormier commented Feb 23, 2024

Description of the bug

Hello,

First of all, thank you for all the work you've been putting into this project. Last November, I reported a minor memory leak issue related to the save() function, which was promptly addressed and fixed. Thank you for that. However, I've encountered memory leaks under different scenarios since that fix.

Issue Overview:

I've observed memory leaks when merging PDFs. To investigate, I created a test suite of 150 random PDFs from our dataset. Of these, 44 leaked memory when merged with another PDF. The attached archive includes 42 of them (I had to remove 2 to respect GitHub's file size cap). Unfortunately, I couldn't pinpoint the specific issue within each PDF.

reproduce_leaks.tar.gz

Contents of the provided archive

  • A tests directory, containing a subdirectory for each included test case where a leak was observed.
  • reproduce.py: A script (referenced in Experiencing small memory leak in save() #2791) that executes 500 merges for each test case to simulate the issue.
  • test_leaks.py: A script that automates the running of reproduce.py across all identified leaking cases.
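
For readers without the archive, the kind of loop that reproduce.py runs could be sketched roughly as follows. This is not the script from the archive: `merge_once`, `peak_rss_kib`, and `rss_growth_kib` are hypothetical names, the sketch assumes PyMuPDF is installed and importable as `fitz`, and it assumes Linux, where `ru_maxrss` is reported in KiB. The original test records memory with mprof rather than with `resource`.

```python
import resource


def merge_once(cover_path, content_path):
    """Merge two PDFs in memory and return the merged document as bytes."""
    import fitz  # PyMuPDF; imported lazily so the harness below has no hard dependency

    with fitz.open(cover_path) as merged, fitz.open(content_path) as content:
        merged.insert_pdf(content)  # append all pages of content to the cover page
        return merged.tobytes()


def peak_rss_kib():
    """Peak resident set size of this process (KiB on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth_kib(fn, iterations=500, warmup=20):
    """Call fn repeatedly and return how much the peak RSS grew after warm-up."""
    for _ in range(warmup):  # let caches and allocator pools settle first
        fn()
    before = peak_rss_kib()
    for _ in range(iterations):
        fn()
    return peak_rss_kib() - before
```

A call such as `rss_growth_kib(lambda: merge_once("coverpage.pdf", "content.pdf"))` returns the growth for one test case. Because peak RSS never decreases, a small positive value is normal; growth that scales with the number of iterations is what would suggest a leak.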

How to reproduce the bug

  • Unpack the archive into a directory of your choice.
  • Within a Python 3.11 virtual environment, install the required dependencies using pip install -r requirements.txt.
  • Execute the command python test_leaks.py run.

Expected files after execution

Each test case directory will include:

  • content.pdf: The PDF file used in the test.
  • coverpage.pdf: A common PDF file merged with content.pdf in each test, identical across all tests.
  • plot.dat: Memory usage data, which can be visualized with mprof.

To review the memory usage graphs, please use the following command from within the tests directory: for test in $(ls); do mprof plot $test/plot.dat; done
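
If plotting is inconvenient, the same data can be summarized numerically. The sketch below assumes that mprof's data file contains lines of the form `MEM <memory in MiB> <timestamp>` alongside other record types such as `CMDLINE`; that layout is an assumption about memory_profiler's output, and the function names are hypothetical.

```python
def read_mem_samples(lines):
    """Return (timestamp, memory_mib) pairs from the MEM lines of an mprof data file."""
    samples = []
    for line in lines:
        fields = line.split()
        # Assumed layout: "MEM <memory in MiB> <timestamp>"; other lines are skipped.
        if len(fields) >= 3 and fields[0] == "MEM":
            samples.append((float(fields[2]), float(fields[1])))
    return samples


def memory_growth_mib(samples):
    """Difference between the last and the first memory sample, ordered by time."""
    if len(samples) < 2:
        return 0.0
    ordered = sorted(samples)
    return ordered[-1][1] - ordered[0][1]


def growth_for_file(path):
    """Convenience wrapper: read one plot.dat file and report its memory growth."""
    with open(path) as handle:
        return memory_growth_mib(read_mem_samples(handle))
```

Calling `growth_for_file("plot.dat")` inside a test case directory gives a single number per case, which makes it easier to compare runs before and after a fix than reading each graph by eye.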

I hope this information aids in troubleshooting the issue. I'm available to provide any further assistance that might be helpful in resolving this.

PyMuPDF version

1.23.25

Operating system

Linux

Python version

3.11

julian-smith-artifex-com (Collaborator) commented

I think I might have found a bug in MuPDF's C++ bindings that could be causing these leaks.

julian-smith-artifex-com (Collaborator) commented

The fix has been pushed to MuPDF branch master.

We're hoping that a new MuPDF release branch will be made soon from MuPDF master, which should fix the issue for the subsequent PyMuPDF release.

[The current PyMuPDF branch main in git works with MuPDF branch master so if you want to try the fix before then, you could use the instructions at https://pymupdf.readthedocs.io/en/latest/installation.html#build-and-install-from-local-pymupdf-checkout-and-optional-local-mupdf-checkout to build PyMuPDF yourself.]

julian-smith-artifex-com added the labels upstream bug (bug outside this package) and fix developed (release schedule to be determined) on Feb 29, 2024
julian-smith-artifex-com (Collaborator) commented

Fixed in 1.24.0.
