zimdump: new function to analyze ZIM content size with hints about compression factor #957

Open
benoit74 opened this issue Feb 24, 2025 · 4 comments


@benoit74

benoit74 commented Feb 24, 2025

For openzim/mwoffliner#2180 I had to analyze the ZIM content.

I did it with the python-libzim binding because I'm far more comfortable with it.

The struggle I had (which luckily was not a blocker) is that while it is possible to access an Item's size (uncompressed, AFAIK), I did not find any way to get its compressed size. It was hence hard to be 100% sure where the increased ZIM size came from.
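For illustration, here is a minimal sketch of the kind of analysis that is possible today with uncompressed sizes only: summing item sizes per MIME type. The helper names `sizes_by_mimetype` and `iter_zim_items` are hypothetical. The reader part assumes python-libzim's `Archive`, `all_entry_count`, and the private `_get_entry_by_id` method behave as described in the comments, which may not hold for every release.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def sizes_by_mimetype(items: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum uncompressed item sizes (in bytes) per MIME type."""
    totals: Dict[str, int] = defaultdict(int)
    for mimetype, size in items:
        totals[mimetype] += size
    return dict(totals)


def iter_zim_items(zim_path: str) -> Iterator[Tuple[str, int]]:
    """Yield (mimetype, uncompressed size) for every non-redirect entry.

    Assumption: relies on python-libzim's private Archive._get_entry_by_id,
    which is not a stable API and may change between releases.
    """
    # Imported lazily so the aggregation helper works without python-libzim.
    from libzim.reader import Archive

    archive = Archive(zim_path)
    for entry_id in range(archive.all_entry_count):
        entry = archive._get_entry_by_id(entry_id)
        if entry.is_redirect:
            continue  # redirects carry no content of their own
        item = entry.get_item()
        yield item.mimetype, item.size
```

With a real file, `sizes_by_mimetype(iter_zim_items("some.zim"))` (hypothetical path) would give the uncompressed totals, which is exactly where the analysis stops short: nothing here says how well each category compresses.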

Is that mostly normal since there is no such compressed size, because we only compress the cluster, not individual items? Or is it just something which is missing in the binding(s)? Should I have used another tool / zimtool to do this analysis?

At least having a rough estimation of compression factor for every item would help to analyze a bit deeper such situations. Maybe simply exposing clusters, and which cluster is used by which item, and every cluster compression factor (compressed and uncompressed size for instance) would be sufficient.
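A rough sketch of how such data could be used, if it were exposed: each item's compressed size is estimated by applying its cluster's overall compression ratio to the item's uncompressed size. Everything here is hypothetical, including `ClusterInfo`, the item-to-cluster mapping, and `estimate_compressed_sizes`. None of it exists in libzim or its bindings today, and the proportional split is only an approximation, since items in the same cluster do not necessarily compress equally well.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ClusterInfo:
    """Hypothetical per-cluster statistics (not exposed by libzim bindings)."""
    compressed_size: int    # bytes on disk
    uncompressed_size: int  # bytes after decompression

    @property
    def ratio(self) -> float:
        """Compressed/uncompressed ratio; 1.0 for an empty cluster."""
        if self.uncompressed_size == 0:
            return 1.0
        return self.compressed_size / self.uncompressed_size


def estimate_compressed_sizes(
    item_sizes: Dict[str, int],
    item_cluster: Dict[str, int],
    clusters: Dict[int, ClusterInfo],
) -> Dict[str, float]:
    """Estimate each item's compressed size from its cluster's ratio.

    item_sizes maps item path -> uncompressed size,
    item_cluster maps item path -> cluster index.
    """
    return {
        path: size * clusters[item_cluster[path]].ratio
        for path, size in item_sizes.items()
    }
```

The estimates within one cluster add up to that cluster's compressed size, so totals per MIME type or per path prefix would stay consistent with the actual file size (ignoring headers and indexes).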

@veloman-yunkan
Collaborator

Is that mostly normal since there is no such compressed size, because we only compress the cluster, not individual items?

Exactly.

@veloman-yunkan
Collaborator

At least having a rough estimation of compression factor for every item would help to analyze a bit deeper such situations. Maybe simply exposing clusters, and which cluster is used by which item, and every cluster compression factor (compressed and uncompressed size for instance) would be sufficient.

Another option is to output proper statistics during/after ZIM creation.

@benoit74
Author

Another option is to output proper statistics during/after ZIM creation.

Or with a dedicated tool (here I want to analyze a ZIM right after creation, but I can imagine other scenarios where some time has passed and the creation logs are no longer available).

Maybe a feature for zimdump?

@kelson42
Contributor

Maybe a feature for zimdump?

This is what I thought when I first read your issue.

@benoit74 benoit74 changed the title from "How to analyze ZIM content size?" to "zimdump: new function to analyze ZIM content size with hints about compression factor" Feb 27, 2025