
Can I know in advance what the output size will be from the input in unishox2? #55

Open
NuttyNull opened this issue Aug 21, 2023 · 7 comments


@NuttyNull

How big a buffer should be allocated for the output before calling the compress and decompress functions?

@siara-cc
Owner

Usually a buffer the same size as the input is sufficient, unless the input is binary data or weird text that would expand rather than compress. Right now there is no such function in the implementation.

@bryghtlabs-richard

Is there a safe upper bound on how much Unishox2 might expand worst-case input?

@siara-cc

siara-cc commented Feb 4, 2025

It is not possible to derive a tight bound in advance due to the nature of compression. For example, compressing a repeating sequence of any character, say '-', produces no more than 5 or 6 bytes regardless of the input length, even if it runs to megabytes.

If it is not possible to allocate an arbitrarily sized buffer, there are two options:

  • Store the size of the original buffer as a variable-length integer so the exact size can be allocated before decompression
  • Store the compression ratio as a code so that a slightly larger buffer can be allocated before decompression

@bryghtlabs-richard

> Store the size of the original buffer as a variable-length integer so the exact size can be allocated before decompression

Thanks - this works well for decompression.

For compression, where unishox2_compress_simple() is given N random bytes, is there a method that can compute the worst-case 'compressed' data size, so that the needed output buffer can be allocated safely?

@siara-cc

siara-cc commented Feb 4, 2025

Theoretically speaking, the highest number of bits is occupied by the letter Z or #, both of which take up 12 bits. This disregards dictionary, delta, and repeating-character encodings, as they all fall within this worst-case occupancy.

Assuming a string made entirely of such worst-case letters, I would say the upper bound is 1.5n bytes, where n is the number of letters in the string.

In practice, however, most real-life strings will fit within n bytes, and those exceeding n would be computer-generated or hypothetical strings.

So I am not sure the upper bound is strictly necessary for practical purposes. In most cases, allocating a buffer slightly larger than n is sufficient. This is the reason I initially used the adjective "guaranteed compression", which I later removed as it was a bit misleading.

@bryghtlabs-richard

bryghtlabs-richard commented Feb 4, 2025

We're considering using Unishox2 to compress some short texts containing user-generated strings. Please consider this an API request for a function, taking the number of input bytes as a parameter, that computes the worst-case compression output buffer needed. Would it just be (int)(N * 1.5 + 1)?

@siara-cc

siara-cc commented Feb 5, 2025

@bryghtlabs-richard Yes, the formula you mention is fine, although that much is not needed unless users generate incompressible strings. There is also a parameter in the API for specifying the length of the output buffer, so that Unishox2 will not write beyond this limit.


3 participants