-
-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can I know in advance what the output size will be from the input in unishox2? #55
Comments
Usually same size as input is sufficient unless the input is binary data or wierd text that would expand the original input. Right now there is no such function in the implementation. |
Is there a safe, upper-bound for how much Unishox2 might expand worst-case input? |
It is not possible to derive the upper-bound due to the nature of compression. For example, if we were to compress repeating sequences of any character, say '-', regardless of the number of characters, even if in megabytes, the compressed byte count won't exceed 5 or 6 bytes. If it is not possible to allocate an arbitrary sized buffer, there are 2 options:
|
Thanks - this works well for decompression. For compression, |
Theoretically speaking, the highest number of bits is occupied by the letter Z or #, both of which take up 12 bits. This disregards dictionary, delta and repeating character encodings as they will all fall within this worst case occupancy. Assuming a combination of such worst case letters, I would say that the upper bound is 1.5n, where n is the number of letters in the string. However, practically most real life strings will fit within n and those exceeding n would be computer generated or hypothetical strings. So I am not sure if the upper bound is strictly necessary for practical purposes. In most cases allocating a buffer which is a bit more than n is sufficiently inclusive. This is the reason I added the adjective "Guaranteed compression" initially which I removed later as it was a bit misleading. |
We're considering using Unishox2 to compress some short texts containing user-generated strings. Please consider this an API request for a function, taking parameter of number of input bytes, that computes the worst-case compression output buffer needed. Would it just be |
@bryghtlabs-richard Yes the formula you have mentioned is quite alright although as much is not needed unless the user will generate uncompressible strings. There is a parameter in the API to mention the length of output buffer so that Unishox2 will not write beyond this limit. |
How big buffer should be allocated for output before calling compress and decompress function?
The text was updated successfully, but these errors were encountered: