Understanding skip = "auto" with serializeVersion = 3 #201

Closed
wlandau opened this issue Feb 23, 2024 · 5 comments

@wlandau
Contributor

wlandau commented Feb 23, 2024

With digest version 0.6.34 and serialization version 3, @shikokuchuo observed that hashes on the same object may differ for different locales:

$ LANG="C" R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "bdef078af943dd2546be047d2044d8b5"

$ R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "a611bfa70eb5dcc0a248ed0369794237"

@shikokuchuo figured out that the extra headers from serialization V3 are hashed along with the contents of the object. (So the hashes agree if serializeVersion = 2 or skip = 23.)

It would be nice to understand this choice for the default skip = "auto". I am not sure if this is really a digest issue because of how it relates to serialize().
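To make the header difference concrete, here is a minimal sketch in base R. It assumes the default binary (XDR) format, where both versions start with the same 14-byte preamble and version 3 then adds a 4-byte length followed by the name of the native encoding. The variable names are purely illustrative.

```r
# Serialize the same object with both format versions (binary XDR, the default).
v2 <- serialize(NULL, NULL, version = 2)
v3 <- serialize(NULL, NULL, version = 3)

# Bytes 1-14 in both: "X\n" plus three 4-byte big-endian integers
# (format version, writer R version, minimum reader R version).
# Version 3 follows with a 4-byte length and the native encoding name.
enc_len  <- readBin(v3[15:18], "integer", size = 4L, endian = "big")
enc_name <- rawToChar(v3[18L + seq_len(enc_len)])
enc_name  # locale dependent, e.g. "UTF-8" or "ANSI_X3.4-1968"

# Once each header is dropped, the remaining payload bytes are the same.
same_payload <- identical(v2[-(1:14)], v3[-seq_len(18L + enc_len)])
same_payload
```

Because the encoding name has a locale-dependent length, the version 3 header does not have a fixed size, which is why a fixed skip cannot cover every locale.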

@eddelbuettel
Owner

eddelbuettel commented Feb 23, 2024

Interesting. I suspect that this is related to the long (and for its OP, frustrating) discussion in #200. At the end of the day, digest is a fairly straightforward 'collector' and 'dispatcher' of a given serialization string for a chosen hashing and "digesting" algorithm (among a moderately large and complete selection of such algorithms).

So if and when you are in situations where the raw bytes from serialize() differ, as I suspect they do here, there is not much we can do apart from pointing upstream ...

Recall that in different locales we may indeed be handed different strings from R as a result of our own choices, so seeing a difference strikes me as quite plausible.

$ Rscript -e 'cat(serialize(1L, NULL, TRUE, version=3), "\n")'
41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 31 33 0a 31 0a 31 0a 
$ LANG="C" Rscript -e 'cat(serialize(1L, NULL, TRUE, version=3), "\n")'
41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 31 34 0a 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 0a 31 33 0a 31 0a 31 0a 

If you use version=2 the issue seems to go away as you say.
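The two dumps above are easier to compare once decoded. The following sketch turns the posted hex bytes back into the newline-separated fields of the ASCII serialization format; `decode()` is a hypothetical helper written for this illustration, not part of digest or base R.

```r
# Hex dumps copied verbatim from the two commands above.
utf8  <- "41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 35 0a 55 54 46 2d 38 0a 31 33 0a 31 0a 31 0a"
c_loc <- "41 0a 33 0a 32 36 32 39 31 34 0a 31 39 37 38 38 38 0a 31 34 0a 41 4e 53 49 5f 58 33 2e 34 2d 31 39 36 38 0a 31 33 0a 31 0a 31 0a"

# Hypothetical helper: hex string -> raw bytes -> one field per line.
decode <- function(hex) {
  bytes <- as.raw(strtoi(strsplit(hex, " ", fixed = TRUE)[[1]], base = 16L))
  strsplit(rawToChar(bytes), "\n", fixed = TRUE)[[1]]
}

f_utf8 <- decode(utf8)
f_c    <- decode(c_loc)

# Fields 1-6 are the header: format marker, format version, writer R version,
# minimum reader R version, encoding-name length, encoding name.
f_utf8[1:6]
f_c[1:6]

# Fields 7-9 are the payload for 1L and match across the two locales.
identical(f_utf8[7:9], f_c[7:9])
```

Only the encoding-name length and the name itself (fields 5 and 6) differ between the two runs, "5" and "UTF-8" versus "14" and "ANSI_X3.4-1968", so the hash difference comes entirely from the header.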

@wlandau
Contributor Author

wlandau commented Feb 26, 2024

Thanks for explaining.

@wlandau wlandau closed this as completed Feb 26, 2024
@eddelbuettel
Owner

eddelbuettel commented Feb 26, 2024

We could possibly set up a more 'discerning digest' that strips what it can. Might be worth discussing. I of course see why serialize() does what it does and see that as a good default for digest() -- after all, it should be a digest of what R thinks of an object, and differ when R thinks the object differs -- but there may be cases where we want just a subset.

But I don't right now see a way to strip something like LANG without doing callr gymnastics which may be a bridge too far. Any thoughts or ideas from your end?

PS One added complication is that some environment variables that govern the process are hard / impossible to alter once the process (for us: the R session) is running. Hm.

@wlandau
Contributor Author

wlandau commented Feb 26, 2024

> But I don't right now see a way to strip something like LANG without doing callr gymnastics which may be a bridge too far. Any thoughts or ideas from your end?

I'm afraid I don't understand enough about what's happening at this depth, but in secretbase, @shikokuchuo apparently found a way to robustly remove headers without relying on a fixed number of bytes. Cf. shikokuchuo/secretbase#5 (comment), shikokuchuo/secretbase#5 (comment)
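For illustration only, one way to drop the header without hard-coding a byte count is to read the header's own length fields. This is a sketch under the assumption of the binary (XDR) format and is not necessarily how secretbase does it; `strip_header()` is a hypothetical helper, not an existing function in digest or secretbase.

```r
# Hypothetical helper: drop the serialization header from a raw vector
# produced by serialize(x, NULL) in the default binary (XDR) format.
strip_header <- function(x) {
  stopifnot(is.raw(x), length(x) >= 14L,
            identical(x[1:2], as.raw(c(0x58, 0x0a))))  # "X\n" format marker
  version <- readBin(x[3:6], "integer", size = 4L, endian = "big")
  header_len <- 14L
  if (version >= 3L) {
    # Version 3 appends a 4-byte length and the native encoding name.
    enc_len <- readBin(x[15:18], "integer", size = 4L, endian = "big")
    header_len <- header_len + 4L + enc_len
  }
  x[-seq_len(header_len)]
}

p2 <- strip_header(serialize(NULL, NULL, version = 2))
p3 <- strip_header(serialize(NULL, NULL, version = 3))
identical(p2, p3)

# Hash only the payload bytes, if digest is installed.
if (requireNamespace("digest", quietly = TRUE)) {
  digest::digest(p3, algo = "md5", serialize = FALSE)
}
```

This removes only the header. Anything locale-dependent inside the payload itself, such as encoding marks on strings, would still be hashed.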

@eddelbuettel
Owner

Yes, customizing consumption of what comes from serialize() would be one way. Possibly not the lowest-risk approach, but possibly also the only one.

Note that I have already 'borrowed' a snapshot of the serialization API in the RApiSerialize package (for use with Redis and others), so that may be a way too. But I won't have time to dig there anytime soon.


2 participants