Char Unicode Data and Conversions #661
I'm currently working on emitting JSON in Grain, and for escaping I need to generate a UTF-16 surrogate pair from a Unicode code point. And of course vice versa, but parsing is still a long way off.
@cician My question is a tad unrelated to this issue, but I'm not sure I understand: for what you're trying to accomplish, why do you need to make surrogate pairs? Grain strings are UTF-8.
Actually, I don't strictly need it, because only ASCII codes 0-31 need to be escaped for conforming JSON output in UTF-8, but I've tentatively added an option to escape all non-ASCII characters. The ECMA-404 spec (https://www.ecma-international.org/publications-and-standards/standards/ecma-404/) says the escaping should be done in UTF-16 pairs, unless I misunderstand something. I'm learning about both Unicode and Grain in the process. I think it's a consequence of the fact that JSON inherits some properties from JavaScript, which doesn't use UTF-8 internally. That spills over into how escaping is done in JavaScript strings, and thus in JSON. PS: I'm working on it here.
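As a concrete illustration of that escape-all-non-ASCII mode, here is a minimal sketch in Rust (not the code from the branch linked above; the function name and the exact escape set are illustrative choices). Code points above U+FFFF come out as two \uXXXX escapes, i.e. the UTF-16 surrogate pair that ECMA-404 prescribes:

```rust
/// Escape a string for use inside a JSON string literal, escaping
/// everything outside printable ASCII. Illustrative only; a production
/// encoder would prefer the short escapes (\n, \t, ...) where they exist.
fn escape_json(s: &str) -> String {
    let mut out = String::new();
    for c in s.chars() {
        if (c.is_ascii_graphic() || c == ' ') && c != '"' && c != '\\' {
            out.push(c);
        } else {
            // encode_utf16 yields one unit for BMP characters and two
            // (a surrogate pair) for code points above U+FFFF.
            let mut buf = [0u16; 2];
            for unit in c.encode_utf16(&mut buf) {
                out.push_str(&format!("\\u{:04X}", unit));
            }
        }
    }
    out
}

fn main() {
    // U+1F600 becomes the surrogate pair D83D/DE00.
    assert_eq!(escape_json("a😀"), "a\\uD83D\\uDE00");
}
```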
Ah I see, it's the specification for Unicode character escapes that appear within JSON object strings. Got it. That's interesting! So you'd want a utility like
Or I guess it could just be called
For now I've just copied a few lines from OpenJDK's source to do the job, but I should probably remove that to avoid copyright/licensing issues. I don't think escapeUtf16 makes much sense as a standalone function, as opposed to being part of the JSON-specific code, unless we want to build a library like this: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html. In Java's standard library there are simply two functions like this:

```java
char highSurrogate(int codePoint);
char lowSurrogate(int codePoint);
```

In Grain it wouldn't make sense to return Char, though. These would rather be just numbers with their own specific meaning in Unicode parlance.
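For reference, the arithmetic behind those two Java functions is compact enough to show in full. Here is a sketch in Rust of the standard UTF-16 split (the name is illustrative, and this is the textbook encoding rather than OpenJDK's code, which sidesteps the licensing concern):

```rust
/// Split a supplementary-plane code point (U+10000..=U+10FFFF) into its
/// UTF-16 surrogate pair, returned as plain integers.
fn surrogate_pair(cp: u32) -> (u16, u16) {
    assert!((0x10000..=0x10FFFF).contains(&cp), "not a supplementary code point");
    let offset = cp - 0x10000;            // 20 bits of payload
    let high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
    let low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
    (high as u16, low as u16)
}

fn main() {
    // U+1F600 (grinning face) is D83D DE00 in UTF-16.
    assert_eq!(surrogate_pair(0x1F600), (0xD83D, 0xDE00));
}
```

A high surrogate always lands in 0xD800-0xDBFF and a low surrogate in 0xDC00-0xDFFF, which is why returning plain numbers, as suggested above for Grain, loses no information.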
@peblair I am currently trying to implement the Unicode-aware functions by generating code based on the Unicode data files (for example https://unicode.org/Public/UNIDATA/UnicodeData.txt). This results in several thousand lines of Map.set code, and I read in the contributing instructions that it should all be contained in a single file. Can I extract the code to another file for readability purposes, or should I just put it all in the char file?
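To make the shape of such a generator concrete, here is a minimal sketch in Rust under a few assumptions: the field indices follow the published UnicodeData.txt layout (field 0 is the code point, field 12 the Simple_Uppercase_Mapping), and the emitted `Map.set(toUpperMap, ...)` line is a hypothetical stand-in for whatever the real generated Grain code looks like:

```rust
use std::fs;

/// Read UnicodeData.txt and emit one Grain-style Map.set line per code
/// point that has a simple uppercase mapping. Purely illustrative.
fn main() -> std::io::Result<()> {
    let data = fs::read_to_string("UnicodeData.txt")?;
    for line in data.lines() {
        let fields: Vec<&str> = line.split(';').collect();
        // Field 12 is Simple_Uppercase_Mapping; empty when there is none.
        if fields.len() > 12 && !fields[12].is_empty() {
            println!("Map.set(toUpperMap, 0x{}, 0x{})", fields[0], fields[12]);
        }
    }
    Ok(())
}
```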
@FinnRG Thanks for doing some work on this! I think it would make sense to have the data in a separate file, but we may want to hold off on the effort briefly. Once #1330 lands, we will have a more coherent way of working with WASM
Rust has a little tool for generating efficient bitsets and functions from the spec: https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator. I think with a minimal amount of work we could have it generate Grain code instead. Digging deeper, https://here-be-braces.com/fast-lookup-of-unicode-properties/ looks to cover the standard method for efficiently storing all the Unicode data: it basically uses a series of lookup tables to compress the data by up to 99% while keeping lookups extremely fast. I think what we want is a small build tool that can bundle the Unicode data into a series of tables to be used for this.
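To illustrate the multi-stage scheme from that article, here is a toy two-stage bitset lookup in Rust. The table contents below are placeholders rather than real Unicode data; a real generator would split the code space into fixed-size chunks and store each distinct chunk only once, which is where the compression comes from:

```rust
const CHUNK: u32 = 64;

// Stage 1 maps a chunk index to an index into the deduplicated chunks.
// Stage 2 holds one 64-bit bitset per *distinct* chunk. Placeholder data:
// pretend only code points 64..=127 have the property.
const STAGE1: [u8; 4] = [0, 1, 0, 0];
const STAGE2: [u64; 2] = [0, u64::MAX];

fn has_property(cp: u32) -> bool {
    let chunk = (cp / CHUNK) as usize;
    if chunk >= STAGE1.len() {
        return false; // past the end of the table: property absent
    }
    let bits = STAGE2[STAGE1[chunk] as usize];
    (bits >> (cp % CHUNK)) & 1 == 1
}

fn main() {
    assert!(!has_property(10));
    assert!(has_property(100));
    assert!(!has_property(200));
}
```

With real data, long runs of code points sharing a property value collapse into repeated stage-1 entries that all point at the same stage-2 chunk.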
It would be useful for `Char` to support a variety of Unicode-aware query functions and conversion functions (`toUpper`, `isPunctuation`, etc.). For example, these are the ones supported by Racket here and here. We should try to add as many of these as possible, as one never knows what might be useful for libraries.
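For a sense of that API surface, these are rough equivalents in Rust's standard `char` type, shown purely as a comparison point (the eventual Grain names and signatures could differ):

```rust
fn main() {
    // Unicode-aware queries...
    assert!('Ω'.is_alphabetic());
    assert!('७'.is_numeric()); // DEVANAGARI DIGIT SEVEN
    // ...and conversions; note that full case mapping can expand to
    // multiple characters, e.g. ß -> SS.
    assert_eq!('ß'.to_uppercase().collect::<String>(), "SS");
}
```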