
Char Unicode Data and Conversions #661

Open
peblair opened this issue May 17, 2021 · 9 comments

Comments

peblair (Member) commented May 17, 2021

It would be useful for Char to support a variety of Unicode-aware query functions and conversion functions (toUpper, isPunctuation, etc.). For example, these are the ones supported by Racket here and here.

We should try to add as many of these as possible, as one never knows what might be useful for libraries.

cician (Contributor) commented May 29, 2021

I'm currently working on emitting JSON in Grain, and for escaping I need to generate a UTF-16 surrogate pair from a Unicode code point. And of course vice versa, though parsing is still a long way off.

ospencer (Member) commented

@cician My question is a tad unrelated to this issue, but I'm not sure I understand—for what you're trying to accomplish, why do you need to make surrogate pairs? Grain strings are UTF-8.

cician (Contributor) commented May 30, 2021

Actually, I don't strictly need it, because only ASCII codes 0–31 need to be escaped for conforming JSON output in UTF-8, but I've tentatively added an option to escape all non-ASCII characters.

The ECMA-404 spec (https://www.ecma-international.org/publications-and-standards/standards/ecma-404/) says the escaping should be done as UTF-16 pairs, unless I'm misunderstanding something. I'm learning about both Unicode and Grain in the process. I think it's a consequence of JSON inheriting some properties from JavaScript, which doesn't use UTF-8 internally; that spills over into how escaping is done in JavaScript strings, and thus in JSON.

PS: I'm working on it here.

ospencer (Member) commented

Ah I see, it's the specification for Unicode character escapes that appear within JSON object strings. Got it. That's interesting! So you'd want a utility like Char.escapeSurrogatePair : Char -> String that would take a char and return its Unicode escape as a surrogate pair, e.g. assert Char.escapeSurrogatePair('𝄞') == "\\uD834\\uDD1E"? That'd differ from Char.escape, which would just produce "\\u{1D11E}" for regular Grain strings, yeah?
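For illustration, here's a minimal Java sketch of that proposed behavior (escapeSurrogatePair is just the name suggested above, not an existing API; Java's real Character.highSurrogate/lowSurrogate stand in for whatever Grain would provide):

static String escapeSurrogatePair(int codePoint) {
    // Format each half of the UTF-16 surrogate pair as a \uXXXX escape.
    return String.format("\\u%04X\\u%04X",
        (int) Character.highSurrogate(codePoint),
        (int) Character.lowSurrogate(codePoint));
}

// escapeSurrogatePair(0x1D11E) == "\\uD834\\uDD1E"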

ospencer (Member) commented May 30, 2021

Or I guess it could just be called escapeUtf16.

cician (Contributor) commented May 30, 2021

For now I've just copied a few lines from OpenJDK's source to do the job, but I should probably remove it to avoid copyright/licensing issues.

I don't think escapeUtf16 makes much sense as a standalone function as opposed to being part of the JSON-specific code, unless we want to build a library like this: https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

In Java's standard library there are simply two functions like this:

char highSurrogate(int codePoint);
char lowSurrogate(int codePoint);

In Grain it wouldn't make sense to return Char, though. These would rather be plain numbers with their own specific meaning in Unicode parlance.
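The math behind those two is small. A sketch, assuming the standard UTF-16 encoding of code points above U+FFFF (returning plain ints here, per the point above about returning numbers rather than Char):

static int highSurrogate(int codePoint) {
    // Top 10 bits of (codePoint - 0x10000), offset into the D800..DBFF range.
    return 0xD800 + ((codePoint - 0x10000) >>> 10);
}

static int lowSurrogate(int codePoint) {
    // Low 10 bits of (codePoint - 0x10000), offset into the DC00..DFFF range.
    return 0xDC00 + ((codePoint - 0x10000) & 0x3FF);
}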

FinnRG (Contributor) commented Jun 18, 2022

@peblair I am currently trying to implement the Unicode-aware functions by generating code based on the Unicode data files (for example https://unicode.org/Public/UNIDATA/UnicodeData.txt). This results in several thousand lines of Map.set code, and I read in the contributing instructions that it should all be contained in a single file. Can I extract the code to another file for readability purposes, or should I just put it all in the char file?
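For context, the generation step looks roughly like this (a Java sketch; GenCharTables and toUpperMap are placeholder names, and it assumes the standard UnicodeData.txt layout of 15 semicolon-separated fields, where field 0 is the code point in hex, field 2 the general category, and field 12 the simple uppercase mapping):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class GenCharTables {
    public static void main(String[] args) throws IOException {
        for (String line : Files.readAllLines(Path.of("UnicodeData.txt"))) {
            String[] f = line.split(";", -1);   // keep empty trailing fields
            int codePoint = Integer.parseInt(f[0], 16);
            String category = f[2];             // e.g. "Lu", "Po" (unused in this sketch)
            if (!f[12].isEmpty()) {
                // Emit one Map.set line per simple uppercase mapping.
                System.out.printf("Map.set(toUpperMap, 0x%X, 0x%X)%n",
                    codePoint, Integer.parseInt(f[12], 16));
            }
        }
    }
}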

peblair (Member, Author) commented Jun 18, 2022

@FinnRG Thanks for doing some work on this! I think it would make sense to have the data in a separate file, but we may want to hold off on the effort briefly. Once #1330 lands, we will have a more coherent way of working with WASM data sections in Grain, which I think can give us a much more efficient way of storing the data in UnicodeData.txt (that way we avoid having thousands of Map.set calls on startup).

spotandjake (Member) commented Dec 29, 2023

Rust has a little tool for generating efficient bitsets and functions from the spec: https://github.com/rust-lang/rust/tree/master/src/tools/unicode-table-generator. I think with a minimal amount of work we could have it generate Grain code instead.

Digging deeper, https://here-be-braces.com/fast-lookup-of-unicode-properties/ looks to cover the standard method for efficiently storing all the Unicode data. It basically uses a series of lookup tables to compress the data by up to 99% while keeping lookups extremely fast. I think what we want is to develop a small build tool that can bundle the Unicode data into a series of tables like that; a sketch of the lookup side is below.
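A minimal sketch of that multi-stage lookup, in Java (the names and the 256-code-point block size are assumptions for illustration, not the Rust tool's actual layout):

class PropertyTable {
    // stage1[codePoint >> 8] picks a 256-code-point block; identical blocks
    // are stored only once, which is where most of the compression comes from.
    static boolean lookup(int codePoint, byte[] stage1, long[] blockBits) {
        int block = stage1[codePoint >>> 8] & 0xFF;        // which shared block
        int offset = codePoint & 0xFF;                     // position inside it
        long word = blockBits[block * 4 + (offset >>> 6)]; // 4 longs = 256 bits
        return ((word >>> (offset & 63)) & 1) != 0;
    }
}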
