-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Encode atoms using UTF-8 #1006
Encode atoms using UTF-8 #1006
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found another problem, I definitely need to look to a number of details of this PR, since introducing UTF-8 support might be tricky in some subtle ways.
As an additional feedback this change should be mentioned in CHANGELOG, and we need to add details to UPDATING.md, if any manual update is required.
46b09dc
to
6efc254
Compare
47b409e
to
4caa340
Compare
FWIW I've been sending Erlang terms from current AtomVM |
src/libAtomVM/externalterm.c
Outdated
buf[0] = ATOM_EXT; | ||
WRITE_16_UNALIGNED(buf + 1, atom_len); | ||
atom_table_write_bytes(glb->atom_table, atom_ref, atom_len, buf + 3); | ||
AtomString atom_str = atom_table_get_atom_string(glb->atom_table, atom_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should phase out using atom_table_get_atom_string
, and use atom_table_write_bytes
instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's too bad, because atom_table_write_bytes
requires that I allocate a buffer, right? (whereas atom_table_get_atom_string
just returns the AtomString in the hash table)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as we don't have atom_table_get_atom_string
anymore around, we can enable safe code replace and unload for older versions. But that's another topic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean by this.
src/libAtomVM/externalterm.c
Outdated
|
||
static inline void encode_atom_utf8(uint8_t *buf, atom_ref_t atom_ref, size_t atom_len, int *offset, GlobalContext *glb) | ||
{ | ||
if (atom_len <= 255) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't support atoms longer than 255 bytes, so this will never be true.
So I suggest simplifying our code given this assumption.
Note: BEAM allows atoms up to 255 characters (with meaning of 255 unicode codepoints), that might be longer than 255, only at runtime. Any attempt at compiling this will fail (I suppose that's due to atom encoding in BEAM files).
I think we should properly document this limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have removed support for atoms longer than 255 bytes. Well, not committed yet, but it will be.
src/libAtomVM/externalterm.c
Outdated
@@ -43,6 +44,7 @@ | |||
#define SMALL_BIG_EXT 110 | |||
#define EXPORT_EXT 113 | |||
#define MAP_EXT 116 | |||
#define ATOM_UTF8_EXT 118 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you make this kind of atoms?
Trying erlang:term_to_binary('漢字')
gives me <<131,119,6,230,188,162,229,173,151>>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you mentioned in chat, we can encode atoms with this encoding in the shell, e.g.,
4> rp(erlang:term_to_binary('∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y')).
<<131,118,1,168,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121,226,136,128,120,226,136,
131,121,226,136,128,120,226,136,131,121,226,136,128,120,
226,136,131,121,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121,226,136,128,120,226,136,
131,121,226,136,128,120,226,136,131,121,226,136,128,120,
226,136,131,121,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121,226,136,128,120,226,136,
131,121,226,136,128,120,226,136,131,121,226,136,128,120,
226,136,131,121,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121,226,136,128,120,226,136,
131,121,226,136,128,120,226,136,131,121,226,136,128,120,
226,136,131,121,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121,226,136,128,120,226,136,
131,121,226,136,128,120,226,136,131,121,226,136,128,120,
226,136,131,121,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121,226,136,128,120,226,136,
131,121,226,136,128,120,226,136,131,121,226,136,128,120,
226,136,131,121,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121,226,136,128,120,226,136,
131,121,226,136,128,120,226,136,131,121,226,136,128,120,
226,136,131,121,226,136,128,120,226,136,131,121,226,136,
128,120,226,136,131,121,226,136,128,120,226,136,131,121,
226,136,128,120,226,136,131,121>>
and also in the shell, we can decode back to an atom:
6> erlang:display(
6> erlang:binary_to_term(
6> <<131,118,1,168,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6> 131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6> 226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6> 131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6> 226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6> 131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6> 226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6> 131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6> 226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6> 131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6> 226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6> 131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6> 226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6> 131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6> 226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6> 128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6> 226,136,128,120,226,136,131,121>>
6> )).
'∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y'
true
But we fail in AtomVM with
CRASH inary_to_term:
======
pid: <0.1.0>
Stacktrace:
[{test_binary_to_term,start,0,[{file,"/Users/fadushin/work/src/github/fadushin/AtomVM/tests/erlang_tests/test_binary_to_term.erl"},{line,27}]}]
cp: #CP<module: 0, label: 100, offset: 0>
x[0]: error
x[1]: badarg
x[2]: {1,1,90,1,[{0,23}],error}
(using a contrived example) because (it turns out) we don't have a case clause for ATOM_UTF8_EXT
in our deserialization code (starting with calculate_heap_usage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, here is my point about this: if ATOM_UTF8_EXT
is used only for encoding atoms that are longer than 255 bytes, there is no point in supporting it, since our atom table assumes that atoms have always length <= 255.
Conversely if ATOM_UTF8_EXT
is used sometimes for encoding also short atoms (such as a 5 bytes atom), then it makes sense supporting it.
My assumption (that might be wrong) is that the most compact encoding is always used, so SMALL_ATOM_UTF8_EXT
for the atoms that we can support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we can limit the implementation to not support ATOM_UTF8_EXT
, as there is no appreciable use-case for it, especially for embedded systems.
src/libAtomVM/externalterm.c
Outdated
@@ -43,6 +44,7 @@ | |||
#define SMALL_BIG_EXT 110 | |||
#define EXPORT_EXT 113 | |||
#define MAP_EXT 116 | |||
#define ATOM_UTF8_EXT 118 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, here is my point about this: if ATOM_UTF8_EXT
is used only for encoding atoms that are longer than 255 bytes, there is no point in supporting it, since our atom table assumes that atoms have always length <= 255.
Conversely if ATOM_UTF8_EXT
is used sometimes for encoding also short atoms (such as a 5 bytes atom), then it makes sense supporting it.
My assumption (that might be wrong) is that the most compact encoding is always used, so SMALL_ATOM_UTF8_EXT
for the atoms that we can support.
src/libAtomVM/externalterm.c
Outdated
buf[0] = ATOM_EXT; | ||
WRITE_16_UNALIGNED(buf + 1, atom_len); | ||
atom_table_write_bytes(glb->atom_table, atom_ref, atom_len, buf + 3); | ||
AtomString atom_str = atom_table_get_atom_string(glb->atom_table, atom_index); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As soon as we don't have atom_table_get_atom_string
anymore around, we can enable safe code replace and unload for older versions. But that's another topic.
src/libAtomVM/externalterm.c
Outdated
for (size_t i = 0; i < atom_len; ) { | ||
size_t out_len = 0; | ||
uint32_t c; | ||
enum UnicodeTransformDecodeResult res = bitstring_utf8_decode( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
atoms in our atom table always have a valid UTF-8 encoding, so this code will never work.
If you want to know if an atom can be encoded using latin1, you need to check if each codepoint is <= 255.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, the purpose of this function is to determine whether the atom has any extended UTF8 characters, which is part of the enforcement of the following constraint in term_to_binary/2
, when {minor_version, 1}
is set.
From the OTP docs:
Atoms that can be represented by a latin1 string are encoded using latin1 while only atoms that cannot be represented by latin1 are encoded using utf8.
So the idea is that if {minor_version, 1}
is set, and if the atom cannot be encoded as latin1, it will be encoded utf8.
I think we need to evolve this PR to something a bit different: About serialization:
About deserialization:
|
I believe that is what this PR does, in essence (unless you specify
Interesting. Even if the atom is, for example, ASCII? I mean, it does work in that case.
I agree.
Agreed.
Yes, and has not been removed in this PR -- it was always there:
Do you mean in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revert my approval to "request changes". I didn't see the commit called "wip".
6b001a2
to
acd3a57
Compare
Encode atoms using UTF-8 As of OTP-26, atoms are encoded using UTF-8 encoding tags, when using the `term_to_binary/1,2` Nif. This PR adopts the same behavior for AtomVM. Closes #1004. These changes are made under both the "Apache 2.0" and the "GNU Lesser General Public License 2.1 or later" license terms (dual license). SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
As of OTP-26, atoms are encoded using UTF-8 encoding tags, when using the
term_to_binary/1,2
Nif.This PR adopts the same behavior for AtomVM.
Closes #1004.
These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).
SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later