Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode atoms using UTF-8 #1006

Merged
merged 0 commits into from
Feb 27, 2024
Merged

Encode atoms using UTF-8 #1006

merged 0 commits into from
Feb 27, 2024

Conversation

fadushin
Copy link
Collaborator

@fadushin fadushin commented Jan 1, 2024

As of OTP-26, atoms are encoded using UTF-8 encoding tags, when using the term_to_binary/1,2 Nif.

This PR adopts the same behavior for AtomVM.
Closes #1004.

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

@fadushin fadushin requested a review from pguyot January 3, 2024 13:17
src/libAtomVM/nifs.c Outdated Show resolved Hide resolved
src/libAtomVM/externalterm.c Outdated Show resolved Hide resolved
src/libAtomVM/bitstring.c Outdated Show resolved Hide resolved
.github/workflows/run-tests-with-beam.yaml Outdated Show resolved Hide resolved
Copy link
Collaborator

@bettio bettio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found another problem, I definitely need to look to a number of details of this PR, since introducing UTF-8 support might be tricky in some subtle ways.

As an additional feedback this change should be mentioned in CHANGELOG, and we need to add details to UPDATING.md, if any manual update is required.

src/libAtomVM/externalterm.c Outdated Show resolved Hide resolved
@fadushin fadushin force-pushed the utf8-atoms branch 2 times, most recently from 46b09dc to 6efc254 Compare January 23, 2024 01:43
@fadushin fadushin requested a review from bettio January 25, 2024 02:11
UPDATING.md Outdated Show resolved Hide resolved
doc/src/programmers-guide.md Outdated Show resolved Hide resolved
@fadushin fadushin force-pushed the utf8-atoms branch 2 times, most recently from 47b409e to 4caa340 Compare January 30, 2024 01:56
@arpunk
Copy link
Contributor

arpunk commented Feb 7, 2024

FWIW I've been sending Erlang terms from current AtomVM master using this branch to OTP 26.x applications without any issue when it comes to atom encoding.

@bettio bettio changed the base branch from main to release-0.6 February 23, 2024 20:11
buf[0] = ATOM_EXT;
WRITE_16_UNALIGNED(buf + 1, atom_len);
atom_table_write_bytes(glb->atom_table, atom_ref, atom_len, buf + 3);
AtomString atom_str = atom_table_get_atom_string(glb->atom_table, atom_index);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should phase out using atom_table_get_atom_string, and use atom_table_write_bytes instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's too bad, because atom_table_write_bytes requires that I allocate a buffer, right? (whereas atom_table_get_atom_string just returns the AtomString in the hash table)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as we don't have atom_table_get_atom_string anymore around, we can enable safe code replace and unload for older versions. But that's another topic.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean by this.


static inline void encode_atom_utf8(uint8_t *buf, atom_ref_t atom_ref, size_t atom_len, int *offset, GlobalContext *glb)
{
if (atom_len <= 255) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't support atoms longer than 255 bytes, so this will never be true.
So I suggest simplifying our code given this assumption.

Note: BEAM allows atoms up to 255 characters (with meaning of 255 unicode codepoints), that might be longer than 255, only at runtime. Any attempt at compiling this will fail (I suppose that's due to atom encoding in BEAM files).

I think we should properly document this limitation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have removed support for atoms longer than 255 bytes. Well, not committed yet, but it will be.

@@ -43,6 +44,7 @@
#define SMALL_BIG_EXT 110
#define EXPORT_EXT 113
#define MAP_EXT 116
#define ATOM_UTF8_EXT 118
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How did you make this kind of atoms?

Trying erlang:term_to_binary('漢字') gives me <<131,119,6,230,188,162,229,173,151>>.

Copy link
Collaborator Author

@fadushin fadushin Feb 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned in chat, we can encode atoms with this encoding in the shell, e.g.,

4> rp(erlang:term_to_binary('∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y')).
<<131,118,1,168,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121,226,136,128,120,226,136,
  131,121,226,136,128,120,226,136,131,121,226,136,128,120,
  226,136,131,121,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121,226,136,128,120,226,136,
  131,121,226,136,128,120,226,136,131,121,226,136,128,120,
  226,136,131,121,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121,226,136,128,120,226,136,
  131,121,226,136,128,120,226,136,131,121,226,136,128,120,
  226,136,131,121,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121,226,136,128,120,226,136,
  131,121,226,136,128,120,226,136,131,121,226,136,128,120,
  226,136,131,121,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121,226,136,128,120,226,136,
  131,121,226,136,128,120,226,136,131,121,226,136,128,120,
  226,136,131,121,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121,226,136,128,120,226,136,
  131,121,226,136,128,120,226,136,131,121,226,136,128,120,
  226,136,131,121,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121,226,136,128,120,226,136,
  131,121,226,136,128,120,226,136,131,121,226,136,128,120,
  226,136,131,121,226,136,128,120,226,136,131,121,226,136,
  128,120,226,136,131,121,226,136,128,120,226,136,131,121,
  226,136,128,120,226,136,131,121>>

and also in the shell, we can decode back to an atom:

6> erlang:display(
6>         erlang:binary_to_term(
6>             <<131,118,1,168,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6>             131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6>             226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6>             131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6>             226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6>             131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6>             226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6>             131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6>             226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6>             131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6>             226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6>             131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6>             226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121,226,136,128,120,226,136,
6>             131,121,226,136,128,120,226,136,131,121,226,136,128,120,
6>             226,136,131,121,226,136,128,120,226,136,131,121,226,136,
6>             128,120,226,136,131,121,226,136,128,120,226,136,131,121,
6>             226,136,128,120,226,136,131,121>>
6>   )).
'∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y∀x∃y'
true

But we fail in AtomVM with

CRASH inary_to_term:
======
pid: <0.1.0>

Stacktrace:
[{test_binary_to_term,start,0,[{file,"/Users/fadushin/work/src/github/fadushin/AtomVM/tests/erlang_tests/test_binary_to_term.erl"},{line,27}]}]

cp: #CP<module: 0, label: 100, offset: 0>

x[0]: error
x[1]: badarg
x[2]: {1,1,90,1,[{0,23}],error}

(using a contrived example) because (it turns out) we don't have a case clause for ATOM_UTF8_EXT in our deserialization code (starting with calculate_heap_usage

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, here is my point about this: if ATOM_UTF8_EXT is used only for encoding atoms that are longer than 255 bytes, there is no point in supporting it, since our atom table assumes that atoms have always length <= 255.

Conversely if ATOM_UTF8_EXT is used sometimes for encoding also short atoms (such as a 5 bytes atom), then it makes sense supporting it.

My assumption (that might be wrong) is that the most compact encoding is always used, so SMALL_ATOM_UTF8_EXT for the atoms that we can support.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, we can limit the implementation to not support ATOM_UTF8_EXT, as there is no appreciable use-case for it, especially for embedded systems.

@@ -43,6 +44,7 @@
#define SMALL_BIG_EXT 110
#define EXPORT_EXT 113
#define MAP_EXT 116
#define ATOM_UTF8_EXT 118
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, here is my point about this: if ATOM_UTF8_EXT is used only for encoding atoms that are longer than 255 bytes, there is no point in supporting it, since our atom table assumes that atoms have always length <= 255.

Conversely if ATOM_UTF8_EXT is used sometimes for encoding also short atoms (such as a 5 bytes atom), then it makes sense supporting it.

My assumption (that might be wrong) is that the most compact encoding is always used, so SMALL_ATOM_UTF8_EXT for the atoms that we can support.

buf[0] = ATOM_EXT;
WRITE_16_UNALIGNED(buf + 1, atom_len);
atom_table_write_bytes(glb->atom_table, atom_ref, atom_len, buf + 3);
AtomString atom_str = atom_table_get_atom_string(glb->atom_table, atom_index);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As soon as we don't have atom_table_get_atom_string anymore around, we can enable safe code replace and unload for older versions. But that's another topic.

tests/erlang_tests/test_binary_to_term.erl Outdated Show resolved Hide resolved
for (size_t i = 0; i < atom_len; ) {
size_t out_len = 0;
uint32_t c;
enum UnicodeTransformDecodeResult res = bitstring_utf8_decode(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

atoms in our atom table always have a valid UTF-8 encoding, so this code will never work.
If you want to know if an atom can be encoded using latin1, you need to check if each codepoint is <= 255.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, the purpose of this function is to determine whether the atom has any extended UTF8 characters, which is part of the enforcement of the following constraint in term_to_binary/2, when {minor_version, 1} is set.

From the OTP docs:

Atoms that can be represented by a latin1 string are encoded using latin1 while only atoms that cannot be represented by latin1 are encoded using utf8.

So the idea is that if {minor_version, 1} is set, and if the atom cannot be encoded as latin1, it will be encoded utf8.

@bettio
Copy link
Collaborator

bettio commented Feb 24, 2024

I think we need to evolve this PR to something a bit different:

About serialization:

  1. we must stop using ATOM_EXT (our implementation wasn't even correct) and replace it with SMALL_ATOM_UTF8_EXT
  2. there is no reason to support latin1 when doing serialization, we never supported it for real (if you try that right now garbage is produced). So I don't think it really makes sense to accept serialization options right now.
  3. We don't need to support ATOM_UTF8_EXT in serialization codepath, since we don't support atoms longer than 255 bytes.

About deserialization:

  1. I'm not sure if exists any practical scenario in which ATOM_UTF8_EXT is used for encoding an atom shorter than 256 bytes, longer atoms cannot be supported from us.
  2. ATOM_EXT must be kept in deserialization codepath, but it must encode received buffer to real utf-8. (I can take care of this)
  3. SMALL_ATOM_UTF8_EXT must verify that received buffer is valid utf8. (I can take care of this)

@fadushin
Copy link
Collaborator Author

About serialization:

  1. we must stop using ATOM_EXT (our implementation wasn't even correct) and replace it with SMALL_ATOM_UTF8_EXT

I believe that is what this PR does, in essence (unless you specify latin1 encoding).

  1. there is no reason to support latin1 when doing serialization, we never supported it for real (if you try that right now garbage is produced). So I don't think it really makes sense to accept serialization options right now.

Interesting. Even if the atom is, for example, ASCII? I mean, it does work in that case.

  1. We don't need to support ATOM_UTF8_EXT in serialization codepath, since we don't support atoms longer than 255 bytes.

I agree.

About deserialization:

  1. I'm not sure if exists any practical scenario in which ATOM_UTF8_EXT is used for encoding an atom shorter than 256 bytes, longer atoms cannot be supported from us.

Agreed.

  1. ATOM_EXT must be kept in deserialization codepath, but it must encode received buffer to real utf-8. (I can take care of this)

Yes, and has not been removed in this PR -- it was always there:

        case ATOM_EXT: {
            uint16_t atom_len = READ_16_UNALIGNED(external_term_buf + 1);

            int global_atom_id = globalcontext_insert_atom_maybe_copy(glb, (AtomString) (external_term_buf + 2), copy);

            *eterm_size = 3 + atom_len;
            return term_from_atom_index(global_atom_id);
        }
  1. SMALL_ATOM_UTF8_EXT must verify that received buffer is valid utf8. (I can take care of this)

Do you mean in parse_external_term?

@bettio bettio changed the base branch from release-0.6 to main February 25, 2024 17:24
@bettio bettio changed the base branch from main to release-0.6 February 25, 2024 17:24
Copy link
Collaborator

@bettio bettio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revert my approval to "request changes". I didn't see the commit called "wip".

@fadushin fadushin force-pushed the utf8-atoms branch 3 times, most recently from 6b001a2 to acd3a57 Compare February 27, 2024 01:51
bettio added a commit that referenced this pull request Feb 27, 2024
Encode atoms using UTF-8

As of OTP-26, atoms are encoded using UTF-8 encoding tags, when using the
`term_to_binary/1,2` Nif.

This PR adopts the same behavior for AtomVM.
Closes #1004.

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later
@bettio bettio merged commit a471bb5 into atomvm:release-0.6 Feb 27, 2024
85 checks passed
@fadushin fadushin deleted the utf8-atoms branch February 27, 2024 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants