updated readme
matgat committed Feb 11, 2024
1 parent 08b711a commit ca8ad1b
Showing 1 changed file with 79 additions and 62 deletions.
141 changes: 79 additions & 62 deletions readme.md
@@ -45,14 +45,17 @@ int main()

---
### Decode bytes to *utf-32* string
- Function
- `to_utf32<INENC>(…)` or `to_utf32(…)`
- Inputs
Converts bytes to a *utf-32* string

to_utf32<INENC>(…)
to_utf32(…)

- *Input*
- `std::string_view` encoded as `INENC` or `std::u8string_view`
- Return value
- *Return value*
- `std::u32string`
- Description
- Converts bytes to a *utf-32* string



[example](https://gcc.godbolt.org/z/co81s7PYx)

@@ -65,14 +68,15 @@ std::u32string u32str2 = utxt::to_utf32(u8"..."sv);

---
### Encode *utf-32* to *utf-8*
- Function
- `to_utf8(…)`
- Inputs
Encodes a `char32_t` string or codepoint to a *utf-8* string

to_utf8(…)

- *Input*
- `std::u32string_view` or `char32_t`
- Return value
- *Return value*
- `std::string` bytes encoded as *utf-8* (avoiding `std::u8string` until better support in *stdlib*)
- Description
- Encodes a `char32_t` string or codepoint to a *utf-8* string


[example](https://gcc.godbolt.org/z/c1TbvKdrP)

@@ -83,53 +87,63 @@ std::cout << utxt::to_utf8(U"..."sv)

---
### Encode *utf-32* to bytes
Encodes a `char32_t` string or codepoint

encode_as<OUTENC>(…)

- Function
- `encode_as<OUTENC>(…)`
- `encode_as(OUTENC,…)` (if `OUTENC` is not known at compile time)
- Inputs
- *Inputs*
- `utxt::Enc OUTENC` output encoding
- `std::u32string_view` or `char32_t` codepoints to encode
- Return value
- *Return value*
- `std::string` encoded bytes as `OUTENC`
- Description
- Encodes a `char32_t` string or codepoint.
*The runtime version contains a `switch` that dispatches to the correct template instantiation*

[example](https://gcc.godbolt.org/z/5eMWxar1Y)

```cpp
using enum utxt::Enc;
std::string out_bytes1 = utxt::encode_as<UTF16BE>(U"..."sv);
std::string out_bytes2 = utxt::encode_as(UTF16BE, U"..."sv);
std::string out_bytes = utxt::encode_as<UTF16BE>(U"..."sv);
```

In case `OUTENC` is not known at compile time, there's an
alternate version of this function that chooses the
correct template at runtime:

```cpp
std::string out_bytes = utxt::encode_as(UTF16BE, U"..."sv);
```
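
The runtime dispatch described above can be pictured with a small standalone sketch. It does not use the library: `Enc` and `encode_as_sketch` are hypothetical stand-ins, and only UTF-32LE/BE are handled to keep it short. The point is just that the runtime overload maps an enum value onto the matching template instantiation through a `switch`.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for the library's encoding enum (assumption, reduced to two values)
enum class Enc { UTF32LE, UTF32BE };

// Compile-time version: the output encoding is a template parameter
template<Enc OUTENC>
std::string encode_as_sketch(std::u32string_view text)
{
    std::string out;
    out.reserve(text.size() * 4);
    for (const char32_t cp : text)
    {
        for (int i = 0; i < 4; ++i)
        {
            // Little endian emits the lowest byte first, big endian the highest
            const int shift = (OUTENC == Enc::UTF32LE) ? 8 * i : 8 * (3 - i);
            out.push_back(static_cast<char>((cp >> shift) & 0xFF));
        }
    }
    return out;
}

// Runtime version: a switch selects the matching instantiation
std::string encode_as_sketch(Enc outenc, std::u32string_view text)
{
    switch (outenc)
    {
        case Enc::UTF32LE: return encode_as_sketch<Enc::UTF32LE>(text);
        case Enc::UTF32BE: return encode_as_sketch<Enc::UTF32BE>(text);
    }
    return {}; // not reached for valid enumerators
}
```

Both forms produce the same bytes for the same encoding; the runtime one only pays for the extra `switch`.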


---
### Re-encode bytes detecting input encoding
- Function
- `encode_as<OUTENC>(…)`
- `encode_as(OUTENC,…)` (if `OUTENC` is not known at compile time)
- Inputs
Re-encodes a string of bytes (detecting its encoding) to a given encoding

encode_as<OUTENC>(…)

- *Inputs*
- `utxt::Enc OUTENC` output encoding
- `std::string_view` input bytes of unknown encoding
- `utxt::flags_t` flags: if `flag::SKIP_BOM` is specified, the output won't contain the byte order mark
- Return value
- *Return value*
- `std::string` output bytes encoded as `OUTENC`
- Description
- Re-encodes a string of bytes (detecting its encoding) to a given encoding.
*The runtime version contains a `switch` that dispatches to the correct template instantiation*


[example](https://gcc.godbolt.org/z/jf3Wh9jrK)

```cpp
using enum utxt::Enc;
std::string_view in_bytes = "..."sv;
std::string out_bytes1 = utxt::encode_as<UTF8>(in_bytes);
std::string out_bytes2 = utxt::encode_as(UTF8, in_bytes);
std::string out_bytes = utxt::encode_as<UTF8>(in_bytes);
static_assert( utxt::encode_as<UTF16BE>(U'🔥') == "\xD8\x3D\xDD\x25"sv );
```
In case `OUTENC` is not known at compile time, there's an
alternate version of this function that chooses the
correct template at runtime:
```cpp
std::string out_bytes = utxt::encode_as(UTF8, in_bytes);
```

Alternate functions that take a buffer and return a `string_view`
are provided to skip the re-encoding in case the input and output
encodings are the same:
@@ -140,23 +154,28 @@ encodings are the same:
using enum utxt::Enc;
std::string_view in_bytes = "..."sv;
std::string maybe_reencoded_buf;
std::string_view out_bytes1 = utxt::encode_if_necessary_as<UTF8>(in_bytes, maybe_reencoded_buf);
std::string_view out_bytes2 = utxt::encode_if_necessary_as(UTF8, in_bytes, maybe_reencoded_buf);
std::string_view out_bytes = utxt::encode_if_necessary_as<UTF8>(in_bytes, maybe_reencoded_buf);
```

The corresponding runtime version:

```cpp
std::string_view out_bytes = utxt::encode_if_necessary_as(UTF8, in_bytes, maybe_reencoded_buf);
```
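
To illustrate why these variants take a buffer and return a `string_view`, here is a standalone sketch that does not use the library. `Enc` and `reencode_if_necessary_sketch` are hypothetical names, both encodings are known at compile time (no detection), and only UTF-32LE/BE are handled. When the two encodings match, the returned view points at the input and the buffer is left untouched; otherwise the buffer holds the converted bytes and the view points into it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for the library's encoding enum (assumption, reduced to two values)
enum class Enc { UTF32LE, UTF32BE };

template<Enc INENC, Enc OUTENC>
std::string_view reencode_if_necessary_sketch(std::string_view in_bytes, std::string& buf)
{
    if constexpr (INENC == OUTENC)
    {
        // Same encoding: nothing to do, return a view of the input itself
        return in_bytes;
    }
    else
    {
        // UTF-32LE <-> UTF-32BE: reverse the bytes of each complete 4-byte unit
        buf.clear();
        buf.reserve(in_bytes.size());
        for (std::size_t i = 0; i + 4 <= in_bytes.size(); i += 4)
        {
            for (std::size_t j = 4; j > 0; --j)
            {
                buf.push_back(in_bytes[i + j - 1]);
            }
        }
        return buf; // view into the caller's buffer
    }
}
```

The returned view is only valid while both the input bytes and the buffer stay alive and unmodified.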


---
### Re-encode bytes
- Functions
- `reencode<INENC,OUTENC>(…)`
- Inputs
Re-encodes a string of bytes from one encoding to another

reencode<INENC,OUTENC>(…)

- *Inputs*
- `utxt::Enc INENC` input encoding
- `utxt::Enc OUTENC` output encoding
- `std::string_view` input bytes encoded as `INENC`
- Return value
- *Return value*
- `std::string` output bytes encoded as `OUTENC`
- Description
- Re-encodes a string of bytes from one encoding to another

[example](https://gcc.godbolt.org/z/rrsj6cnf4)

@@ -183,7 +202,6 @@ std::string_view out_bytes = utxt::reencode_if_necessary<INENC,OUTENC>(in_bytes,

---
### `bytes_buffer_t` class

A class that represents a byte stream interpreted with a given encoding.

[example](https://gcc.godbolt.org/z/xM76nYqxW)
@@ -202,15 +220,15 @@ if( bytes_buf.has_bytes() )
---
### Encoding Detection
Detects the encoding of raw bytes:
it only checks for the byte order mark, with no heuristic analysis of the bytes
- Function
- `detect_encoding_of(…)`
- Inputs
detect_encoding_of(…)
- *Input*
- `std::string_view` raw bytes
- Return value
- *Return value*
- `struct{ Enc enc; std::uint8_t bom_size; }`
- Description
- Detects the encoding of raw bytes: it only checks for the byte order mark, with no heuristic analysis of the bytes
[example](https://gcc.godbolt.org/z/3hTa49sbE)
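
For readers unfamiliar with byte order marks, the detection can be sketched in a few lines of standalone code. This is not the library's implementation: `Enc`, `bom_ret_t`, `has_prefix` and `detect_bom_sketch` are hypothetical names, and falling back to UTF-8 when no mark is found is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical stand-ins for the library's types (assumptions)
enum class Enc : std::uint8_t { UTF8, UTF16LE, UTF16BE, UTF32LE, UTF32BE };
struct bom_ret_t { Enc enc; std::uint8_t bom_size; };

// True if bytes begins with prefix (substr clamps when bytes is shorter)
bool has_prefix(std::string_view bytes, std::string_view prefix)
{
    return bytes.substr(0, prefix.size()) == prefix;
}

bom_ret_t detect_bom_sketch(std::string_view bytes)
{
    using namespace std::string_view_literals;
    // The UTF-32LE mark (FF FE 00 00) starts with the UTF-16LE mark (FF FE),
    // so the 4-byte marks must be tested before the 2-byte ones
    if (has_prefix(bytes, "\xFF\xFE\x00\x00"sv)) return {Enc::UTF32LE, 4};
    if (has_prefix(bytes, "\x00\x00\xFE\xFF"sv)) return {Enc::UTF32BE, 4};
    if (has_prefix(bytes, "\xEF\xBB\xBF"sv))     return {Enc::UTF8, 3};
    if (has_prefix(bytes, "\xFF\xFE"sv))         return {Enc::UTF16LE, 2};
    if (has_prefix(bytes, "\xFE\xFF"sv))         return {Enc::UTF16BE, 2};
    return {Enc::UTF8, 0}; // no mark found: assume UTF-8 (assumption of this sketch)
}
```

The order of the checks matters: testing the 2-byte marks first would misreport UTF-32LE input as UTF-16LE.
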
@@ -227,18 +245,18 @@ switch(bytes_enc)

---
### Decoding a single codepoint
- Function
- `extract_codepoint<Enc>(…)`
- Inputs
Extracts a codepoint from a string of raw bytes interpreted with encoding `Enc`,
updating the current position that points to the data.

extract_codepoint<Enc>(…)

- *Inputs*
- `std::string_view` raw bytes encoded as `Enc`
- `std::size_t&` current position
- Preconditions
- *Preconditions*
- Assumes enough remaining bytes to extract the codepoint, undefined behavior otherwise
- Return value
- *Return value*
- `char32_t` extracted codepoint, `codepoint::invalid` in case of decoding errors
- Description
- Extracts a codepoint from a string of raw bytes interpreted with encoding `Enc`,
updating the current position that points to the data.
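
As an illustration of the "extract and advance" contract, here is a standalone sketch for UTF-8 only. It is not the library's code: `extract_utf8_sketch` and `invalid_cp` are hypothetical names, and the sentinel value U+FFFD is an assumption. Like the function described above, it assumes enough remaining bytes; it also skips the overlong and surrogate checks a complete decoder would need.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical sentinel for decoding errors (the actual value of codepoint::invalid is not shown here)
inline constexpr char32_t invalid_cp = U'\uFFFD';

// Decodes one UTF-8 sequence starting at pos and advances pos past the bytes consumed
// Precondition: enough bytes remain for the sequence announced by the lead byte
char32_t extract_utf8_sketch(std::string_view bytes, std::size_t& pos)
{
    const auto byte_at = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };

    const unsigned char lead = byte_at(pos++);
    if (lead < 0x80) return lead; // single-byte (ASCII) sequence

    std::size_t extra = 0; // number of continuation bytes expected
    char32_t cp = 0;
    if ((lead & 0xE0) == 0xC0)      { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else return invalid_cp; // stray continuation byte or invalid lead byte

    for (; extra > 0; --extra)
    {
        const unsigned char next = byte_at(pos);
        if ((next & 0xC0) != 0x80) return invalid_cp; // not a continuation byte
        cp = (cp << 6) | (next & 0x3F);
        ++pos;
    }
    return cp;
}
```

Calling it repeatedly with the same `pos` variable walks through the string one codepoint at a time.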

[example](https://gcc.godbolt.org/z/4a3WGee5c)

@@ -253,15 +271,14 @@ assert( cp == U'🔥' );
---
### Encoding a single codepoint
- Function
- `append_codepoint<Enc>(…)`
- Inputs
Appends a codepoint to a given string of bytes using encoding `Enc`
append_codepoint<Enc>(…)
- *Input*
- `char32_t` codepoint to encode
- *Output*
- `std::string&` destination bytes encoded as `Enc`
- Return value
- `void`
- Description
- Appends a codepoint to a given string of bytes using encoding `Enc`
[example](https://gcc.godbolt.org/z/YW1h643nW)
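
What appending a codepoint involves can be shown with a standalone UTF-16BE sketch. It is not the library's code: `append_utf16be_sketch` and its parameter order are hypothetical, and no validation of the codepoint is done. Codepoints above U+FFFF are split into a surrogate pair, which is how U+1F525 (🔥) becomes the bytes `D8 3D DD 25` quoted earlier in this readme.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends cp to out as UTF-16 big endian (hypothetical helper, no validation of cp)
void append_utf16be_sketch(char32_t cp, std::string& out)
{
    // Writes one 16-bit code unit, most significant byte first
    const auto put_unit = [&out](std::uint16_t unit)
    {
        out.push_back(static_cast<char>(unit >> 8));
        out.push_back(static_cast<char>(unit & 0xFF));
    };

    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: a single code unit
        put_unit(static_cast<std::uint16_t>(cp));
    }
    else
    {
        // Supplementary planes: high surrogate, then low surrogate
        const std::uint32_t v = static_cast<std::uint32_t>(cp) - 0x10000;
        put_unit(static_cast<std::uint16_t>(0xD800 + (v >> 10)));
        put_unit(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));
    }
}
```

Appending into a caller-owned string avoids allocating a temporary for every codepoint when encoding a whole text.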