Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add icu4x-key-extract for Static Data Slicing #1460

Closed
wants to merge 11 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 10 additions & 5 deletions .github/workflows/build-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -231,6 +231,10 @@ jobs:

steps:
- uses: actions/checkout@v2
- name: Install Node.js v14.17.0
uses: actions/setup-node@v1
with:
node-version: 14.17.0
- name: Load nightly Rust toolchain for WASM.
run: |
rustup install nightly-2021-12-22
Expand Down Expand Up @@ -320,15 +324,16 @@ jobs:
with:
command: make
args: wasm-release
- name: Install Node.js v14.17.0
uses: actions/setup-node@v1
with:
node-version: 14.17.0
- name: Build
- name: Test
uses: actions-rs/cargo@v1.0.1
with:
command: make
args: wasm-test-release
- name: Build Examples and test icu4x-key-extract
uses: actions-rs/cargo@v1.0.1
with:
command: make
args: wasm-compare-worklog-keys
# This has to be a separate test since the emscripten sdk
# will otherwise interfere with other node-using tests
- name: Run emscripten test
Expand Down
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ dhat-heap.json

# Do not check-in bincode test data
provider/testdata/data/bincode
tools/datagen/tests/testdata/work_log_bincode

# Ignore irrelevant files that get generated on macOS
**/.DS_Store
Expand Down
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 3 additions & 1 deletion Makefile.toml
Original file line number Diff line number Diff line change
Expand Up @@ -48,7 +48,7 @@ dependencies = [
"test-all-features",
"test-docs-default",
"test-docs",
"testdata-build-bincode-all",
"testdata-build-worklog-bincode",
"testdata-check",
]

Expand Down Expand Up @@ -91,6 +91,8 @@ dependencies = [
# we have to set up the environment for the emscripten job separately
# Instead, each of these is called individually.
"wasm-release",
"wasm-test-release",
"wasm-compare-worklog-keys",
"wasm-cpp-emscripten",
]

Expand Down
86 changes: 86 additions & 0 deletions provider/core/src/extract.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/main/LICENSE ).

//! Utilities for extracting `ResourceKey` objects from a byte stream. Requires the "std" feature.

use crate::prelude::*;
use std::io;
use std::io::BufRead;
use std::io::BufReader;

/// Extracts [`ResourceKey`] objects from a byte stream.
///
/// This function looks for all occurrences of the `repr(C)` of `ResourceKey` in the byte stream,
/// and reads them back as Rust objects.
///
/// The byte stream is often the output of a compiler, like a WASM file or an x86 executable.
///
/// To run this function as a command-line tool, use `icu4x-key-extract`.
pub fn extract_keys_from_byte_stream(stream: impl io::Read) -> io::Result<Vec<ResourceKey>> {
let mut reader = BufReader::with_capacity(1024, stream);
let mut working_buffer = [0u8; 1024 + 39];
let mut output = Vec::new();
loop {
let reader_buffer = reader.fill_buf()?;
if reader_buffer.is_empty() {
break;
}
let len = reader_buffer.len();
// Save 39 bytes from iteration to iteration: one less than a 40-byte window
working_buffer[39..(39 + len)].copy_from_slice(reader_buffer);
for window in working_buffer[..(39 + len)].windows(40) {
if &window[0..8] == b"ICURES[[" && &window[36..40] == b"]]**" {
let mut bytes: [u8; 40] = [0; 40];
bytes.copy_from_slice(window);
let resc_key = match ResourceKey::from_repr_c(&bytes) {
Some(k) => k,
None => continue,
};
output.push(resc_key);
}
}
reader.consume(len);
working_buffer.copy_within(len.., 0);
}
output.sort();
output.dedup();
Ok(output)
}

#[cfg(test)]
mod test {
use super::*;
use crate::resource_key;

const GOLDEN_BYTES: &[u8] = b"\x00\x00ICURES[[\x02\x00\x00\x00\x00\x00\x00\x00skeletons\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00]]**ICURES[[\x02\x00\x00\x00\x00\x00\x00\x00symbols\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00]]**\x00\x00";

#[test]
fn test_extract_golden() {
let keys = extract_keys_from_byte_stream(&*GOLDEN_BYTES).unwrap();
assert_eq!(
keys,
vec![
resource_key!(DateTime, "skeletons", 1),
resource_key!(DateTime, "symbols", 1),
]
);
}

#[test]
fn test_extract_large() {
let keys: Vec<ResourceKey> = (0u8..=255u8)
.map(|i| resource_key!(Core, "demo", i))
.collect();
let mut buffer: Vec<u8> = Vec::new();
for key in keys.iter() {
// Insert some garbage
buffer.extend(b"\x00ICURES[[\x00\x00]]**\x00\x00");
// This is safe because we are transmuting to a POD type
let key_bytes: [u8; 40] = unsafe { core::mem::transmute(*key) };
buffer.extend(&key_bytes);
}
let extracted_keys = extract_keys_from_byte_stream(&*buffer).unwrap();
assert_eq!(keys, extracted_keys);
}
}
2 changes: 2 additions & 0 deletions provider/core/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,8 @@ mod error;
#[macro_use]
pub mod erased;
pub mod export;
#[cfg(feature = "std")]
pub mod extract;
pub mod filter;
pub mod hello_world;
pub mod inv;
Expand Down
81 changes: 67 additions & 14 deletions provider/core/src/resource.rs
Original file line number Diff line number Diff line change
Expand Up @@ -15,12 +15,13 @@ use core::default::Default;
use core::fmt;
use core::fmt::Write;
use icu_locid::LanguageIdentifier;
use tinystr::{TinyStr16, TinyStr4};
use tinystr::{tinystr4, tinystr8, TinyStr16, TinyStr4, TinyStr8};
use writeable::{LengthHint, Writeable};

/// A top-level collection of related resource keys.
#[non_exhaustive]
#[derive(PartialEq, Eq, PartialOrd, Ord, Copy, Clone, Debug)]
#[repr(C)]
pub enum ResourceCategory {
Core,
Calendar,
Expand Down Expand Up @@ -79,10 +80,62 @@ impl writeable::Writeable for ResourceCategory {
///
/// Use [`resource_key!`] as a shortcut to create resource keys in code.
#[derive(PartialEq, Eq, PartialOrd, Ord, Copy, Clone)]
#[repr(C)]
pub struct ResourceKey {
_tag0: TinyStr8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thought: not a huge fan of this, but I guess it's no big deal and it does seem to be the best way to achieve this

pub category: ResourceCategory,
pub sub_category: TinyStr16,
pub version: u16,
pub version: u8,
_tag1: TinyStr4,
}
Comment on lines +83 to +90
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to put these changes behind a feature? If this is going to be transparent to the client, we can also set the feature correctly and avoid the tags in production code.


impl ResourceKey {
/// Creates a new [`ResourceKey`].
pub const fn new(
category: ResourceCategory,
sub_category: TinyStr16,
version: u8,
) -> ResourceKey {
ResourceKey {
_tag0: tinystr8!("ICURES[["),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this going to work for a BE binary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will work so long as the fields are endian-agnostic. I changed the version field to be u8 instead of u16 to address this problem. The only field I'm not sure about is ResourceCategory, but I left a TODO to fix that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For TinyStr in particular, there's a conversation about it. Short answer is that it actually is endian-safe.

zbraniecki/tinystr#45

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ResourceCategory it will be fine as long as it's repr(u8).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ResourceCategory is dataful, which is why it is hard. The other problem is that I intend to remove ResourceCategory, so I don't want to spend a lot of time solving it. Also, even if ResourceCategory were repr(u8), it's still not safe to transmute, because the u8 could be out-of-range.

What do you suggest as the short-term solution?

  1. Check in this unsafe code first and remove ResourceCategory later
  2. Remove ResourceCategory first before merging this PR
  3. Make ResourceCategory safe, and then delete that code when I remove ResourceCategory

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So tinystr8!("foo") has the same bit pattern whether it's compiled for LE or BE?

Regarless of the answer to that question, I think it would be better to use byte arrays (and byte string literals). They are a primitive, are endian-agnostic, and don't require a future reader to be familiar with tinystr's implementation details.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I did not realize ResourceCategory is dataful.

I'm okay with having this code for now but I'd prioritize fixing this up.

@robertbastian yes, because it has the same bit pattern as the string, and strings are endianness-independent. In fact I've thought about making tinystr internally use byte arrays in the first place.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think it overcomplicates things to write a TinyStr but then compare to a byte array.

category,
sub_category,
version,
_tag1: tinystr4!("]]**"),
}
}

/// Recovers a [`ResourceKey`] from its `repr(C)` bytes.
///
/// Returns `None` if the bytes are not a valid [`ResourceKey`].
///
/// # Examples
///
/// ```
/// use icu_provider::prelude::*;
///
/// let demo_key = icu_provider::resource_key!(Core, "demo", 1);
/// // This is safe because we are transmuting to a POD type
/// let repr_c_bytes: [u8; 40] = unsafe {
/// core::mem::transmute(demo_key)
/// };
/// let recovered_key = ResourceKey::from_repr_c(&repr_c_bytes)
/// .expect("The bytes are valid");
///
/// assert_eq!(demo_key, recovered_key);
/// ```
pub fn from_repr_c(bytes: &[u8; 40]) -> Option<ResourceKey> {
// Smoke check
if &bytes[0..8] != b"ICURES[[" || &bytes[36..40] != b"]]**" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b"ICURES[[" and b"]]** are hardcoded in three places (new, from_repr_c, and extract.rs), maybe define a const?

return None;
}

// TODO(#1457): Use a ULE-like code path here.
// TODO(#1457): This is not safe!
// - We can't currently verify the ResourceCategory!
// - TinyStr does not currently have a function that takes a byte *array* (with NULs).
unsafe { Some(core::mem::transmute(*bytes)) }
}
}

/// Shortcut to construct a const resource identifier.
Expand Down Expand Up @@ -119,11 +172,11 @@ macro_rules! resource_key {
)
};
($category:expr, $sub_category:literal, $version:tt) => {
$crate::ResourceKey {
category: $category,
sub_category: $crate::internal::tinystr16!($sub_category),
version: $version,
}
$crate::ResourceKey::new(
$category,
$crate::internal::tinystr16!($sub_category),
$version,
)
};
}

Expand Down Expand Up @@ -433,20 +486,20 @@ mod tests {
expected: "core/cardinal@1",
},
KeyTestCase {
resc_key: ResourceKey {
category: ResourceCategory::PrivateUse(tinystr4!("priv")),
sub_category: tinystr::tinystr16!("cardinal"),
version: 1,
},
resc_key: ResourceKey::new(
ResourceCategory::PrivateUse(tinystr4!("priv")),
tinystr::tinystr16!("cardinal"),
1,
),
expected: "x-priv/cardinal@1",
},
KeyTestCase {
resc_key: resource_key!(Core, "maxlengthsubcatg", 1),
expected: "core/maxlengthsubcatg@1",
},
KeyTestCase {
resc_key: resource_key!(Core, "cardinal", 65535),
expected: "core/cardinal@65535",
resc_key: resource_key!(Core, "cardinal", 255),
expected: "core/cardinal@255",
},
]
}
Expand Down
6 changes: 4 additions & 2 deletions provider/core/tests/sizes.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,9 @@
use icu_provider::prelude::*;
use static_assertions::const_assert_eq;

const_assert_eq!(8, core::mem::size_of::<tinystr::TinyStr8>());
const_assert_eq!(8, core::mem::size_of::<ResourceCategory>());
const_assert_eq!(16, core::mem::size_of::<tinystr::TinyStr16>());
const_assert_eq!(4, core::mem::size_of::<u32>());
const_assert_eq!(32, core::mem::size_of::<ResourceKey>());
const_assert_eq!(1, core::mem::size_of::<u8>());
const_assert_eq!(4, core::mem::size_of::<tinystr::TinyStr4>());
const_assert_eq!(40, core::mem::size_of::<ResourceKey>());
5 changes: 5 additions & 0 deletions tools/datagen/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,7 @@ all-features = true
[dependencies]
eyre = "0.6"
clap = "2.33"
either = "1.6"
futures = "0.3"
icu_locid = { version = "0.4", path = "../../components/locid", features = ["std"]}
icu_properties = { version = "0.4", path = "../../components/properties", features = ["std"]}
Expand All @@ -52,3 +53,7 @@ path = "src/bin/datagen.rs"
[[bin]]
name = "icu4x-testdata-download"
path = "src/bin/testdata-download.rs"

[[bin]]
name = "icu4x-key-extract"
path = "src/bin/key-extract.rs"
37 changes: 19 additions & 18 deletions tools/datagen/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,44 +6,45 @@ The tools include:

1. `icu4x-datagen`: Read source data (CLDR JSON) and dump ICU4X-format data.
2. `icu4x-testdata-download`: Download fresh CLDR JSON for testdata.
3. `icu4x-key-extract`: Extract `ResourceKey` objects present in a compiled executable.

More details on each tool can be found by running `--help`.

## Examples

Generate ICU4X JSON file tree:
Generate ICU4X Postcard blob (single file) for all keys and all locales:

```bash
# Run from the icu4x project folder
$ cargo run --bin icu4x-datagen -- \
--cldr-tag 39.0.0 \
--all-keys \
--all-locales \
--out /tmp/icu4x_data/json
--cldr-tag 39.0.0 \
--all-keys \
--all-locales \
--format blob \
--out /tmp/icu4x_data/icu4x_data.postcard
```

Generate ICU4X Postcard blob (single file):
Extract the keys from an executable into a key file:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "Extract the keys used by an executable into a key file"


```bash
# Run from the icu4x project folder
$ cargo run --bin icu4x-datagen -- \
--cldr-tag 39.0.0 \
--all-keys \
--all-locales \
--format blob \
--out /tmp/icu4x_data/icu4x_data.postcard
$ cargo build --example work_log --release
$ cargo run --bin icu4x-key-extract -- \
-i target/release/examples/work_log \
-o /tmp/icu4x_data/work_log+keys.txt
$ cat /tmp/icu4x_data/work_log+keys.txt
```

Generate ICU4X Bincode file tree:
Generate ICU4X JSON file tree from the key file for Spanish and German:

```bash
# Run from the icu4x project folder
$ cargo run --bin icu4x-datagen -- \
--cldr-tag 39.0.0 \
--all-keys \
--all-locales \
--syntax bincode \
--out /tmp/icu4x_data/bincode
--cldr-tag 39.0.0 \
--key-file /tmp/icu4x_data/work_log+keys.txt \
--locales es \
--locales de \
--out /tmp/icu4x_data/work_log_json
```

## More Information
Expand Down
Loading