Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement icu4x-key-extract by string tagging #1480

Merged
merged 7 commits into from
Jan 23, 2022

Conversation

robertbastian
Copy link
Member

@robertbastian robertbastian commented Jan 7, 2022

@jira-pull-request-webhook
Copy link

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@sffc sffc mentioned this pull request Jan 15, 2022
@sffc
Copy link
Member

sffc commented Jan 15, 2022

This is a clever approach! However, there are two issues:

  1. It ties us to using a macro to generate the ResourceKey, since concat! requires string literals, and we don't have string literals inside const functions. I have long wanted to replace the macro with a const function and I am finally doing it in Re-write ResourceKey #1511.
  2. This approach amounts to duplicating the long form of the ResourceKey string in the executable, which increases code size. It is behind a feature flag, but ideally clients don't need to recompile their app with different settings in order to run the tool.

@robertbastian
Copy link
Member Author

The idea for 2 was something like

struct ResourceKey {
  tag_and_human: &'static str,
  pub machine: Something,
}

impl ResourceKey {
  fn human(&self) -> &str {
    &self.tag_and_human[6..]
  }
}

But yes, this requires a macro, but why is that undesirable?

@sffc
Copy link
Member

sffc commented Jan 16, 2022

We cannot get around the requirement to have a public const function constructor, because macros are codegen at the call site, and they need to be able to call a function in order to construct the ResourceKey. We could still choose to have a macro if we feel that it increases ergonomics, but we need to enforce invariants within the const function.

So, basically, we could potentially make a ResourceKey constructor that looks like

impl ResourceKey {
    /// Returns an error if path_with_tag does not begin with "\nicu4x_key_tag" and does not end with "\n"
    pub const fn try_new(path_with_tag: &str) -> Result<Self, ()> {
        // validate that path_with_tag is in the correct syntax
        // ...
    }
}

And then the macro could use concat! to put together the path_with_tag.

If we go with the approach I've proposed in #1511, the const function takes a human-readable path with no requirement that it also include the tag, and the tag is purely internal to the code with the repr(C) trick.

I do have some worry that if we make this revolve around the human-readable tag, we could end up with false positives. I think false positives are less likely to happen with the repr(C) approach. However, I could be wrong.

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • provider/core/src/hello_world.rs is no longer changed in the branch
  • provider/core/src/lib.rs is different
  • provider/core/src/resource.rs is different
  • provider/core/tests/sizes.rs is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • provider/core/src/resource.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • tools/scripts/data.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • .gitignore is now changed in the branch
  • Makefile.toml is now changed in the branch
  • tools/datagen/README.md is now changed in the branch
  • tools/datagen/src/bin/datagen.rs is now changed in the branch
  • tools/datagen/src/main.rs is now changed in the branch
  • tools/datagen/tests/testdata/work_log+keys.txt is now changed in the branch
  • tools/scripts/data.toml is different
  • tools/scripts/wasm.toml is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • tools/scripts/data.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@robertbastian robertbastian changed the title Alternative key extract Implement icu4x-key-extract by string tagging Jan 20, 2022
Comment on lines 82 to 88
// unsafe {
// // Safe due to invariant that self.path is tagged correctly
// core::str::from_utf8_unchecked(core::slice::from_raw_parts(
// self.path.as_ptr().add(leading_tag!().len()),
// self.path.len() - trailing_tag!().len() - leading_tag!().len(),
// ))
// }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah let's go with this

example::ResourceKey::get_path:
        mov     rax, qword ptr [rdi]
        mov     rdx, qword ptr [rdi + 8]
        add     rax, 14
        add     rdx, -15
        ret

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness: self.path

example::ResourceKey::get_path:
        mov     rax, qword ptr [rdi]
        mov     rdx, qword ptr [rdi + 8]
        ret

I think we can live with two extra adds.

// self.path.len() - trailing_tag!().len() - leading_tag!().len(),
// ))
// }
&self.path[leading_tag!().len()..self.path.len() - trailing_tag!().len()]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is indeed very panicky

example::ResourceKey::get_path:
        push    rax
        mov     rax, qword ptr [rdi]
        mov     rsi, qword ptr [rdi + 8]
        lea     rcx, [rsi - 1]
        cmp     rcx, 14
        jb      .LBB0_4
        cmp     rsi, 15
        jb      .LBB0_4
        cmp     byte ptr [rax + 14], -64
        jl      .LBB0_4
        cmp     byte ptr [rax + rcx], -65
        jle     .LBB0_4
        add     rax, 14
        add     rsi, -15
        mov     rdx, rsi
        pop     rcx
        ret
.LBB0_4:
        lea     r8, [rip + .L__unnamed_1]
        mov     edx, 14
        mov     rdi, rax
        call    qword ptr [rip + core::str::slice_error_fail@GOTPCREL]
        ud2

.L__unnamed_2:
        .ascii  "/app/example.rs"

.L__unnamed_1:
        .quad   .L__unnamed_2
        .asciz  "\017\000\000\000\000\000\000\000\"\000\000\000\n\000\000"

@robertbastian robertbastian marked this pull request as ready for review January 20, 2022 19:09
@robertbastian robertbastian requested review from Manishearth and a team as code owners January 20, 2022 19:09
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • tools/datagen/README.md is different
  • tools/datagen/src/main.rs is different
  • tools/scripts/data.toml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is missing the actual icu4x-keyextract binary. Do you plan to add it?

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • .github/workflows/build-test.yml is now changed in the branch
  • provider/core/src/lib.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@robertbastian
Copy link
Member Author

The key extract 'binary' is here

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • provider/core/src/resource.rs is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

sffc
sffc previously approved these changes Jan 21, 2022
Copy link
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, and thank you for addressing all my concerns!

Manishearth
Manishearth previously approved these changes Jan 21, 2022
@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • .github/workflows/build-test.yml is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@jira-pull-request-webhook
Copy link

Notice: the branch changed across the force-push!

  • .github/workflows/build-test.yml is no longer changed in the branch
  • .gitignore is no longer changed in the branch
  • Makefile.toml is no longer changed in the branch
  • tools/datagen/README.md is no longer changed in the branch
  • tools/datagen/src/bin/datagen.rs is no longer changed in the branch
  • tools/datagen/src/main.rs is no longer changed in the branch
  • tools/datagen/tests/testdata/work_log+keys.txt is no longer changed in the branch
  • tools/scripts/wasm.toml is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

echo "datetime/symbols@1" >> ${2}
echo "plurals/ordinal@1" >> ${2}
"""
script = '''strings ${1} | grep -e "icu4x_key_tag\w" | sed s/icu4x_key_tag// | sort > ${2}'''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought: Although using strings is handy, we should still offer a cross-platform solution I think. Opinions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how complex strings is, i.e. it can probably handle many different binary formats etc. Could be hard to match its sophistication.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you verify manually that it works for at least x86_64 (elf) and wasm files?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you verify manually that it works for at least x86_64 (elf) and wasm files?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I manually verified ELF, and WASM I believe as well, but that's covered by the CI test anyway.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's this https://docs.microsoft.com/en-us/sysinternals/downloads/strings. The piping, grep, and sed probably won't work on windows though...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes me uneasy to think that a pivotal part of the ICU4X data pipeline rests on piping strings to sed. I'd rather have a tool that takes an executable and returns the key list.

(My approval stands; please file a follow up to discuss further)

@Manishearth Manishearth removed their request for review January 21, 2022 21:15
@robertbastian robertbastian merged commit 1b159c6 into unicode-org:main Jan 23, 2022
@robertbastian robertbastian deleted the keys branch January 23, 2022 17:53
gnrunge pushed a commit to gnrunge/icu4x that referenced this pull request Jan 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Build tool for static code analysis for data slicing Implement static data slicing
3 participants