Add new nvtext::normalize_characters API (#17818)
Adds new normalizer APIs as part of the rework for the subword-tokenizer.

The new API is split into 2 parts. First, a normalizer object is created with the appropriate state: lower-case and special-tokens. The normalizing tables are currently hardcoded inside libcudf. Future versions may load these tables from some other source. The 2nd API is given the input strings column and the normalizer object and returns a normalized strings column. The normalizer object can be reused on all subsequent `normalize_characters` calls.

The current `nvtext::normalize_characters` loads the normalizing tables on each call, which can add significant overhead. This API will be deprecated and replaced by these 2 new ones. Some utility functions from that implementation have been refactored to be used by both until the old one is removed.

The first API creates the normalizer object.

```cpp
std::unique_ptr<character_normalizer> create_character_normalizer(
  bool do_lower_case,
  cudf::strings_column_view const& special_tokens,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
```

The 2nd API uses the normalizer on a strings column:

```cpp
std::unique_ptr<cudf::column> normalize_characters(
  cudf::strings_column_view const& input,
  character_normalizer const& normalizer,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
```

Using the python interface:

```python
import cudf
from cudf.core.character_normalizer import CharacterNormalizer

cn = CharacterNormalizer(do_lower=False)
sn = cn.normalize(input_strings)
```

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Tianyu Liu (https://github.com/kingcrimsontianyu)
- Karthikeyan (https://github.com/karthikeyann)
- Matthew Murray (https://github.com/Matt711)

URL: #17818
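The create-once, reuse-many design described above can be sketched in plain Python without a GPU. This is a hypothetical stand-in, not libcudf or cudf code: the class name `SketchNormalizer`, its `build_count` counter, and the rules it applies (lower-casing, padding ASCII punctuation with spaces, leaving listed special tokens untouched) are assumptions chosen for illustration and do not reproduce the hardcoded normalizing tables.

```python
import unicodedata


class SketchNormalizer:
    """Hypothetical pure-Python stand-in for the reusable normalizer object."""

    def __init__(self, do_lower=True, special_tokens=()):
        self.do_lower = do_lower
        self.special_tokens = frozenset(special_tokens)
        self.build_count = 0  # counts how often the table is built
        self._table = self._build_table()

    def _build_table(self):
        # Built once at construction; stands in for the costly table load.
        self.build_count += 1
        table = {}
        for cp in range(128):
            ch = chr(cp)
            if unicodedata.category(ch).startswith("P"):
                table[cp] = f" {ch} "  # pad ASCII punctuation with spaces
        return table

    def normalize(self, strings):
        # Reuses the prebuilt table; no per-call setup.
        result = []
        for s in strings:
            words = []
            for word in s.split():
                if word in self.special_tokens:
                    words.append(word)  # special tokens pass through as-is
                    continue
                if self.do_lower:
                    word = word.lower()
                words.append(word.translate(self._table))
            # Collapse the extra spaces introduced by the padding.
            result.append(" ".join(" ".join(words).split()))
        return result


cn = SketchNormalizer(do_lower=True, special_tokens=["[CLS]"])
first = cn.normalize(["[CLS] Hello, World!"])
second = cn.normalize(["Reuse the SAME object."])
```

Both `normalize` calls share the table built in the constructor, which is the overhead the old single-call API pays on every invocation.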
1 parent e365986, commit 18a5412
Showing 16 changed files with 1,018 additions and 154 deletions.