-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support for add special tokens via cli args #473
base: main
Are you sure you want to change the base?
feat: Support for add special tokens via cli args #473
Conversation
Thanks for making a pull request! 😃 |
f61d83f
to
db4a060
Compare
surely there must be some unit tests we can add for this ? @Abhishek-TAMU @willmj can you guide if you see feasible unit tests? |
maybe after this ongoing refactor is done https://github.com/foundation-model-stack/fms-hf-tuning/pull/475/files , unit tests can be added to this to at least ensure all special tokens passed as args are present in special_tokens_dict then other set of unit tests can ensure whatever is in special tokens dict is added to tokenizer |
Yes, in this refactor PR, this could be taken care of.
For this, unit test case could be added now where reference could be taken from this unit test case where |
Also fix this @YashasviChaurasia
|
db4a060
to
cdde1d0
Compare
Signed-off-by: yashasvi <yashasvi@ibm.com>
cdde1d0
to
0b27853
Compare
Description of the change
Adds support for add_special_tokens to tokenizer's vocabulary via cli args
data:image/s3,"s3://crabby-images/a55e4/a55e47ff6394c0fb0304589d50c9ed0b5ad49256" alt="image"
List of Special Tokens can be passed as follows:
Related issue number
Partially Solves #470 , Support for Reserved Special Tokens is not supported yet
How to verify the PR
Run a small sample training job with
--add_special_tokens
flagonce the training is completed load the trained model and check if the special token is tokenized properly as a single token/ or if it is part of the tokenizer's vocabulary
Was the PR tested
Debugging was done to ensure the updated tokens are reflected in the Tokenizer Vocab.
data:image/s3,"s3://crabby-images/e3d97/e3d97ee9fa3feee293de5fbf4a47cbcf78d72832" alt="image"