
feat: Support for add special tokens via cli args #473

Open · wants to merge 1 commit into main from support_add_special_tokens

Conversation

@YashasviChaurasia (Contributor) commented Feb 18, 2025

Description of the change

Adds support for adding special tokens to the tokenizer's vocabulary via the --add_special_tokens CLI argument. A list of special tokens can be passed as shown in the verification command below.
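For reference, a minimal sketch of how such a flag could plug into the Hugging Face tokenizer API; this is an assumption about the wiring, not the PR's actual code, and the function name is illustrative:

    # Illustrative sketch only; not the PR's actual code.
    from transformers import AutoTokenizer

    def apply_special_tokens(tokenizer, special_tokens):
        # "additional_special_tokens" is the standard Hugging Face key for
        # user-defined special tokens; add_special_tokens() returns the
        # number of tokens actually added to the vocabulary.
        return tokenizer.add_special_tokens(
            {"additional_special_tokens": special_tokens}
        )

    tokenizer = AutoTokenizer.from_pretrained("/your/model")
    num_added = apply_special_tokens(tokenizer, ["<|assistant|>", "<|user|>"])

Note that when tokens are added this way, the model's embedding matrix must also be resized (model.resize_token_embeddings(len(tokenizer))) before training; whether the PR does this is not shown in the excerpt.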

Related issue number

Partially solves #470; support for reserved special tokens is not included yet.

How to verify the PR

Run a small sample training job with the --add_special_tokens flag:

python tuning/sft_trainer.py --include_tokens_per_second="true" \
    --learning_rate="1e-05" \
    --logging_steps="1" \
    --logging_strategy="steps" \
    --max_steps="100" \
    --model_name_or_path="/your/model" \
    --output_dir="/output/directory" \
    --per_device_train_batch_size="1" \
    --save_steps="50" \
    --save_strategy="steps" \
    --add_special_tokens "<|assistant|>" "<|user|>" \
    --use_flash_attn=false \
    --training_data_path="/dataset"

Once training is complete, load the trained model and check that each special token is tokenized as a single token and is present in the tokenizer's vocabulary.
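For example, a quick check could look like this (the path is a placeholder for the output directory above):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("/output/directory")

    # The special token should be present in the vocabulary...
    assert "<|assistant|>" in tokenizer.get_vocab()

    # ...and should encode to a single token id rather than being split.
    ids = tokenizer("<|assistant|>", add_special_tokens=False)["input_ids"]
    assert len(ids) == 1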

Was the PR tested

Debugging was done to ensure the updated tokens are reflected in the tokenizer vocabulary.

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass


Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@YashasviChaurasia (Contributor, Author) commented:

@kmehant

@YashasviChaurasia YashasviChaurasia changed the title Support for add special tokens via cli args feat: Support for add special tokens via cli args Feb 18, 2025
@github-actions github-actions bot added the feat label Feb 18, 2025
@YashasviChaurasia YashasviChaurasia force-pushed the support_add_special_tokens branch 3 times, most recently from f61d83f to db4a060 Compare February 18, 2025 10:21
@Ssukriti (Collaborator) commented:

Surely there must be some unit tests we can add for this? @Abhishek-TAMU @willmj can you guide if you see feasible unit tests?

@Ssukriti (Collaborator) commented:

Maybe after the ongoing refactor (https://github.com/foundation-model-stack/fms-hf-tuning/pull/475/files) is done, unit tests can be added to at least ensure that all special tokens passed as args are present in special_tokens_dict.

Then another set of unit tests can ensure that whatever is in the special tokens dict is added to the tokenizer.

@Abhishek-TAMU (Collaborator) commented:

> unit tests can be added to this to at least ensure all special tokens passed as args are present in special_tokens_dict

Yes, this could be taken care of in the refactor PR.

> then other set of unit tests can ensure whatever is in special tokens dict is added to tokenizer

For this, a unit test could be added now, modeled on the existing test where sft_trainer.train() is called; after _validate_training/_test_run_inference, the saved model in tempdir could be loaded with AutoTokenizer and checked for the special tokens.
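A rough sketch of such a test. This standalone version only exercises save/reload of the tokenizer; the real test would first run sft_trainer.train() as described above, and the tiny test model used here is a placeholder choice:

    import tempfile

    from transformers import AutoTokenizer

    def test_added_special_tokens_persist():
        special_tokens = ["<|assistant|>", "<|user|>"]
        tokenizer = AutoTokenizer.from_pretrained(
            "hf-internal-testing/tiny-random-LlamaForCausalLM"
        )
        tokenizer.add_special_tokens(
            {"additional_special_tokens": special_tokens}
        )
        with tempfile.TemporaryDirectory() as tempdir:
            tokenizer.save_pretrained(tempdir)
            reloaded = AutoTokenizer.from_pretrained(tempdir)
            for token in special_tokens:
                # Each token must survive save/reload in the vocabulary
                # and encode to exactly one token id.
                assert token in reloaded.get_vocab()
                ids = reloaded(token, add_special_tokens=False)["input_ids"]
                assert len(ids) == 1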

@kmehant (Collaborator) commented Feb 20, 2025:

Also fix this @YashasviChaurasia

tuning/sft_trainer.py:302:8: W1203: Use lazy % or .format() formatting in logging functions (logging-fstring-interpolation)
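The lint warning points at an f-string inside a logging call; the fix is to pass lazy %-style arguments instead (the message and value here are illustrative):

    import logging

    logger = logging.getLogger(__name__)
    num_added = 2  # illustrative value

    # Flagged by W1203: the f-string is formatted even when the log level
    # would discard the record.
    logger.info(f"Added {num_added} special tokens")

    # Preferred: lazy %-formatting, deferred until the record is emitted.
    logger.info("Added %d special tokens", num_added)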

@YashasviChaurasia YashasviChaurasia force-pushed the support_add_special_tokens branch from db4a060 to cdde1d0 Compare February 20, 2025 12:51
Signed-off-by: yashasvi <yashasvi@ibm.com>
@YashasviChaurasia YashasviChaurasia force-pushed the support_add_special_tokens branch from cdde1d0 to 0b27853 Compare February 20, 2025 12:53