fix: utilities to post process checkpoint for LoRA #338
Conversation
Couple of comments, thanks
Thanks for making a pull request! 😃
I added a commit to add a unit test in
* get num_added_tokens
* remove extra code
Force-pushed from c941511 to 36a554c
Note these tests are failing because they are running the old unit tests (see changes in the previous commit).
"pad_token": "<pad>", | ||
} | ||
) | ||
special_tokens_dict["bos_token"] = "<s>" |
This was just done to ensure we add tokens and resize embeddings in the same function.
-    return trainer
+    additional_metadata = {}
+    additional_metadata["added_tokens_info"] = added_tokens_dict
+    return trainer, additional_metadata
Returning additional metadata from train() containing information about newly added tokens. @kmehant @ashokponkumar, let me know if this change is OK with you - it might be disruptive if you are using the SDK and relying on a single return value. We could make this a major release to clarify that the API changed.
Alternatively, if the return value is an issue, I can write the file artifacts I want inside 'output_dir' and later copy them to save_model_dir as well, and avoid returning the dict. I just thought this return value could be extended to carry any other metadata we need in the future.
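For SDK users, the impact of the changed return value would look roughly like the snippet below; the argument names and unpacking are illustrative, not taken from the PR.

```python
# Before this PR (hypothetical caller): train() returned only the trainer.
# trainer = train(model_args, data_args, training_args)

# After this PR, train() returns a tuple, so callers relying on a single
# return value need to unpack it.
trainer, additional_metadata = train(model_args, data_args, training_args)
added_tokens_info = additional_metadata["added_tokens_info"]
```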
If we remove custom addition of tokens from fms-hf-tuning, we might not need to return this added tokens metadata.
@@ -0,0 +1,29 @@
+{
These files are all just dummy LoRA artifacts needed for unit tests
Great work all! A few questions and changes; thanks for adding docs! I know you talked about adding some unit tests for tokenizer_data_utils, which would be good as well.
@@ -44,3 +52,4 @@ def tokenizer_and_embedding_resize(

     input_embeddings[-num_new_tokens:] = input_embeddings_avg
     output_embeddings[-num_new_tokens:] = output_embeddings_avg
+    return {"num_new_tokens": num_new_tokens, "new_embedding_size": embedding_size}
Just to verify, new_embedding_size could be:
- the new size of the embeddings after new tokens are added,
- the new size of the embeddings after resizing to multiple_of,
- or the same size, if neither of the above occurs
It's the size of the embeddings at the end of tuning, whichever case applies. It will be helpful for deciphering weight names based on size.
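For context, here is a minimal sketch of how a resize helper could produce these two values using the transformers tokenizer/model API; the signature, the multiple_of handling, and the mean-initialization of new rows are assumptions based on this thread, not the exact function in the PR.

```python
def tokenizer_and_embedding_resize(special_tokens_dict, tokenizer, model, multiple_of=1):
    """Sketch: add special tokens, resize embeddings, and report what changed."""
    num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
    # Resize (optionally padding the vocab dimension to a multiple) regardless of
    # whether tokens were added, so the final size reflects any multiple_of padding.
    model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=multiple_of)
    embedding_size = model.get_input_embeddings().weight.shape[0]

    if num_new_tokens > 0:
        input_embeddings = model.get_input_embeddings().weight.data
        output_embeddings = model.get_output_embeddings().weight.data
        # Initialize the new rows with the mean of the pre-existing embeddings.
        input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
        input_embeddings[-num_new_tokens:] = input_embeddings_avg
        output_embeddings[-num_new_tokens:] = output_embeddings_avg

    # new_embedding_size is simply the embedding size at the end of this step:
    # grown by new tokens, padded to multiple_of, or unchanged.
    return {"num_new_tokens": num_new_tokens, "new_embedding_size": embedding_size}
```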
"w", | ||
encoding="utf-8", | ||
) as f: | ||
json.dump(additional_train_info["added_tokens_info"], f) |
Should we check that num_new_tokens > 0 and otherwise skip writing this file? This is for the case where the model already has all the tokens it needs, so no new tokens are added and no embedding resize happens.
No, we write it in all cases to differentiate between "tuning was not done on the latest code" and "no tokens were added". The file will always exist; it will contain the value 0 if no tokens were added.
I did add a check in the post-process script to exit early if num_added_tokens == 0, though: https://github.com/foundation-model-stack/fms-hf-tuning/pull/338/files#diff-443a0a5c8690e1a9251b2805f8e5850c97381d40b47cd90c5716eeaca84706d9R61
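A rough sketch of that early exit, assuming the metadata file is named added_tokens_info.json and carries a num_new_tokens key; both names are assumptions here, not necessarily those used in the PR.

```python
import json
import os


def post_process_checkpoint(checkpoint_dir: str) -> None:
    # The tuning run always writes this metadata file (value 0 when nothing was added),
    # so its absence indicates a checkpoint produced by older code.
    info_path = os.path.join(checkpoint_dir, "added_tokens_info.json")
    with open(info_path, "r", encoding="utf-8") as f:
        added_tokens_info = json.load(f)

    if added_tokens_info.get("num_new_tokens", 0) == 0:
        # Early exit: no new tokens means the adapter needs no vLLM-specific rewriting.
        print("No new tokens were added during tuning; nothing to post-process.")
        return

    # ...otherwise split out the new-token embeddings, as sketched in the PR description below...
```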
@anhuong @willmj I have added sufficient unit tests pertaining to this change. Let's wait on @ashokponkumar to check the PR as well.
A few clarifications:
    special_tokens_dict = {}
    if not model_args.tokenizer_name_or_path:
        # TODO: understand if we need to hardcode these here or just use defaults in model
        if isinstance(tokenizer, (LlamaTokenizer, LlamaTokenizerFast)):
            tokenizer.add_special_tokens(
                {
                    "bos_token": "<s>",
                    "eos_token": "</s>",
                    "unk_token": "<unk>",
                    "pad_token": "<pad>",
                }
            )
            special_tokens_dict["bos_token"] = "<s>"
            special_tokens_dict["eos_token"] = "</s>"
            special_tokens_dict["unk_token"] = "<unk>"
            special_tokens_dict["pad_token"] = "<pad>"
        elif isinstance(tokenizer, (GPT2Tokenizer, GPTNeoXTokenizerFast)):
            tokenizer.add_special_tokens(
                {
                    "pad_token": "<pad>",
                }
            )
            special_tokens_dict["pad_token"] = "<pad>"
Sorry for my ignorance, but just wondering: why not remove this logic that is specific to the Llama or GPT2 tokenizer from fms-hf-tuning and just assume the base model already has these tokens? That would remove the need for all these additional tokens. For example, I am not sure why the unk token is being added; my understanding was that BPE-based tokenizers never end up producing unk tokens.
Or we could put this token addition behind an external flag that allows adding custom tokens to any tokenizer. That would make fms-hf-tuning more generic.
I feel there is often some confusion here. This piece of code was taken from open-instruct (https://github.com/allenai/open-instruct/blob/main/open_instruct/finetune.py#L635); if you look, they mention that although these tokens are already in the model, they are not registered as special tokens. This can have quality implications, since models treat special tokens differently from regular tokens (do not split them, etc.). So this change is just marking those tokens as special tokens rather than adding new ones.
num_added_tokens will just be the pad token.
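To illustrate the distinction above (registering tokens that already exist in the vocabulary as special tokens versus adding genuinely new tokens), a small standalone example with transformers; the model checkpoint name is only an example.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # example Llama tokenizer

# "</s>" is already in the Llama vocabulary, so registering it as a special token
# does not create a new token id; it only changes how the tokenizer treats it
# (e.g. it will not be split during tokenization).
print(tokenizer.add_special_tokens({"eos_token": "</s>"}))  # expected: 0

# A dedicated pad token is usually absent from Llama-family vocabularies, so this
# is the one addition that actually grows the vocabulary.
print(tokenizer.add_special_tokens({"pad_token": "<pad>"}))  # expected: 1
```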
A pad token always needs to be added for any model architecture - that is an open-source requirement of the SFT Trainer, without which training will not proceed. Hence we always add a pad token for all architectures here: https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/sft_trainer.py#L278 (except Granite, which already has a pad token).
So a minimum of 1 token is always added, and hence the post-processing is needed.
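A minimal sketch of that generic rule ("if pad token is None, set it"), assuming a tokenizer and model are already loaded and reusing the tokenizer_and_embedding_resize sketch above; the "<PAD>" literal and function name are illustrative, not the exact code at the linked line.

```python
def ensure_pad_token(tokenizer, model):
    """Sketch: guarantee a pad token before training, regardless of architecture."""
    special_tokens_dict = {}
    if tokenizer.pad_token is None:
        special_tokens_dict["pad_token"] = "<PAD>"  # illustrative pad string

    # For models shipped without a pad token this grows the vocabulary by at least one
    # entry, which is why LoRA checkpoint post-processing is needed even if all other
    # special-token handling were removed.
    return tokenizer_and_embedding_resize(special_tokens_dict, tokenizer=tokenizer, model=model)
```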
@ashokponkumar as explained on Slack, we cannot remove the addition of the pad token, as it is a known requirement in open source. Without it we will run into this error: https://stackoverflow.com/questions/70544129/transformers-asking-to-pad-but-the-tokenizer-does-not-have-a-padding-token. Most open-source models do not have a pad token - the new Llama 3.1, Llama 3, Llama, Allam, Mixtral, Mistral. Hence we have to add a minimum of 1 token for all of these architectures, which we do in a generic manner: 'if pad token is None, set it'. This PR is thus doing the post-processing needed to handle the addition of any token for LoRA inference on vLLM. Without this PR, LoRA inference on vLLM does not work for any of the above architectures. Even if we remove the other tokens, like unk etc. - which can be done in a follow-up PR and issue - the change is still needed for the pad token. Hence it is better to keep the change generic and return the number of added tokens from the code.
LGTM 🚀
changes addressed and reviewer not available
Description of the change
A utility function to post-process a checkpoint after LoRA tuning and convert it to the format required by vLLM.
It will need to be called at the end of LoRA tuning to allow inference on the LoRA adapter, for models to which we have added new tokens.
Since loading adapters.safetensors is fast enough, this was added as a post-processing function.
This PR adds a script that can be called after tuning to do the processing.
Related issue number
https://github.ibm.com/ai-foundation/watson-fm-stack-tracker/issues/1210
Details on vLLM issue: vllm-project/vllm#2816 (comment)
Context: Embedding vectors for new tokens need to be placed in new_embeddings.safetensors, and lm_head.weight should be deleted from adapter_model.safetensors, as per vllm-project/vllm#2816 (comment).
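For reviewers, a rough sketch of what this transformation amounts to; apart from the two file names above, the function name, tensor keys, and metadata handling below are illustrative assumptions, not the exact utility added in this PR.

```python
import os

from safetensors.torch import load_file, save_file


def post_process_lora_checkpoint(checkpoint_dir: str, num_added_tokens: int) -> None:
    """Sketch: rewrite a LoRA checkpoint with resized embeddings into the layout vLLM expects."""
    if num_added_tokens == 0:
        return  # nothing to do when no tokens were added during tuning

    adapter_path = os.path.join(checkpoint_dir, "adapter_model.safetensors")
    tensors = load_file(adapter_path)

    # Key names are illustrative; the actual keys depend on the base model and PEFT version.
    embed_key = "base_model.model.model.embed_tokens.weight"
    lm_head_key = "base_model.model.lm_head.weight"

    new_embeddings = {}
    if embed_key in tensors:
        # Keep only the rows for the newly added tokens in a separate file.
        new_embeddings["input_embeddings"] = tensors[embed_key][-num_added_tokens:].clone()
    if lm_head_key in tensors:
        new_embeddings["output_embeddings"] = tensors[lm_head_key][-num_added_tokens:].clone()
        # vLLM expects lm_head.weight to be absent from the adapter weights.
        del tensors[lm_head_key]

    save_file(new_embeddings, os.path.join(checkpoint_dir, "new_embeddings.safetensors"))
    save_file(tensors, adapter_path)
```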
How to verify the PR

Was the PR tested