-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: adding eos token to be made a flag so we don't force it on every handler #467
feat: adding eos token to be made a flag so we don't force it on every handler #467
Conversation
handler. Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
Thanks for making a pull request! 😃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thanks Dushyant
docs/advanced-data-preprocessing.md
Outdated
- `apply_custom_jinja_template`: | ||
Applies a custom jinja template (e.g., Alpaca style) to format dataset elements. | ||
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might be missing something in this PR. How can user pass the value of add_eos_token
= False, as for example here in fn_kwargs
we don't pass the value of add_eos_token
given by the user to the handler. Can I ask what am I missing ?
Probably a small test case when add_eos_token
= False in test_data_preprocessing.py would be appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Abhishek-TAMU this is only for data_config
where people can specify it and we disable adding EOS_TOKEN
to the code.
This is not for our cli based args flow for where the function you pointed out to is used which for specific instruction masking use case on a single dataset file and hence in that part of the code we making sure its already added is ensured.
We can try to add a data config test where we set the flag to false
and check the behaviour.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay I see. Yea user can pass it in fn_kwargs
in data_config.yaml
Feel free to add test case for the same.
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Added test case and modified URL of the documentation. |
template: "dataset_template" | ||
add_eos_token: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it defaults to true
do we need to add this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it here to point this config file in the documentation (As you can see the Url in documentation of this PR) for the user to know that they can use this flag to disable EOS_TOKEN
from the handler.
And additionally I am assigning it value False
in Unit test.
(train_set, _, _) = _process_dataconfig_file(data_args, tokenizer) | ||
assert isinstance(train_set, Dataset) | ||
if datasets_name == "text_dataset_input_output_masking": | ||
column_names = set(["input_ids", "attention_mask", "labels"]) | ||
assert set(train_set.column_names) == column_names | ||
print("INFO", train_set[8]["input_ids"], tokenizer.eos_token_id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove print statement here
@@ -682,39 +682,75 @@ def test_process_data_args_throws_error_where_needed(data_args, packing): | |||
|
|||
|
|||
@pytest.mark.parametrize( | |||
"data_config_path, data_path", | |||
"data_config_path, data_path, add_eos_token", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we do a separate unit test for this with add_eos_token
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good idea. Pushed the changes. Thanks!
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM .... Thanks @Abhishek-TAMU
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks Abhishek
Description of the change
Add a flag
add_eos_token
to every handler which is set toTrue
by default but can be disabled.Related issue number
How to verify the PR
Was the PR tested