Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: adding eos token to be made a flag so we don't force it on every handler #467

Merged
merged 8 commits into from
Feb 14, 2025

Conversation

dushyantbehl
Copy link
Contributor

@dushyantbehl dushyantbehl commented Feb 12, 2025

Description of the change

Add a flag add_eos_token to every handler which is set to True by default but can be disabled.

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

handler.

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the feat label Feb 12, 2025
willmj
willmj previously approved these changes Feb 12, 2025
Copy link
Collaborator

@willmj willmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thanks Dushyant

- `apply_custom_jinja_template`:
Applies a custom jinja template (e.g., Alpaca style) to format dataset elements.
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might be missing something in this PR. How can user pass the value of add_eos_token = False, as for example here in fn_kwargs we don't pass the value of add_eos_token given by the user to the handler. Can I ask what am I missing ?

Probably a small test case when add_eos_token = False in test_data_preprocessing.py would be appreciated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Abhishek-TAMU this is only for data_config where people can specify it and we disable adding EOS_TOKEN to the code.
This is not for our cli based args flow for where the function you pointed out to is used which for specific instruction masking use case on a single dataset file and hence in that part of the code we making sure its already added is ensured.

We can try to add a data config test where we set the flag to false and check the behaviour.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I see. Yea user can pass it in fn_kwargs in data_config.yaml
Feel free to add test case for the same.

dushyantbehl and others added 2 commits February 13, 2025 22:58
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
@Abhishek-TAMU
Copy link
Collaborator

Added test case and modified URL of the documentation.
@dushyantbehl @willmj

template: "dataset_template"
add_eos_token: true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it defaults to true do we need to add this here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it here to point this config file in the documentation (As you can see the Url in documentation of this PR) for the user to know that they can use this flag to disable EOS_TOKEN from the handler.
And additionally I am assigning it value False in Unit test.

(train_set, _, _) = _process_dataconfig_file(data_args, tokenizer)
assert isinstance(train_set, Dataset)
if datasets_name == "text_dataset_input_output_masking":
column_names = set(["input_ids", "attention_mask", "labels"])
assert set(train_set.column_names) == column_names
print("INFO", train_set[8]["input_ids"], tokenizer.eos_token_id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove print statement here

@@ -682,39 +682,75 @@ def test_process_data_args_throws_error_where_needed(data_args, packing):


@pytest.mark.parametrize(
"data_config_path, data_path",
"data_config_path, data_path, add_eos_token",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do a separate unit test for this with add_eos_token?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. Pushed the changes. Thanks!

Signed-off-by: Abhishek <maurya.abhishek@ibm.com>
Copy link
Contributor Author

@dushyantbehl dushyantbehl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM .... Thanks @Abhishek-TAMU

Copy link
Collaborator

@willmj willmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks Abhishek

@Abhishek-TAMU Abhishek-TAMU merged commit fb3ace8 into foundation-model-stack:main Feb 14, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants