Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add support for renaming and retaining columns in data preprocessor #466

Merged

Conversation

dushyantbehl
Copy link
Contributor

Description of the change

Add two flags to the data_config under dataset definition.

  1. rename_columns which allows users to rename columns by passing a dict
  2. retain_columns which allows users to specify which columns to retain and drop others.

This is different and outside the remove_columns argument to data handlers because people might want to use this functionality even without invoking data handlers.

This is especially needed in the case of interleaving multiple datasets with different features.

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

preprocessor.

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
Copy link

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@dushyantbehl dushyantbehl changed the title Add support for renaming and retaining columns in data preprocessor feat: Add support for renaming and retaining columns in data preprocessor Feb 12, 2025
@github-actions github-actions bot added the feat label Feb 12, 2025
Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
willmj
willmj previously approved these changes Feb 12, 2025
Copy link
Collaborator

@willmj willmj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, this might require some investigation with iterable datasets after it is merged in.

@willmj
Copy link
Collaborator

willmj commented Feb 12, 2025

Should we add one more test in tests/data/test_datapreprocessing.py to assert column names have been appropriately retained and renamed?

@dushyantbehl
Copy link
Contributor Author

This looks good to me, this might require some investigation with iterable datasets after it is merged in.

I checked the Iterable datasets API and it does contain rename and retain APIs but you can add a few tests to your PR for sanity.

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
@dushyantbehl
Copy link
Contributor Author

Should we add one more test in tests/data/test_datapreprocessing.py to assert column names have been appropriately retained and renamed?

@willmj added the tests here

Abhishek-TAMU
Abhishek-TAMU previously approved these changes Feb 13, 2025
Copy link
Collaborator

@Abhishek-TAMU Abhishek-TAMU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart for this small nit, LGTM! Thanks Dushyant!

Signed-off-by: Dushyant Behl <dushyantbehl@in.ibm.com>
@Abhishek-TAMU Abhishek-TAMU merged commit f1fd130 into foundation-model-stack:main Feb 13, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants