From a8a2baeb9b50a223e28b13c0e0fcdcaced3eda8a Mon Sep 17 00:00:00 2001
From: Will Johnson
Date: Fri, 17 Jan 2025 13:14:46 -0500
Subject: [PATCH 1/2] docs: EOS token support

Signed-off-by: Will Johnson
---
 README.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index f7e3c2c28..21b564aea 100644
--- a/README.md
+++ b/README.md
@@ -64,10 +64,11 @@ For more details on how to enable and use the trackers, Please see, [the experim
 ## Data Support
 Users can pass training data as either a single file or a Hugging Face dataset ID using the `--training_data_path` argument along with other arguments required for various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below). If user choose to pass a file, it can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.
 
-
 Below, we mention the list of supported data usecases via `--training_data_path` argument. For details of our advanced data preprocessing see more details in [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).
 
-## Supported Data Formats
+EOS tokens are added to all data formats listed below except for pretokenized at this time. For more info, see [pretokenized](#4-pre-tokenized-datasets).
+
+## Supported Data File Formats
 We support the following file formats via `--training_data_path` argument
 
 Data Format | Tested Support
@@ -169,15 +170,17 @@ For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-
 
 The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to perform masking of text ensuring model learns only on the `assistant` responses for both single and multi turn chat.
 
-### 3. Pre tokenized datasets.
+### 4. Pre tokenized datasets.
 
 Users can also pass a pretokenized dataset (containing `input_ids` and `labels` columns) as `--training_data_path` argument e.g.
 
+At this time, the data preprocessor does not add EOS tokens to pretokenized datasets, users must ensure EOS tokens are included in their pretokenized data if needed.
+
 ```
 python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
 ```
 
-### 4. Advanced data preprocessing.
+### Advanced data preprocessing.
 
 For advanced data preprocessing support including mixing and custom preprocessing of datasets please see [this document](./docs/advanced-data-preprocessing.md).
 
From f8e1a1bd42bdc6a8c27de70f0e056d80f5ed564c Mon Sep 17 00:00:00 2001
From: Will Johnson
Date: Fri, 17 Jan 2025 13:33:09 -0500
Subject: [PATCH 2/2] Update README.md

Co-authored-by: Sukriti Sharma
Signed-off-by: Will Johnson
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 21b564aea..ff257b17c 100644
--- a/README.md
+++ b/README.md
@@ -66,7 +66,7 @@ Users can pass training data as either a single file or a Hugging Face dataset I
 
 Below, we mention the list of supported data usecases via `--training_data_path` argument. For details of our advanced data preprocessing see more details in [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).
 
-EOS tokens are added to all data formats listed below except for pretokenized at this time. For more info, see [pretokenized](#4-pre-tokenized-datasets).
+EOS tokens are added to all data formats listed below (EOS token is appended to the end of each data point, like a sentence or paragraph within the dataset), except for pretokenized data format at this time. For more info, see [pretokenized](#4-pre-tokenized-datasets).
 
 ## Supported Data File Formats
 We support the following file formats via `--training_data_path` argument
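
The README text in this series says an EOS token is appended to the end of each data point for text formats, while pretokenized data (`input_ids` and `labels` columns) is left untouched and users must add EOS themselves. A minimal Python sketch of that distinction follows. It is an illustration only: the helper names `append_eos_to_text` and `ensure_eos_pretokenized`, the `</s>` token string, and the token id `2` are hypothetical placeholders, not the preprocessor's actual code; real values would come from the model's tokenizer.

```python
from typing import Dict, List

# Assumed placeholder values; in practice these come from the tokenizer in use.
EOS_TOKEN = "</s>"
EOS_TOKEN_ID = 2


def append_eos_to_text(text: str, eos_token: str = EOS_TOKEN) -> str:
    """Append the EOS token to one text data point unless it already ends with it."""
    return text if text.endswith(eos_token) else text + eos_token


def ensure_eos_pretokenized(
    example: Dict[str, List[int]], eos_token_id: int = EOS_TOKEN_ID
) -> Dict[str, List[int]]:
    """Return a copy of a pretokenized example whose input_ids end with EOS.

    Assumes `input_ids` and `labels` are aligned, equal-length lists. The EOS id
    is also added to `labels` so the model is trained to emit it (an assumption
    about the desired behaviour, not something the patch specifies).
    """
    input_ids = list(example["input_ids"])
    labels = list(example["labels"])
    if not input_ids or input_ids[-1] != eos_token_id:
        input_ids.append(eos_token_id)
        labels.append(eos_token_id)
    return {**example, "input_ids": input_ids, "labels": labels}


text_point = append_eos_to_text("The flight was delayed again.")
token_point = ensure_eos_pretokenized({"input_ids": [5, 6], "labels": [-100, 6]})
```

Under these assumptions, a user preparing a pretokenized dataset could apply something like `ensure_eos_pretokenized` to every row before saving the file passed to `--training_data_path`.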