docs: EOS token support (#443)
* docs: EOS token support

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

* Update README.md

Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

---------

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Co-authored-by: Sukriti Sharma <Ssukriti@users.noreply.github.com>
willmj and Ssukriti authored Jan 17, 2025
1 parent d03072b commit 2a9faec
Showing 1 changed file with 7 additions and 4 deletions.
11 changes: 7 additions & 4 deletions README.md
@@ -64,10 +64,11 @@ For more details on how to enable and use the trackers, Please see, [the experim
## Data Support
Users can pass training data as either a single file or a Hugging Face dataset ID using the `--training_data_path` argument, along with the other arguments required for the various [use cases](#use-cases-supported-with-training_data_path-argument) (see details below). If users choose to pass a file, it can be in any of the [supported formats](#supported-data-formats). Alternatively, you can use our powerful [data preprocessing backend](./docs/advanced-data-preprocessing.md) to preprocess datasets on the fly.


Below, we list the data use cases supported via the `--training_data_path` argument. For details of our advanced data preprocessing, see [Advanced Data Preprocessing](./docs/advanced-data-preprocessing.md).

## Supported Data Formats
An EOS token is appended to the end of each data point (for example, a sentence or paragraph within the dataset) for all data formats listed below, except the pretokenized data format at this time. For more information, see [pretokenized](#4-pre-tokenized-datasets).
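
As a rough illustration of what appending an EOS token to each data point means, here is a minimal sketch in plain Python. The `append_eos` helper and the `</s>` token string are hypothetical stand-ins chosen for this example; in practice the EOS token comes from the tokenizer of the model being tuned, and the skip-if-present check is a choice made for the sketch, not a description of the library's implementation.

```python
def append_eos(data_points, eos_token="</s>"):
    """Return a new list with eos_token appended to each data point.

    Entries that already end with eos_token are left unchanged
    (an assumption of this sketch, not documented library behavior).
    """
    result = []
    for text in data_points:
        if text.endswith(eos_token):
            result.append(text)
        else:
            result.append(text + eos_token)
    return result


# "</s>" is a hypothetical EOS string; real models define their own.
samples = ["Tweet text: hello\nLabel: no complaint", "Already terminated.</s>"]
processed = append_eos(samples)
```

Terminating each data point this way gives the model an explicit signal for where a sample ends.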

## Supported Data File Formats
We support the following file formats via the `--training_data_path` argument:

Data Format | Tested Support
@@ -169,15 +170,17 @@ For the [granite model above](https://huggingface.co/ibm-granite/granite-3.0-8b-

The code internally uses [`DataCollatorForCompletionOnlyLM`](https://github.com/huggingface/trl/blob/main/trl/trainer/utils.py#L93) to mask the text, ensuring the model learns only from the `assistant` responses in both single-turn and multi-turn chat.
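
To make the idea of completion-only masking concrete, the sketch below shows a simplified, single-turn version in plain Python. It is not the TRL implementation: the `mask_before_response` helper and the token ids are hypothetical, and the only fixed convention it relies on is that a label of `-100` is ignored by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_before_response(input_ids, response_template_ids):
    """Build labels that keep only the tokens after the response template.

    Everything up to and including the first occurrence of
    response_template_ids is replaced by IGNORE_INDEX.
    """
    n = len(response_template_ids)
    for start in range(len(input_ids) - n + 1):
        if input_ids[start:start + n] == response_template_ids:
            end = start + n
            return [IGNORE_INDEX] * end + list(input_ids[end:])
    # Template not found: mask the whole sequence so it adds nothing to the loss
    # (a choice made for this sketch).
    return [IGNORE_INDEX] * len(input_ids)


# Hypothetical token ids: 7, 8 stand in for the tokenized assistant marker.
input_ids = [5, 6, 7, 8, 9, 10]
labels = mask_before_response(input_ids, [7, 8])
```

Here only the last two tokens keep their labels, so only the response portion would contribute to the loss. A multi-turn variant would repeat the same masking for each user/assistant pair.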

### 3. Pre tokenized datasets.
### 4. Pre tokenized datasets.

Users can also pass a pretokenized dataset (containing `input_ids` and `labels` columns) as the `--training_data_path` argument. At this time, the data preprocessor does not add EOS tokens to pretokenized datasets, so users must ensure EOS tokens are included in their pretokenized data if needed.

An example invocation:

```
python tuning/sft_trainer.py ... --training_data_path twitter_complaints_tokenized_with_maykeye_tinyllama_v0.arrow
```
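
Because EOS tokens are not added to pretokenized data, one way to add them beforehand is sketched below. This is a minimal illustration under stated assumptions: the `ensure_eos` helper is hypothetical, and `EOS_TOKEN_ID = 2` is a placeholder that should be replaced with the `eos_token_id` of the tokenizer actually used to produce the dataset.

```python
def ensure_eos(example, eos_token_id):
    """Append eos_token_id to input_ids and labels unless already present."""
    input_ids = list(example["input_ids"])
    labels = list(example["labels"])
    if not input_ids or input_ids[-1] != eos_token_id:
        input_ids.append(eos_token_id)
        # The EOS label is kept unmasked so the model can learn to stop.
        labels.append(eos_token_id)
    return {"input_ids": input_ids, "labels": labels}


EOS_TOKEN_ID = 2  # hypothetical value; look up the real one on your tokenizer

example = {"input_ids": [1, 15043, 3186], "labels": [-100, 15043, 3186]}
fixed = ensure_eos(example, EOS_TOKEN_ID)
```

With the Hugging Face `datasets` library, a function of this shape could be applied to every row with `Dataset.map` before the dataset is saved and passed to `--training_data_path`.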

### 4. Advanced data preprocessing.
### Advanced data preprocessing.

For advanced data preprocessing support, including mixing and custom preprocessing of datasets, please see [this document](./docs/advanced-data-preprocessing.md).
