Merge branch 'main' into fix/resolve-warnings
Showing 22 changed files with 805 additions and 53 deletions.
@@ -0,0 +1,112 @@
# Extended Pre-Training Support
Our library also supports Extended Pre-Training (EPT), which is generally useful when users want to train a pretrained model on a large number of samples. The training behaviour of EPT is similar to that of pretraining: users typically want to make sure the model runs through the entire corpus of available data and is trained on the whole set of tokens without any specific masking.

See [below](#additional-information) for information on when this document was last updated and the release which supports this feature.

## Packing support

We support training via `packing` of dataset samples by specifying `--packing=True` in the command line parameters. Users can choose to specify `--max_seq_len=<value like 4k/8k>` to provide the maximum sequence length of each chunk post packing.
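
For intuition, packing concatenates the tokenized samples into one long token stream and slices it into fixed-length chunks, so no compute is wasted on padding. A minimal sketch of the idea (not the library's actual implementation; `chunk_size` plays the role of `--max_seq_len`):

```
def pack_examples(tokenized_samples, chunk_size=8192):
    # Concatenate the token ids of all samples into one long stream.
    stream = []
    for sample in tokenized_samples:
        stream.extend(sample["input_ids"])
    # Slice the stream into fixed-length chunks; the trailing partial
    # chunk is dropped here for simplicity.
    return [
        {"input_ids": stream[i : i + chunk_size]}
        for i in range(0, len(stream) - chunk_size + 1, chunk_size)
    ]
```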

We provide details below on how to use different styles of datasets with the library.

## Non-Tokenized Dataset

### Single Non-Tokenized Dataset
Users can pass a single dataset to the library by using a [data_config](./advanced-data-preprocessing.md#data-config).
Let's say you have a `JSONL` data file that contains text to be trained on in each line, and you want to perform EPT on it; you can create a `data_config` for the dataset in the following manner.

Example dataset:

```
{"Tweet":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
{"Tweet":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
...
```

Sample data config for the above use case:
```
dataprocessor:
  type: default
datasets:
  - name: non_tokenized_text_dataset
    data_paths:
      - "<path-to-the-jsonl-dataset>"
    data_handlers:
      - name: add_tokenizer_eos_token
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
```

And the command line passed to the library should include the following.

```
--data_config <path to the data config> --packing=True --max_seq_len 8192
```
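
For context, these flags go alongside the usual tuning arguments. A hypothetical full invocation might look like the following (the `tuning.sft_trainer` entry point and the placeholder model/output paths are assumptions, not taken from this document):

```
python -m tuning.sft_trainer \
  --model_name_or_path <model-name-or-path> \
  --output_dir <output-dir> \
  --data_config <path to the data config> \
  --packing=True \
  --max_seq_len 8192
```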

Please note that for a non-tokenized dataset our code adds an `EOS_TOKEN` to the lines (e.g. to the `Tweet` column) before passing them on as a dataset.
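
For illustration, the effect of the `add_tokenizer_eos_token` handler is roughly the following (a sketch of the assumed behaviour, not the library's actual handler; the model path is a placeholder):

```
from transformers import AutoTokenizer

# Placeholder: load the tokenizer of the model being tuned.
tokenizer = AutoTokenizer.from_pretrained("<model-name-or-path>")

def add_eos(example, dataset_text_field="dataset_text_field"):
    # Append the tokenizer's EOS token so consecutive samples remain
    # separated once they are packed into one long sequence.
    example[dataset_text_field] = example[dataset_text_field] + tokenizer.eos_token
    return example
```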

### Multiple Non-Tokenized Datasets

A user may want to utilize multiple datasets and [`sample`](./advanced-data-preprocessing.md#how-the-user-can-write-data-configs) from them. This can be achieved by specifying multiple datasets in the data config with different sampling ratios.

Sample data config for sampling among multiple datasets:
```
dataprocessor:
  type: default
  sampling_stopping_strategy: first_exhausted
  seed: 66
datasets:
  - name: non_tokenized_text_dataset_1
    sampling: 0.3
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_data_formatting_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
            template: "dataset_template"
  - name: non_tokenized_text_dataset_2
    sampling: 0.4
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_data_formatting_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
            template: "dataset_template"
  - name: non_tokenized_text_dataset_3
    sampling: 0.3
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_data_formatting_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
            template: "dataset_template"
```

NOTE: More in-depth documentation of `sampling_stopping_strategy` and how to specify data mixing parameters in the `data_config` is covered in the [data mixing](./advanced-data-preprocessing.md#data-mixing) section of the advanced data preprocessing documentation.
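
Conceptually, this ratio-based mixing behaves like Hugging Face `datasets.interleave_datasets`; below is a minimal sketch under that assumption (the library's internals may differ, and the file paths are placeholders):

```
from datasets import interleave_datasets, load_dataset

ds1 = load_dataset("json", data_files="FILE_PATH_1", split="train")
ds2 = load_dataset("json", data_files="FILE_PATH_2", split="train")
ds3 = load_dataset("json", data_files="FILE_PATH_3", split="train")

# Draw samples with probabilities 0.3 / 0.4 / 0.3 and stop as soon as
# one dataset runs out, mirroring sampling_stopping_strategy above.
mixed = interleave_datasets(
    [ds1, ds2, ds3],
    probabilities=[0.3, 0.4, 0.3],
    seed=66,
    stopping_strategy="first_exhausted",
)
```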

Here also, the command line arguments would be:

```
--data_config <path to the data config> --packing=True --max_seq_len 8192
```

The code again adds an `EOS_TOKEN` to the non-tokenized data before using it. Also note that the `dataset_text_field` is assumed to be the same across all datasets for now.

### Additional Information
This feature is supported post [v2.3.1](https://github.com/foundation-model-stack/fms-hf-tuning/releases/tag/v2.3.1) of this library.
Last Updated On: 12-02-2025
15 changes: 15 additions & 0 deletions
tests/artifacts/predefined_data_configs/apply_custom_jinja_template.yaml
@@ -0,0 +1,15 @@
dataprocessor:
  type: default
datasets:
  - name: apply_custom_data_jinja_template
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: apply_custom_jinja_template
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            dataset_text_field: "dataset_text_field"
            template: "dataset_template"
            add_eos_token: true
14 changes: 14 additions & 0 deletions
tests/artifacts/predefined_data_configs/duplicate_columns.yaml
@@ -0,0 +1,14 @@
dataprocessor:
  type: default
datasets:
  - name: pre_tokenized_with_only_input_ids
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: duplicate_columns
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            old_column: "input_ids"
            new_column: "labels"
20 changes: 20 additions & 0 deletions
tests/artifacts/predefined_data_configs/rename_retain_columns.yaml
@@ -0,0 +1,20 @@
dataprocessor:
  type: default
datasets:
  - name: text_dataset_input_output_masking
    rename_columns:
      "input": "instruction"
      "output": "response"
    retain_columns:
      - "instruction"
      - "response"
    data_paths:
      - "FILE_PATH"
    data_handlers:
      - name: tokenize_and_apply_input_masking
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            input_field_name: instruction
            output_field_name: response
32 changes: 32 additions & 0 deletions
.../testdata/json/twitter_complaints_tokenized_with_maykeye_tinyllama_v0_only_input_ids.json
@@ -0,0 +1,32 @@
[
  {
    "input_ids": [1, 16121, 9211, 31871, 1662, 31866, 31856, 7416, 17632, 369, 1398, 433, 322, 629, 712, 1784, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 1662, 31892, 1260, 31825, 11273, 503, 31857, 632, 5284, 365, 329, 553, 1280, 31905, 960, 365, 6194, 289, 11025, 31844, 365, 473, 987, 12207, 4218, 389, 31822, 31853, 31854, 31886, 31852, 31852, 31854, 11300, 31847, 3873, 1507, 31843, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 960, 312, 473, 31876, 31824, 685, 629, 31822, 31878, 4449, 5861, 287, 1662, 1299, 1574, 1590, 31833, 263, 1360, 1299, 1574, 289, 623, 31822, 31824, 16346, 312, 31876, 31836, 994, 277, 3560, 567, 31843, 672, 322, 260, 29458, 288, 629, 14881, 31843, 2628, 1423, 1662, 31858, 601, 1662, 31858, 601, 8378, 13, 13, 8458, 31922, 21597, 31871, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 1662, 7766, 1078, 8123, 17561, 308, 3456, 1833, 975, 10849, 291, 4372, 15379, 504, 10011, 2368, 1512, 31822, 31855, 31852, 31852, 1243, 31843, 3007, 322, 433, 31843, 13, 13, 8458, 31922, 21597, 31871, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 12371, 2208, 26657, 31844, 560, 14138, 31843, 21994, 1257, 24870, 496, 31829, 8198, 19057, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 1662, 31836, 651, 307, 395, 13094, 672, 1467, 701, 333, 515, 31844, 504, 1097, 2266, 282, 305, 781, 31902, 21626, 31822, 31824, 5540, 397, 560, 5253, 662, 365, 31876, 263, 4985, 31854, 8903, 16801, 291, 612, 31925, 2011, 1129, 31824, 31843, 1358, 31873, 19919, 31824, 31865, 31829, 469, 2131, 31874, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 1662, 31900, 307, 31837, 473, 382, 685, 266, 3195, 17532, 329, 260, 1173, 9363, 352, 1671, 1881, 646, 619, 31822, 31882, 5556, 504, 2091, 31822, 31882, 31843, 31855, 31861, 405, 499, 382, 863, 260, 31822, 31878, 4449, 2540, 2042, 31902, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 1662, 14390, 16373, 337, 312, 435, 697, 1579, 291, 266, 3925, 322, 1434, 291, 3877, 31843, 1456, 365, 499, 1419, 562, 433, 31902, 13, 13, 8458, 31922, 21597, 31871, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 7265, 7550, 389, 1662, 31856, 2226, 11596, 27771, 898, 31843, 3259, 647, 312, 498, 288, 635, 31844, 518, 3822, 397, 2168, 28910, 31873, 13627, 4107, 1708, 31843, 312, 31876, 608, 1090, 629, 10279, 289, 1662, 29966, 31831, 5605, 13, 13, 8458, 31922, 21597, 31871, 9566]
  },
  {
    "input_ids": [1, 16121, 9211, 31871, 1662, 31884, 1450, 7064, 31847, 6538, 30894, 4472, 289, 362, 828, 31843, 864, 685, 541, 9932, 843, 584, 18694, 31986, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
  }
]