Merge branch 'main' into fix/resolve-warnings
Luka-D authored Feb 18, 2025
2 parents 1caf4a9 + fb3ace8 commit d7a1373
Showing 22 changed files with 805 additions and 53 deletions.
9 changes: 6 additions & 3 deletions README.md
@@ -13,6 +13,7 @@
- [Prompt Tuning](#prompt-tuning)
- [Fine Tuning](#fine-tuning)
- [FMS Acceleration](#fms-acceleration)
- [Extended Pre-Training](#extended-pre-training)
- [Inference](#inference)
- [Running a single example](#running-a-single-example)
- [Running multiple examples](#running-multiple-examples)
@@ -133,7 +134,7 @@ Example: Train.json
},
...
]`
data_formatter_template: `### Input: {{input}} \n\n##Label: {{output}}`
data_formatter_template: `### Input: {{input}} \n\n## Label: {{output}}`

Formatting happens on the fly while tuning. The keys in the template should match fields in the dataset file. The `response template` corresponding to the above template also needs to be supplied; in this case, `response template` = `\n## Label:`.

@@ -299,7 +300,7 @@ python tuning/sft_trainer.py \
--gradient_accumulation_steps 4 \
--learning_rate 1e-5 \
--response_template "\n## Label:" \
--data_formatter_template: "### Input: {{input}} \n\n##Label: {{output}}"
--data_formatter_template: "### Input: {{input}} \n\n## Label: {{output}}"

```

@@ -322,7 +323,6 @@ Below example runs multi-GPU fine tuning on 8 GPUs with FSDP:
# OUTPUT_PATH=out # Path to the output folder where the checkpoints are saved

accelerate launch \
--main_process_port $MASTER_PORT \
--config_file fixtures/accelerate_fsdp_defaults.yaml \
--num_processes=8 \
--main_process_port=$MASTER_PORT \
@@ -829,6 +829,9 @@ Number of trainable parameters = 13,631,488
The `fms_acceleration.cli` can do more to search for all available configs, plugins and arguments, [see the advanced flow](https://github.com/foundation-model-stack/fms-acceleration#advanced-flow).


## Extended Pre-Training

We also support extended pre-training, where users may want to further pretrain a model on a large number of samples. Please refer to our separate doc on [EPT Use Cases](./docs/ept.md).

## Inference
Currently, we do *not* offer inference support as part of the library, but we provide a standalone script for running inference on tuned models for testing purposes. For a full list of options run `python scripts/run_inference.py --help`. Note that no data formatting / templating is applied at inference time.
25 changes: 21 additions & 4 deletions docs/advanced-data-preprocessing.md
@@ -60,6 +60,10 @@ definitions:
type: float
builder:
type: string
rename_columns:
type: object
retain_columns:
type: object
data_paths:
type: array
items:
@@ -118,6 +122,8 @@ Users can create a data config file in any of YAML or JSON format they choose (w
- `name` (optional, str): A unique identifier for the dataset.
- `data_paths` (optional, list): A `list` of file paths or directories containing the dataset.
- `builder` (optional, str): Specifies a [Hugging Face dataset builder](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/loading_methods#datasets.load_dataset.path), if applicable.
- `rename_columns` (optional, dict[str, str]): Specifies a dictionary of columns to rename, e.g. `{"old_name": "new_name"}`, applied at dataset load time. *Applied before `retain_columns` if both are specified*.
- `retain_columns` (optional, list[str]): Specifies a list of columns to retain, e.g. `["input_ids", "labels"]`; every other column is dropped at dataset load time. *Applied strictly after `rename_columns` if both are specified*.
- `sampling` (optional, float): The sampling ratio (0.0 to 1.0) with which to sample a dataset in case of interleaving.
- `data_handlers` (optional, list): A list of data handler configurations which preprocess the dataset.

@@ -149,6 +155,10 @@ Not Supported:
Sampling is currently not supported across multiple data paths defined inside a single dataset definition.
All data paths specified inside one dataset are [concatenated](https://huggingface.co/docs/datasets/v3.2.0/en/process#concatenate) after loading, as sketched below, while across datasets users can specify [mixing via sampling datasets](#data-mixing).
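For illustration, a minimal sketch of a single dataset definition with multiple `data_paths` (the dataset name and file paths below are placeholders); both files are loaded and then concatenated into one dataset:

```
datasets:
  - name: concatenated_text_dataset
    data_paths:
      - "<path-to-first-split.jsonl>"
      - "<path-to-second-split.jsonl>"
```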


Additionally, while loading the dataset, users can specify which columns to rename via the `rename_columns` argument and which to retain via the `retain_columns` argument described above.
These operations are applied *strictly in the order rename followed by retain*, so an old column name that has been renamed is no longer available to retain, and users should keep this in mind when combining the two. The code throws a `ValueError` if a column requested to be renamed via the rename argument is also listed in the retain argument.
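As a minimal sketch (the dataset name, column names, and file path here are placeholder assumptions), a dataset definition that renames `input`/`output` to `instruction`/`response` and then retains only the renamed columns could look like:

```
datasets:
  - name: rename_and_retain_example
    rename_columns:
      "input": "instruction"
      "output": "response"
    retain_columns:
      - "instruction"
      - "response"
    data_paths:
      - "<path-to-dataset.jsonl>"
```

Listing the old name `input` under `retain_columns` here would raise the `ValueError` described above, since the rename is applied first.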

### How can users specify data handlers?

@@ -204,14 +214,21 @@ Users can also pass any number of `kwargs` arguments required for each data hand

#### Preexisting data handlers
This library currently supports the following [preexisting data handlers](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py#L156):
- `tokenize_and_apply_input_masking`:
Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
- `apply_dataset_formatting`:
Formats a dataset by appending an EOS token to a specified field.
- `add_tokenizer_eos_token`:
Appends the tokenizer's EOS token to a specified dataset field.
- `apply_custom_data_formatting_template`:
Applies a custom template (e.g., Alpaca style) to format dataset elements.
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_template.yaml)
- `tokenize_and_apply_input_masking`:
Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml)
- `apply_custom_jinja_template`:
Applies a custom jinja template (e.g., Alpaca style) to format dataset elements.
By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_jinja_template.yaml)
- `apply_tokenizer_chat_template`:
Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.
- `duplicate_columns`:
Duplicates one column of the dataset to another column.

These handlers can be requested by name, and users can look up each handler's function arguments [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py).
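As a rough sketch (the dataset name, file path, and column names are placeholder assumptions), the `duplicate_columns` handler listed above could be requested by name with its `fn_kwargs` like this:

```
datasets:
  - name: pre_tokenized_dataset
    data_paths:
      - "<path-to-pretokenized-dataset.json>"
    data_handlers:
      - name: duplicate_columns
        arguments:
          remove_columns: all
          batched: false
          fn_kwargs:
            old_column: "input_ids"
            new_column: "labels"
```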

112 changes: 112 additions & 0 deletions docs/ept.md
@@ -0,0 +1,112 @@
# Extended Pre-Training Support
Our library also supports Extended Pre-Training (EPT), which is generally useful when users want to train a pretrained model on a large number of samples. The training behaviour of EPT is similar to that of pretraining: the model runs through the entire available corpus and is trained on the whole set of tokens without any specific masking.

See [below](#additional-information) for information on when this document was last updated and the release which supports this feature.

## Packing support

We support training via `packing` of dataset samples by specifying `--packing=True` in the command line parameters. Users can also specify `--max_seq_len=<value such as 4096 or 8192>` to set the maximum sequence length of each chunk after packing.

Below we provide details on how to use different styles of datasets with the library.

## Non-Tokenized Dataset

### Single Non-Tokenized Dataset
Users can pass a single dataset to the library by using a [data_config](./advanced-data-preprocessing.md#data-config).
Say you have a `JSONL` data file where each line contains text you want to perform EPT on; you can create a `data_config` for the dataset as follows.

Example dataset:

```
{"Tweet":"@HMRCcustomers No this is my first job","ID":0,"Label":2,"text_label":"no complaint","output":"### Text: @HMRCcustomers No this is my first job\n\n### Label: no complaint"}
{"Tweet":"@KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.","ID":1,"Label":2,"text_label":"no complaint","output":"### Text: @KristaMariePark Thank you for your interest! If you decide to cancel, you can call Customer Care at 1-800-NYTIMES.\n\n### Label: no complaint"}
...
```

Sample data config for the above use case.
```
dataprocessor:
type: default
datasets:
- name: non_tokenized_text_dataset
data_paths:
- "<path-to-the-jsonl-dataset>"
data_handlers:
- name: add_tokenizer_eos_token
arguments:
remove_columns: all
batched: false
fn_kwargs:
dataset_text_field: "dataset_text_field"
```

The command line passed to the library should then include the following:

```
--data_config <path to the data config> --packing=True --max_seq_len 8192
```

Please note that for a non-tokenized dataset our code appends `EOS_TOKEN` to each line of the specified text field (e.g. the `Tweet` column) before passing it on as a dataset.

### Multiple Non-Tokenized Datasets

If a user wants to utilize multiple datasets and [`sample`](./advanced-data-preprocessing.md#how-the-user-can-write-data-configs) from them, this can be achieved by specifying multiple datasets in the data config with different sampling ratios.

Sample data config for sampling among multiple datasets
```
dataprocessor:
type: default
sampling_stopping_strategy: first_exhausted
seed: 66
datasets:
- name: non_tokenized_text_dataset_1
sampling: 0.3
data_paths:
- "FILE_PATH"
data_handlers:
- name: apply_custom_data_formatting_template
arguments:
remove_columns: all
batched: false
fn_kwargs:
dataset_text_field: "dataset_text_field"
template: "dataset_template"
- name: non_tokenized_text_dataset_2
sampling: 0.4
data_paths:
- "FILE_PATH"
data_handlers:
- name: apply_custom_data_formatting_template
arguments:
remove_columns: all
batched: false
fn_kwargs:
dataset_text_field: "dataset_text_field"
template: "dataset_template"
- name: non_tokenized_text_dataset_3
sampling: 0.3
data_paths:
- "FILE_PATH"
data_handlers:
- name: apply_custom_data_formatting_template
arguments:
remove_columns: all
batched: false
fn_kwargs:
dataset_text_field: "dataset_text_field"
template: "dataset_template"
```

NOTE: More in-depth documentation of `sampling_stopping_strategy` and of how to specify data mixing parameters in the `data_config` is covered in the [data mixing](./advanced-data-preprocessing.md#data-mixing) section of the advanced data preprocessing documentation.

Here, too, the command line arguments would be:

```
--data_config <path to the data config> --packing=True --max_seq_len 8192
```

The code again appends `EOS_TOKEN` to the non-tokenized data before using it. Also note that the `dataset_text_field` is assumed to be the same across all datasets for now.

### Additional Information
This feature is supported in releases after [v2.3.1](https://github.com/foundation-model-stack/fms-hf-tuning/releases/tag/v2.3.1) of this library.
Last updated on: 12-02-2025
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -29,12 +29,12 @@ classifiers=[
dependencies = [
"numpy>=1.26.4,<2.0",
"accelerate>=0.20.3,!=0.34,<1.1",
"transformers>=4.45,<4.46",
"transformers>=4.46,<4.48.2",
"torch>=2.2.0,<2.5",
"sentencepiece>=0.1.99,<0.3",
"tokenizers>=0.13.3,<1.0",
"tqdm>=4.66.2,<5.0",
"trl>=0.9.3,<0.12",
"trl>=0.13,<0.15",
"peft>=0.8.0,<0.14",
"protobuf>=5.28.0,<6.0.0",
"datasets>=2.15.0,<3.0",
9 changes: 9 additions & 0 deletions tests/artifacts/predefined_data_configs/__init__.py
@@ -22,6 +22,9 @@
DATA_CONFIG_APPLY_CUSTOM_TEMPLATE_YAML = os.path.join(
PREDEFINED_DATA_CONFIGS, "apply_custom_template.yaml"
)
DATA_CONFIG_APPLY_CUSTOM_JINJA_TEMPLATE_YAML = os.path.join(
PREDEFINED_DATA_CONFIGS, "apply_custom_jinja_template.yaml"
)
DATA_CONFIG_PRETOKENIZE_JSON_DATA_YAML = os.path.join(
PREDEFINED_DATA_CONFIGS, "pretokenized_json_data.yaml"
)
@@ -31,3 +34,9 @@
DATA_CONFIG_MULTIPLE_DATASETS_SAMPLING_YAML = os.path.join(
PREDEFINED_DATA_CONFIGS, "multiple_datasets_with_sampling.yaml"
)
DATA_CONFIG_DUPLICATE_COLUMNS = os.path.join(
PREDEFINED_DATA_CONFIGS, "duplicate_columns.yaml"
)
DATA_CONFIG_RENAME_RETAIN_COLUMNS = os.path.join(
PREDEFINED_DATA_CONFIGS, "rename_retain_columns.yaml"
)
15 changes: 15 additions & 0 deletions tests/artifacts/predefined_data_configs/apply_custom_jinja_template.yaml
@@ -0,0 +1,15 @@
dataprocessor:
type: default
datasets:
- name: apply_custom_data_jinja_template
data_paths:
- "FILE_PATH"
data_handlers:
- name: apply_custom_jinja_template
arguments:
remove_columns: all
batched: false
fn_kwargs:
dataset_text_field: "dataset_text_field"
template: "dataset_template"
add_eos_token: true
3 changes: 2 additions & 1 deletion tests/artifacts/predefined_data_configs/apply_custom_template.yaml
@@ -11,4 +11,5 @@ datasets:
batched: false
fn_kwargs:
dataset_text_field: "dataset_text_field"
template: "dataset_template"
template: "dataset_template"
add_eos_token: true
14 changes: 14 additions & 0 deletions tests/artifacts/predefined_data_configs/duplicate_columns.yaml
@@ -0,0 +1,14 @@
dataprocessor:
type: default
datasets:
- name: pre_tokenized_with_only_input_ids
data_paths:
- "FILE_PATH"
data_handlers:
- name: duplicate_columns
arguments:
remove_columns: all
batched: false
fn_kwargs:
old_column: "input_ids"
new_column: "labels"
20 changes: 20 additions & 0 deletions tests/artifacts/predefined_data_configs/rename_retain_columns.yaml
@@ -0,0 +1,20 @@
dataprocessor:
type: default
datasets:
- name: text_dataset_input_output_masking
rename_columns:
"input" : "instruction"
"output" : "response"
retain_columns:
- "instruction"
- "response"
data_paths:
- "FILE_PATH"
data_handlers:
- name: tokenize_and_apply_input_masking
arguments:
remove_columns: all
batched: false
fn_kwargs:
input_field_name: instruction
output_field_name: response
3 changes: 2 additions & 1 deletion tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml
@@ -11,4 +11,5 @@ datasets:
batched: false
fn_kwargs:
input_field_name: input
output_field_name: output
output_field_name: output
add_eos_token: true
4 changes: 4 additions & 0 deletions tests/artifacts/testdata/__init__.py
@@ -53,6 +53,10 @@
TWITTER_COMPLAINTS_TOKENIZED_JSON = os.path.join(
JSON_DATA_DIR, "twitter_complaints_tokenized_with_maykeye_tinyllama_v0.json"
)
TWITTER_COMPLAINTS_TOKENIZED_ONLY_INPUT_IDS_JSON = os.path.join(
JSON_DATA_DIR,
"twitter_complaints_tokenized_with_maykeye_tinyllama_v0_only_input_ids.json",
)
TWITTER_COMPLAINTS_TOKENIZED_JSONL = os.path.join(
JSONL_DATA_DIR, "twitter_complaints_tokenized_with_maykeye_tinyllama_v0.jsonl"
)
@@ -0,0 +1,32 @@
[
{
"input_ids": [1, 16121, 9211, 31871, 1662, 31866, 31856, 7416, 17632, 369, 1398, 433, 322, 629, 712, 1784, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 1662, 31892, 1260, 31825, 11273, 503, 31857, 632, 5284, 365, 329, 553, 1280, 31905, 960, 365, 6194, 289, 11025, 31844, 365, 473, 987, 12207, 4218, 389, 31822, 31853, 31854, 31886, 31852, 31852, 31854, 11300, 31847, 3873, 1507, 31843, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 960, 312, 473, 31876, 31824, 685, 629, 31822, 31878, 4449, 5861, 287, 1662, 1299, 1574, 1590, 31833, 263, 1360, 1299, 1574, 289, 623, 31822, 31824, 16346, 312, 31876, 31836, 994, 277, 3560, 567, 31843, 672, 322, 260, 29458, 288, 629, 14881, 31843, 2628, 1423, 1662, 31858, 601, 1662, 31858, 601, 8378, 13, 13, 8458, 31922, 21597, 31871, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 1662, 7766, 1078, 8123, 17561, 308, 3456, 1833, 975, 10849, 291, 4372, 15379, 504, 10011, 2368, 1512, 31822, 31855, 31852, 31852, 1243, 31843, 3007, 322, 433, 31843, 13, 13, 8458, 31922, 21597, 31871, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 12371, 2208, 26657, 31844, 560, 14138, 31843, 21994, 1257, 24870, 496, 31829, 8198, 19057, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 1662, 31836, 651, 307, 395, 13094, 672, 1467, 701, 333, 515, 31844, 504, 1097, 2266, 282, 305, 781, 31902, 21626, 31822, 31824, 5540, 397, 560, 5253, 662, 365, 31876, 263, 4985, 31854, 8903, 16801, 291, 612, 31925, 2011, 1129, 31824, 31843, 1358, 31873, 19919, 31824, 31865, 31829, 469, 2131, 31874, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 1662, 31900, 307, 31837, 473, 382, 685, 266, 3195, 17532, 329, 260, 1173, 9363, 352, 1671, 1881, 646, 619, 31822, 31882, 5556, 504, 2091, 31822, 31882, 31843, 31855, 31861, 405, 499, 382, 863, 260, 31822, 31878, 4449, 2540, 2042, 31902, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 1662, 14390, 16373, 337, 312, 435, 697, 1579, 291, 266, 3925, 322, 1434, 291, 3877, 31843, 1456, 365, 499, 1419, 562, 433, 31902, 13, 13, 8458, 31922, 21597, 31871, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 7265, 7550, 389, 1662, 31856, 2226, 11596, 27771, 898, 31843, 3259, 647, 312, 498, 288, 635, 31844, 518, 3822, 397, 2168, 28910, 31873, 13627, 4107, 1708, 31843, 312, 31876, 608, 1090, 629, 10279, 289, 1662, 29966, 31831, 5605, 13, 13, 8458, 31922, 21597, 31871, 9566]
},
{
"input_ids": [1, 16121, 9211, 31871, 1662, 31884, 1450, 7064, 31847, 6538, 30894, 4472, 289, 362, 828, 31843, 864, 685, 541, 9932, 843, 584, 18694, 31986, 13, 13, 8458, 31922, 21597, 31871, 697, 9566]
}
]
2 changes: 1 addition & 1 deletion tests/build/test_launch_script.py
@@ -46,7 +46,7 @@
"num_train_epochs": 5,
"per_device_train_batch_size": 4,
"per_device_eval_batch_size": 4,
"gradient_accumulation_steps": 4,
"gradient_accumulation_steps": 1,
"learning_rate": 0.00001,
"weight_decay": 0,
"warmup_ratio": 0.03,
