
fix: use main_process_first instead of broadcast_object_list #458

Open · wants to merge 6 commits into main from fix-gpu-broadcasting
Conversation

@willmj (Collaborator) commented Feb 7, 2025

Description of the change

In the initial version of the data preprocessor, we implemented this check to ensure that data processing occurs only on a single process (rank 0), preventing unnecessary computation on other ranks. The processed dataset is then broadcast to the other ranks in a multi-node setup.

However, during actual runs with datasets of several gigabytes, we encountered broadcast timeouts surfacing as NCCL errors. Using accelerator.main_process_first should resolve this issue.

main_process_first works well in a single-node scenario, where all processes on the node can be assumed to share the same storage location for the processed data. In a multi-node scenario that assumption may not hold, and may need to be explored further.
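The single-node case relies on a shared on-disk cache: the main process materializes the processed dataset, and the other local processes, entering the section afterwards, load the same cached result instead of recomputing. A minimal stdlib sketch of that cache-or-compute pattern (the file name and `process_fn` are illustrative, not the actual preprocessor code):

```python
import json
import os
import tempfile

def load_or_process(cache_path, process_fn, raw_records):
    """Return processed records, computing them only if no cache exists.

    Mirrors how a fingerprinted dataset cache is reused: the first
    process to run writes the cache; later processes just read it.
    """
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)       # cache hit: no recomputation
    processed = [process_fn(r) for r in raw_records]
    with open(cache_path, "w") as f:
        json.dump(processed, f)       # cache miss: write for other ranks
    return processed

cache = os.path.join(tempfile.mkdtemp(), "train.json")
fmt = lambda r: {"output": r.strip()}

first = load_or_process(cache, fmt, [" a ", " b "])   # "rank 0": computes
second = load_or_process(cache, fmt, [" a ", " b "])  # "rank 1": reads cache
print(first == second)  # True
```

Across nodes there is no shared filesystem cache by default, which is why the multi-node case still needs exploration.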

Needs more extensive testing. Will be tested in Travis on the main-process-first-testci branch once the token errors are resolved.

I started off using an Accelerator, but that may be too heavyweight: an Accelerator is re-initialized in SFTTrainer for tuning the model, and having two Accelerators running at once could cause problems. So I am looking into using PartialState instead, which is recommended for process control — exactly what we are doing in the preprocessor.
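The ordering that main_process_first provides can be sketched with a two-thread stand-in. Real code would use `accelerate.PartialState().main_process_first()`; the barrier-based context manager below is a simplified illustration of the semantics, not accelerate's implementation:

```python
import threading

def main_process_first(rank, barrier):
    """Sketch of main_process_first semantics: non-main ranks block at a
    barrier before the critical section; the main rank releases them
    only after it has finished the section."""
    class _Ctx:
        def __enter__(self):
            if rank != 0:
                barrier.wait()   # workers: block until rank 0 is done
        def __exit__(self, *exc):
            if rank == 0:
                barrier.wait()   # main: release the waiting workers
    return _Ctx()

events = []
barrier = threading.Barrier(2)

def run(rank):
    with main_process_first(rank, barrier):
        events.append(rank)      # rank 0 always appends before rank 1

threads = [threading.Thread(target=run, args=(r,)) for r in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(events)  # [0, 1]
```

Because every process runs the body (the workers just run it after the main process), no explicit broadcast is needed — which is what removes the NCCL timeout path.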

Previous behavior:



[Rank 1] Not processing data. Setting train_dataset to None
[Rank 1] Preparing to broadcast dataset from rank 0. Current train_dataset: None
[Rank 0] Before processing, dataset_configs: [DataSetConfig(name='training_data', data_paths=['/testing/tuning/input/cc_tone_sft_format_1000_train.json'], builder=None, sampling=None, data_handlers=[DataHandlerConfig(name='apply_dataset_formatting', arguments={'fn_kwargs': {'dataset_text_field': 'output'}, 'batched': False})])]
Map (num_proc=256): 100%|█████████████████████████████████████████████████████████████| 1000/1000 [00:29<00:00, 33.72 examples/s]
[Rank 0] After processing, train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})
[Rank 0] Preparing to broadcast dataset from rank 0. Current train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})
[Rank 0] After broadcast, received train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})
/home/tuning/.local/lib/python3.11/site-packages/transformers/training_args.py:2027: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
[Rank 1] After broadcast, received train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})




After behavior:

[Main Process] Finished processing dataset. train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})
[Process 1] Waiting for main process to process the dataset...
[Process 0] Exiting function with train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})
/home/tuning/.local/lib/python3.11/site-packages/transformers/training_args.py:2027: FutureWarning: `--push_to_hub_token` is deprecated and will be removed in version 5 of 🤗 Transformers. Use `--hub_token` instead.
  warnings.warn(
[Process 1] Joined after main process. Current train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})
[Process 1] Exiting function with train_dataset: Dataset({
    features: ['output'],
    num_rows: 1000
})

PartialState (in data preprocessor on CPU):

Distributed environment: DistributedType.NO
Num processes: 1
Process index: 0
Local process index: 0
Device: mps

PartialState (in data preprocessor on multi GPU):

Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 0
Local process index: 0
Device: cuda:0
Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 2
Local process index: 2
Device: cuda:2
Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 1
Local process index: 1
Device: cuda:1
Distributed environment: DistributedType.MULTI_GPU  Backend: nccl
Num processes: 4
Process index: 3
Local process index: 3
Device: cuda:3

Accelerator (in trainer):

<accelerate.accelerator.Accelerator object at 0x309da9fd0>

Remaining steps:

  • Make sure the Accelerator is initialized correctly, since there are multiple places in sft_trainer.py where one can be created
    • Because multiple Accelerators can cause problems, looking into using PartialState instead
  • e2e tests - successful with Accelerator

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

…broadcast_object_list for multi-GPU

Signed-off-by: Will Johnson <mwjohnson728@gmail.com>

github-actions bot commented Feb 7, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the fix label Feb 7, 2025
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@willmj force-pushed the fix-gpu-broadcasting branch from 44e7fb1 to 58a113c on February 14, 2025 13:42
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@willmj (Collaborator, Author) commented Feb 14, 2025

Streaming changes + main process first together:

100%|██████████| 2200/2200 [25:37<00:00,  1.43it/s]

Model location: /testing/tuning/output/granite31-8b/ft/tone_20250214_1505-streaming-dataconfig-10-datasets/save_model
F1 score:

"f1": {
"macro": 0.29570348178371664,
"micro": 0.46292134831460674
},

Inference:

input: ### Text: It was unexpected, and it made me really happy\n\n### Response:
response:  satisfied
stop_reason: 2

@willmj willmj marked this pull request as ready for review February 14, 2025 20:00
Signed-off-by: Will Johnson <mwjohnson728@gmail.com>
@kmehant (Collaborator) commented Feb 20, 2025

@willmj can we fast-track this and the streaming PRs? Thank you.
