nemo-automodel: fsdp2 support for peft #12008

Merged
akoumpa merged 62 commits into main from akoumparouli/automodel_fsdp2_peft_fix on Feb 4, 2025

Conversation

@akoumpa akoumpa commented Jan 31, 2025

What does this PR do?

Adds support for using FSDP2 together with PEFT in nemo-automodel.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the CI label Jan 31, 2025
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from 5948ab7 to 8721756 Compare January 31, 2025 19:34
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from b21a5ff to de34b2f Compare January 31, 2025 20:53
@akoumpa akoumpa added Run CICD and removed Run CICD labels Jan 31, 2025
BoxiangW previously approved these changes Jan 31, 2025

@BoxiangW BoxiangW left a comment

LGTM thanks

@akoumpa akoumpa marked this pull request as ready for review January 31, 2025 22:42
@akoumpa akoumpa enabled auto-merge (squash) January 31, 2025 22:42
@akoumpa akoumpa disabled auto-merge January 31, 2025 22:56
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch 4 times, most recently from 70865f5 to 3665444 Compare February 3, 2025 03:20
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch 3 times, most recently from 8d222a4 to a81e480 Compare February 3, 2025 05:11
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 3, 2025
@akoumpa akoumpa changed the title from "Use patch_linear_module for FSDP2" to "nemo-automodel: fsdp2 support for peft" Feb 3, 2025
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from d337aa8 to e66d469 Compare February 3, 2025 06:16
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from 04eaaf0 to ba03e8e Compare February 3, 2025 06:23
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 3, 2025
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from da1bd02 to c4bc136 Compare February 3, 2025 21:24
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from 87a97c7 to 0731898 Compare February 3, 2025 21:31
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 3, 2025
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from 1cbd6f6 to 1c088eb Compare February 4, 2025 03:05
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 4, 2025
BoxiangW previously approved these changes Feb 4, 2025

@BoxiangW BoxiangW left a comment

LGTM Thanks

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/automodel_fsdp2_peft_fix branch from feb6136 to 65c7b20 Compare February 4, 2025 04:27
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
@akoumpa akoumpa added Run CICD and removed Run CICD labels Feb 4, 2025
@akoumpa akoumpa enabled auto-merge (squash) February 4, 2025 07:51
@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 30.30%. Comparing base (48f10af) to head (b013f69).
Report is 8 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #12008      +/-   ##
==========================================
- Coverage   30.30%   30.30%   -0.01%     
==========================================
  Files        1387     1387              
  Lines      176283   176293      +10     
  Branches    27091    27096       +5     
==========================================
- Hits        53422    53420       -2     
- Misses     118775   118788      +13     
+ Partials     4086     4085       -1     

github-actions bot commented Feb 4, 2025

[🤖]: Hi @akoumpa 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

I'm just a bot, so I'll leave it up to you what to do next.

//cc @pablo-garay @ko3n1g

@akoumpa akoumpa merged commit b8bae7e into main Feb 4, 2025
219 of 221 checks passed
@akoumpa akoumpa deleted the akoumparouli/automodel_fsdp2_peft_fix branch February 4, 2025 15:23
BoxiangW added a commit that referenced this pull request Feb 7, 2025
* Use patch_linear_module for FSDP2

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Use patch_linear_module for FSDP2

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add fsdp2 strategy to test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add --num-nodes option

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add --num-nodes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rename

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add to_cpu in utils

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use to_cpu

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* shard adapter weights

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use to_cpu from utils

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use get_automodel_from_trainer

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove some mcore logic from save_checkpoint

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* call to_cpu in strategy's save_checkpoint

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix ckpt saving

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstrings

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* docstrings

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add missing import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* simplify

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* minor fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pylint fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pylint fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pylint fix :/

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pylint fix :/

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pylint fix :/

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pylint fix :/

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* # noqa: F821

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add docstrings

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* noqa

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Feb 10, 2025
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
4 participants