Skip to content

feat: support exp manager checkpointing with object store via Multi-Storage Client #12747

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
May 12, 2025

Conversation

jayya2
Copy link
Contributor

@jayya2 jayya2 commented Mar 23, 2025

Important

The Update branch button must only be pressed in very rare occassions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Add support for using multi-storage-client for the Exp Manager checkpointing mechanism so that users who haven't migrated to new checkpoints can still take advantage of the object stores.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the core Changes to NeMo Core label Mar 23, 2025
Signed-off-by: jayya2 <jayya2@users.noreply.github.com>
jayya2 and others added 2 commits April 1, 2025 00:01
Signed-off-by: jayya2 <jayya2@users.noreply.github.com>
@jayya2 jayya2 marked this pull request as ready for review April 1, 2025 16:11
Copy link
Collaborator

@titu1994 titu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall ok, but the code is getting super complicated.
Can you avoid trying to add is_msc_* to every single check and simply separate the code paths to be completely independent of each other if msc is there or not?

Right now this mix match of merging msc logic with older logic makes the code unreadable.

@jayya2
Copy link
Contributor Author

jayya2 commented Apr 2, 2025

@titu1994
thanks for the quick review, I addressed most of the comments:

  • Have isolated functions to wrap up MSC calls to prevent a mix of logic
  • Reduce the number of MSC url checks
  • Use full name for msc (after second thought I don't want to mix both multiplestorageclient and msc so I would rather have long name everywhere)

Let me know if further refactoring is required, thanks!

titu1994
titu1994 previously approved these changes Apr 2, 2025
Copy link
Collaborator

@titu1994 titu1994 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much cleaner than before, thank you !

@jayya2 jayya2 changed the title feat: add support for checkpointing with multistorageclient for Exp M… feat: Support object store checkpointing with Multi-Storage Client for Exp Manager Apr 3, 2025
@jayya2 jayya2 changed the title feat: Support object store checkpointing with Multi-Storage Client for Exp Manager feat: support exp manager checkpointing with object store via Multi-Storage Client Apr 3, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 22, 2025
@ko3n1g ko3n1g removed the Run CICD label Apr 22, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 22, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 22, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 22, 2025
@ko3n1g ko3n1g added Run CICD and removed Run CICD labels Apr 22, 2025
@ko3n1g
Copy link
Collaborator

ko3n1g commented Apr 23, 2025

@jayya2 @titu1994 : This PR got affected by multiple infra issues. Please don't relabel this PR, we will restart failed jobs for you and try to get this merged with as little overhead as possible. TIA!

@jayya2
Copy link
Contributor Author

jayya2 commented Apr 30, 2025

@ericharper @titu1994 @ko3n1g are the infra issues resolved? saw some of the tests retried and passed but still 2 failing but seems unrelated to this change

@ko3n1g ko3n1g added Run CICD and removed Run CICD labels May 6, 2025
Copy link
Collaborator

@ericharper ericharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-approving

@ericharper ericharper merged commit b9498ce into NVIDIA:main May 12, 2025
269 of 270 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants