merge x86 and ARM for pax #338

Closed
yhtang opened this issue Oct 25, 2023 · 2 comments
Comments

@yhtang
Collaborator

yhtang commented Oct 25, 2023

No description provided.

@yhtang yhtang converted this from a draft issue Oct 25, 2023
terrykong added a commit that referenced this issue Dec 1, 2023
…371)

# Summary

- All Python packages, except for a few build dependencies, are now
installed using **pip-tools**.
- The JAX and upstream T5X/PAX containers are now built in a two-stage
procedure:
1. The **'meal kit'** stage: source packages are downloaded and wheels
are built if necessary (for TE, tensorflow-text, lingvo, etc.), but **no**
package is installed. Instead, manifest files are created in the
`/opt/pip-tools.d` folder to tell pip-tools which packages to install.
The stage is named after a meal kit, in which the ingredients are
prepared in advance while the final cooking step is deferred.
2. The **'final'** (cooking🔥) stage: pip-tools compiles the manifests
from the various container layers together and then sync-installs
everything to exactly match the resolved versions.
- Note that each downstream container **builds on top of the meal kit
image of its base container**, so that all packages and dependencies
are installed exactly once, which avoids conflicts and image bloat.
- The meal kit and final images are published as:
- mealkit: `ghcr.io/nvidia/image:mealkit` and
`ghcr.io/nvidia/image:mealkit-YYYY-MM-DD`
- final: `ghcr.io/nvidia/image:latest` and
`ghcr.io/nvidia/image:nightly-YYYY-MM-DD`
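
The final stage described above can be sketched roughly as follows. This is a minimal illustration, not the actual build script: it assumes each layer drops a `*.in` manifest into the folder (the source only names the `/opt/pip-tools.d` folder, so the file naming is hypothetical), and it only assembles the `pip-compile` and `pip-sync` command lines without running them.

```python
import tempfile
from pathlib import Path


def plan_final_stage(manifest_dir, lockfile="requirements.txt"):
    """Return the pip-tools commands a 'final' stage might run.

    Assumption: every container layer left a `*.in` manifest in
    `manifest_dir`. Nothing is executed or installed here.
    """
    # Gather the manifests from all layers in a deterministic order.
    manifests = sorted(str(p) for p in Path(manifest_dir).glob("*.in"))
    if not manifests:
        raise FileNotFoundError(f"no *.in manifests found in {manifest_dir}")
    # Resolve all manifests together into a single pinned lockfile ...
    compile_cmd = ["pip-compile", "--output-file", lockfile, *manifests]
    # ... then make the environment match that lockfile exactly.
    sync_cmd = ["pip-sync", lockfile]
    return compile_cmd, sync_cmd


# Demo with a throwaway directory standing in for /opt/pip-tools.d.
# The manifest names and contents below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "requirements-jax.in").write_text("jax\n")
    Path(tmp, "requirements-pax.in").write_text("paxml\n")
    compile_cmd, sync_cmd = plan_final_stage(tmp)
    print(" ".join(compile_cmd))
    print(" ".join(sync_cmd))
```

Resolving all manifests in one `pip-compile` call is what lets the resolver pick a single consistent version of each shared dependency across layers.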

# Additional changes to the workflows

- `/opt/jax-source` is renamed to `/opt/jax`. The `-source` suffix is
only added to packages that need compilation, e.g. XLA and TE.
- The CI workflow now runs as a matrix over CPU architectures.
- The reusable `_build_*.yaml` workflows are simplified to build only
one image for a single architecture at a time. The logic for creating
multi-arch images is relocated into the `_publish_container.yaml`
workflows and invoked during the nightly runs only.
- TE is now built as a wheel and shipped in the JAX core meal kit image.
- TE unit tests will be run in the upstream-pax image because they
depend on praxis.
- Build workflows now produce sitreps following the paradigm of #229.
- Removed the various one-off workflows for pinned CUDA/JAX versions.
- Refactored the PAX arm64 Dockerfile in preparation for #338
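
As a rough sketch of the build-per-arch, merge-at-publish split described above: the snippet below derives the published tags from the naming scheme in the summary and assembles one possible merge command. Using `docker buildx imagetools create` is an assumption (the source does not say which tool `_publish_container.yaml` uses), and the per-arch source references are hypothetical names.

```python
from datetime import date


def published_tags(image, stage, day):
    """Tags for a published image, following the scheme in the summary."""
    stamp = day.isoformat()  # YYYY-MM-DD
    if stage == "mealkit":
        names = ["mealkit", f"mealkit-{stamp}"]
    elif stage == "final":
        names = ["latest", f"nightly-{stamp}"]
    else:
        raise ValueError(f"unknown stage: {stage!r}")
    return [f"{image}:{name}" for name in names]


def merge_command(tags, per_arch_refs):
    """Build (but do not run) a command that combines single-arch images.

    Assumption: the merge is done with `docker buildx imagetools create`.
    """
    cmd = ["docker", "buildx", "imagetools", "create"]
    for tag in tags:
        cmd += ["--tag", tag]
    return cmd + list(per_arch_refs)


# Hypothetical per-arch images produced by the single-arch build jobs.
refs = [f"ghcr.io/nvidia/image:build-{arch}" for arch in ("amd64", "arm64")]
tags = published_tags("ghcr.io/nvidia/image", "final", date(2023, 12, 1))
print(" ".join(merge_command(tags, refs)))
```

Keeping the merge in the publish step means each build job stays single-arch and can be tested independently before the images are combined.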

# What remains to be done

- [ ] Update the Rosetta container build + test process to use the
upstream T5X/PAX mealkit (ghcr.io/nvidia/upstream-t5x:mealkit,
ghcr.io/nvidia/upstream-pax:mealkit) containers

# Reviewing tips

This PR requires a multitude of reviewers due to its size and scope. I'd
truly appreciate it if code owners could review any changes related to
their previous contributions. An incomplete list of reviewers and their
scopes:
- @terrykong, @ashors1, @sharathts, @maanug-nv: Rosetta, TE, T5X and PAX
MGMN tests
- @nouiz: JAX, TE and T5X build
- @joker-eph: PAX arm64 build
- @nluehr: Base image, NCCL, PAX
- @DwarKapex: base/JAX/XLA build, workflow logic

Closes #223
Closes #230 
Closes #231 
Closes #232 
Closes #233 
Closes #271
Fixes #328
Fixes #337 

Co-authored-by: Terry Kong <terryk@nvidia.com>

---------

Co-authored-by: Vladislav Kozlov <vkozlov@nvidia.com>
@yhtang
Collaborator Author

yhtang commented Dec 12, 2023

@nouiz
Collaborator

nouiz commented Jan 20, 2025

We don't test pax anymore.
