merge x86 and ARM for pax #338
Comments
terrykong added a commit that referenced this issue Dec 1, 2023
…371)

# Summary

- All Python packages, except for a few build dependencies, are now installed using **pip-tools**.
- The JAX and upstream T5X/PAX containers are now built in a two-stage procedure:
  1. The **'meal kit'** stage: source packages are downloaded, wheels built if necessary (for TE, tensorflow-text, lingvo, etc.), but **no** package is installed. Instead, manifest files are created in the `/opt/pip-tools.d` folder to instruct which packages shall be installed by pip-tools. The stage is named due to its similarity in how ingredients in a meal kit are prepared while deferring the final cooking step.
  2. The **'final'** (cooking🔥) stage: this is when pip-tools collectively compile the manifests from the various container layers and then sync-install everything to exactly match the resolved versions.
- Note that downstream containers will **build on top of the meal kit image of its base container**, thus ensuring all packages and dependencies are installed exactly once to avoid conflicts and image bloating.
- The meal kit and final images are published as
  - mealkit: `ghcr.io/nvidia/image:mealkit` and `ghcr.io/nvidia/image:mealkit-YYYY-MM-DD`
  - final: `ghcr.io/nvidia/image:latest` and `ghcr.io/nvidia/image:nightly-YYYY-MM-DD`

# Additional changes to the workflows

- `/opt/jax-source` is renamed to `/opt/jax`. The `-source` suffix is only added to packages that needs compilation, e.g. XLA and TE.
- The CI workflow is now matricized against CPU arch.
- The reusable `_build_*.yaml` workflows are simplified to build only one image for a single architecture at a time. The logic for creating multi-arch images is relocated into the `_publish_container.yaml` workflows and involved during the nightly runs only.
- TE is now built as a wheel and shipped in the JAX core meal kit image.
- TE unit tests will be performed using the upstream-pax image due to the dependency on praxis.
- Build workflows now produce sitreps following the paradigm of #229.
- Removed the various one-off workflows for pinned CUDA/JAX versions.
- Refactored the PAX arm64 Dockerfile in preparation for #338

# What remains to be done

- [ ] Update the Rosetta container build + test process to use the upstream T5X/PAX mealkit (ghcr.io/nvidia/upstream-t5x:mealkit, ghcr.io/nvidia/upstream-pax:mealkit) containers

# Reviewing tips

This PR requires a multitude of reviewers due to its size and scope. I'd truly appreciate code owners to review any changes related to their previous contributions. An incomplete list of reviewer-scope is:

- @terrykong, @ashors1, @sharathts, @maanug-nv: Rosetta, TE, T5X and PAX MGMN tests
- @nouiz: JAX, TE and T5X build
- @joker-eph: PAX arm64 build
- @nluehr: Base image, NCCL, PAX
- @DwarKapex: base/JAX/XLA build, workflow logic

Closes #223
Closes #230
Closes #231
Closes #232
Closes #233
Closes #271
Fixes #328
Fixes #337

Co-authored-by: Terry Kong <terryk@nvidia.com>

---------

Co-authored-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Vladislav Kozlov <vkozlov@nvidia.com>
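For readers trying to follow the tag scheme in the commit message above, here is a minimal Python sketch of how the four published tags relate to an image name and a build date. The function name `image_tags` and the `REGISTRY` constant are hypothetical, introduced only for illustration; the repository's actual workflows are not shown in this thread and may compute the tags differently.

```python
from datetime import date

# Registry prefix as it appears in the commit message.
REGISTRY = "ghcr.io/nvidia"


def image_tags(image: str, build_date: date) -> dict[str, list[str]]:
    """Return the mealkit and final tags for one image (hypothetical helper).

    Mirrors the naming listed in the commit message:
      mealkit: <image>:mealkit and <image>:mealkit-YYYY-MM-DD
      final:   <image>:latest  and <image>:nightly-YYYY-MM-DD
    """
    stamp = build_date.isoformat()  # ISO format is YYYY-MM-DD
    base = f"{REGISTRY}/{image}"
    return {
        "mealkit": [f"{base}:mealkit", f"{base}:mealkit-{stamp}"],
        "final": [f"{base}:latest", f"{base}:nightly-{stamp}"],
    }


if __name__ == "__main__":
    for stage, tags in image_tags("upstream-pax", date(2023, 12, 1)).items():
        print(stage, tags)
```

Under this reading, a downstream container would start from the `mealkit` tag of its base image rather than from `latest`, so that packages are resolved and installed only once in the final stage.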
We don't test pax anymore.
No description provided.