[DO NOT MERGE] AIoEKS Blueprint Consolidation #751

Draft: wants to merge 13 commits into base: main

omrishiv (Collaborator) commented Feb 12, 2025

What does this PR do?

This PR lays out the infrastructure foundation for AIoEKS (AI on EKS). It aims to create a single infrastructure deployment that can be customized for different use cases, enabling advanced use of the AI environment while still highlighting purpose-built blueprints.

Motivation

The current approach to blueprints produces very isolated environments that each showcase a single task: deploy model X into EKS, deploy MLflow, deploy JupyterHub, etc. The isolation is useful, but it hurts maintainability, because every blueprint has to be updated whenever an addon or the underlying infrastructure changes.

This PR aims to consolidate the core infrastructure and addons of all of the DoEKS blueprints and set the foundation for a configurable AI/ML environment based on needs and best practices. This will increase maintainability, allow for better customization, and make it easier to add functionality.

Contributing

We need help retesting the existing blueprints and deployments to make sure they work in the current environment.

  • BioNeMo
  • EMR Spark Rapids
  • Ray
  • Ray HA using ElastiCache
  • Trainium
  • JupyterHub
  • JARK stack

If you are interested in helping, please reach out before you start so we can make sure no one else is working on it.

To contribute, please create a branch from this branch in your fork and open PRs against this branch. We will merge into this branch as work is validated, and then merge this branch in its entirety back into DoEKS.

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E Test successfully complete before merge?

Additional Notes

Changelog

combined all ai/ml blueprints into one infrastructure

  • jark
  • inferentia/trainium
  • fsx driver + fsx volume
  • mlflow
  • all addons are now toggleable in variables
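
The idea of one shared deployment customized through addon toggles can be sketched roughly as below. This is an illustration only, not the PR's implementation: the flag names (`enable_mlflow`, `enable_fsx_driver`, `enable_neuron_monitor`, `enable_dcgm`) and the two profiles are hypothetical, chosen from addon names mentioned in this changelog, and the PR's actual variable names are not shown here.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AddonToggles:
    """One boolean per addon. All flag names are hypothetical."""

    enable_mlflow: bool = False
    enable_fsx_driver: bool = False
    enable_neuron_monitor: bool = False
    enable_dcgm: bool = False

    def enabled(self) -> list[str]:
        """Return the names of the addons that are switched on, sorted."""
        return sorted(
            name.removeprefix("enable_")
            for name, is_on in asdict(self).items()
            if is_on
        )


# A single shared base with everything off by default.
BASE = AddonToggles()

# Illustrative profiles derived from the base by flipping toggles.
# Which addons a real blueprint enables is an assumption here.
gpu_profile = replace(BASE, enable_dcgm=True)
neuron_profile = replace(BASE, enable_neuron_monitor=True, enable_fsx_driver=True)

if __name__ == "__main__":
    print("gpu profile:", gpu_profile.enabled())
    print("neuron profile:", neuron_profile.enabled())
```

Each profile only records what differs from the shared base, so a change to the base (for example a new addon with a default) reaches every profile without editing them one by one.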

fixed:

  • GPU-only pods now schedule only on GPU nodes
  • accelerator nodes are now labeled with their accelerator (neuron/nvidia)
  • removed loadBalancer from argo workflows
  • EFS now uses efs-csi-driver, not the NFS tool (broken on Bottlerocket)
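
As a rough sketch of how a workload might use the accelerator labels, the snippet below builds a Pod manifest that is pinned to nodes with a matching label and requests the corresponding device resource. The label key `example.com/accelerator`, the helper `accelerator_pod`, and the image name are hypothetical: the PR states that nodes are labeled with `neuron` or `nvidia` but does not show the label key, and it may enforce scheduling differently (for example with taints and tolerations). The resource names `nvidia.com/gpu` and `aws.amazon.com/neuron` are the ones advertised by the NVIDIA and AWS Neuron device plugins.

```python
import json

# Hypothetical label key; the PR does not show the key it actually uses.
ACCELERATOR_LABEL = "example.com/accelerator"

# Extended resource names exposed by the NVIDIA and AWS Neuron device plugins.
RESOURCE_NAMES = {
    "nvidia": "nvidia.com/gpu",
    "neuron": "aws.amazon.com/neuron",
}


def accelerator_pod(name: str, image: str, accelerator: str, count: int = 1) -> dict:
    """Build a Pod manifest that targets nodes labeled with the given accelerator."""
    if accelerator not in RESOURCE_NAMES:
        raise ValueError(f"unknown accelerator: {accelerator!r}")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Only nodes carrying this label value are eligible for the pod.
            "nodeSelector": {ACCELERATOR_LABEL: accelerator},
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Requesting the device resource also keeps the pod off
                    # nodes that do not advertise that device.
                    "resources": {"limits": {RESOURCE_NAMES[accelerator]: count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # The image name is a placeholder, not a real artifact from the PR.
    pod = accelerator_pod("trainer", "example.com/trainer:latest", "nvidia")
    print(json.dumps(pod, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since Kubernetes accepts JSON manifests as well as YAML.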

added:

  • neuron-monitor
  • dcgm

omrishiv (Collaborator, Author) commented Feb 12, 2025

Addresses #720, #729, #727

@omrishiv omrishiv mentioned this pull request Feb 14, 2025
5 tasks