diff --git a/ai-ml/jark-stack/terraform/addons.tf b/ai-ml/jark-stack/terraform/addons.tf index 2bd061e16..8ca077c60 100644 --- a/ai-ml/jark-stack/terraform/addons.tf +++ b/ai-ml/jark-stack/terraform/addons.tf @@ -167,7 +167,7 @@ module "eks_blueprints_addons" { } ], } - + #--------------------------------------- # CloudWatch metrics for EKS #--------------------------------------- diff --git a/ai-ml/jark-stack/terraform/helm-values/aws-cloudwatch-metrics-values.yaml b/ai-ml/jark-stack/terraform/helm-values/aws-cloudwatch-metrics-values.yaml index 3b19a5d18..ae3c41d44 100644 --- a/ai-ml/jark-stack/terraform/helm-values/aws-cloudwatch-metrics-values.yaml +++ b/ai-ml/jark-stack/terraform/helm-values/aws-cloudwatch-metrics-values.yaml @@ -8,4 +8,4 @@ resources: # This toleration allows Daemonset pod to be scheduled on any node, regardless of their Taints. tolerations: - - operator: Exists \ No newline at end of file + - operator: Exists diff --git a/website/docs/gen-ai/inference/GPUs/nvidia-nim-llama3.md b/website/docs/gen-ai/inference/GPUs/nvidia-nim-llama3.md index 42a158fb0..b531063ad 100644 --- a/website/docs/gen-ai/inference/GPUs/nvidia-nim-llama3.md +++ b/website/docs/gen-ai/inference/GPUs/nvidia-nim-llama3.md @@ -245,7 +245,7 @@ you will see similar output like the following It's time to test the Llama3 just deployed. First setup a simple environment for the testing. ```bash -cd gen-ai/inference/nvidia-nim/nim-client +cd data-on-eks/gen-ai/inference/nvidia-nim/nim-client python3 -m venv .venv source .venv/bin/activate pip install openai @@ -335,7 +335,7 @@ By applying these optimizations, TensorRT can significantly accelerate LLM infer Deploy the [Open WebUI](https://github.com/open-webui/open-webui) by running the following command: ```sh -kubectl apply -f gen-ai/inference/nvidia-nim/openai-webui-deployment.yaml +kubectl apply -f data-on-eks/gen-ai/inference/nvidia-nim/openai-webui-deployment.yaml ``` **2. Port Forward to Access WebUI** @@ -373,7 +373,7 @@ Enter your prompt, and you will see the streaming results, as shown below: GenAI-Perf can be used as standard tool to benchmark with other models deployed with inference server. But this tool requires a GPU. To make it easier, we provide you a pre-configured manifest `genaiperf-deploy.yaml` to run the tool. ```bash -cd gen-ai/inference/nvidia-nim +cd data-on-eks/gen-ai/inference/nvidia-nim kubectl apply -f genaiperf-deploy.yaml ``` diff --git a/website/docs/gen-ai/inference/GPUs/stablediffusion-gpus.md b/website/docs/gen-ai/inference/GPUs/stablediffusion-gpus.md index 99f759bab..0dea00bd3 100644 --- a/website/docs/gen-ai/inference/GPUs/stablediffusion-gpus.md +++ b/website/docs/gen-ai/inference/GPUs/stablediffusion-gpus.md @@ -121,7 +121,7 @@ aws eks --region us-west-2 update-kubeconfig --name jark-stack **Deploy RayServe Cluster** ```bash -cd ./../gen-ai/inference/stable-diffusion-rayserve-gpu +cd data-on-eks/gen-ai/inference/stable-diffusion-rayserve-gpu kubectl apply -f ray-service-stablediffusion.yaml ``` @@ -198,7 +198,7 @@ Let's move forward with setting up the Gradio app as a Docker container running First, lets build the docker container for the client app. 
```bash -cd ../gradio-ui +cd data-on-eks/gen-ai/inference/gradio-ui docker build --platform=linux/amd64 \ -t gradio-app:sd \ --build-arg GRADIO_APP="gradio-app-stable-diffusion.py" \ @@ -263,7 +263,7 @@ docker rmi gradio-app:sd **Step2:** Delete Ray Cluster ```bash -cd ../stable-diffusion-rayserve-gpu +cd data-on-eks/gen-ai/inference/stable-diffusion-rayserve-gpu kubectl delete -f ray-service-stablediffusion.yaml ``` @@ -271,6 +271,6 @@ kubectl delete -f ray-service-stablediffusion.yaml This script will cleanup the environment using `-target` option to ensure all the resources are deleted in correct order. ```bash -cd ../../../ai-ml/jark-stack/ +cd data-on-eks/ai-ml/jark-stack/ ./cleanup.sh ``` diff --git a/website/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer.md b/website/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer.md index bf5227685..630a7d693 100644 --- a/website/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer.md +++ b/website/docs/gen-ai/inference/GPUs/vLLM-NVIDIATritonServer.md @@ -333,7 +333,7 @@ kubectl -n triton-vllm port-forward svc/nvidia-triton-server-triton-inference-se Next, run the Triton client for each model using the same prompts: ```bash -cd gen-ai/inference/vllm-nvidia-triton-server-gpu/triton-client +cd data-on-eks/gen-ai/inference/vllm-nvidia-triton-server-gpu/triton-client python3 -m venv .venv source .venv/bin/activate pip install tritonclient[all] diff --git a/website/docs/gen-ai/inference/GPUs/vLLM-rayserve.md b/website/docs/gen-ai/inference/GPUs/vLLM-rayserve.md index d2fc10424..5092048da 100644 --- a/website/docs/gen-ai/inference/GPUs/vLLM-rayserve.md +++ b/website/docs/gen-ai/inference/GPUs/vLLM-rayserve.md @@ -238,7 +238,7 @@ You can test with your custom prompts by adding them to the `prompts.txt` file. To run the Python client application in a virtual environment, follow these steps: ```bash -cd gen-ai/inference/vllm-rayserve-gpu +cd data-on-eks/gen-ai/inference/vllm-rayserve-gpu python3 -m venv .venv source .venv/bin/activate pip install requests diff --git a/website/docs/gen-ai/inference/Neuron/Mistral-7b-inf2.md b/website/docs/gen-ai/inference/Neuron/Mistral-7b-inf2.md index 9d7a87e73..a896ab7e3 100644 --- a/website/docs/gen-ai/inference/Neuron/Mistral-7b-inf2.md +++ b/website/docs/gen-ai/inference/Neuron/Mistral-7b-inf2.md @@ -15,7 +15,7 @@ To generate a token in HuggingFace, log in using your HuggingFace account and cl ::: -# Deploying Mistral-7B-Instruct-v0.2 on Inferentia2, Ray Serve, Gradio +# Serving Mistral-7B-Instruct-v0.2 using Inferentia2, Ray Serve, Gradio This pattern outlines the deployment of the [Mistral-7B-Instruct-v0.2](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model on Amazon EKS, utilizing [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) for enhanced text generation performance. [Ray Serve](https://docs.ray.io/en/latest/serve/index.html) ensures efficient scaling of Ray Worker nodes, while [Karpenter](https://karpenter.sh/) dynamically manages the provisioning of AWS Inferentia2 nodes. This setup optimizes for high-performance and cost-effective text generation applications in a scalable cloud environment. 
Through this pattern, you will accomplish the following: @@ -121,7 +121,7 @@ To deploy the Mistral-7B-Instruct-v0.2 model, it's essential to configure your H export HUGGING_FACE_HUB_TOKEN=$(echo -n "Your-Hugging-Face-Hub-Token-Value" | base64) -cd ../../gen-ai/inference/mistral-7b-rayserve-inf2 +cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2 envsubst < ray-service-mistral.yaml| kubectl apply -f - ``` @@ -190,7 +190,7 @@ The following YAML script (`gen-ai/inference/mistral-7b-rayserve-inf2/gradio-ui. To deploy this, execute: ```bash -cd gen-ai/inference/mistral-7b-rayserve-inf2/ +cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2/ kubectl apply -f gradio-ui.yaml ``` @@ -242,7 +242,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou **Step1:** Delete Gradio App and mistral Inference deployment ```bash -cd gen-ai/inference/mistral-7b-rayserve-inf2 +cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2 kubectl delete -f gradio-ui.yaml kubectl delete -f ray-service-mistral.yaml ``` @@ -251,6 +251,6 @@ kubectl delete -f ray-service-mistral.yaml This script will cleanup the environment using `-target` option to ensure all the resources are deleted in correct order. ```bash -cd ../../../ai-ml/trainium-inferentia/ +cd data-on-eks/ai-ml/trainium-inferentia/ ./cleanup.sh ``` diff --git a/website/docs/gen-ai/inference/Neuron/llama2-inf2.md b/website/docs/gen-ai/inference/Neuron/llama2-inf2.md index fdeecfd89..892ad8324 100644 --- a/website/docs/gen-ai/inference/Neuron/llama2-inf2.md +++ b/website/docs/gen-ai/inference/Neuron/llama2-inf2.md @@ -1,7 +1,7 @@ --- title: Llama-2 on Inferentia2 sidebar_position: 4 -description: Deploy Llama-2 models on AWS Inferentia accelerators for efficient inference. +description: Serve Llama-2 models on AWS Inferentia accelerators for efficient inference. --- import CollapsibleContent from '../../../../src/components/CollapsibleContent'; @@ -23,7 +23,7 @@ We are actively enhancing this blueprint to incorporate improvements in observab ::: -# Deploying Llama-2-13b Chat Model with Inferentia, Ray Serve and Gradio +# Serving Llama-2-13b Chat Model with Inferentia, Ray Serve and Gradio Welcome to the comprehensive guide on deploying the [Meta Llama-2-13b chat](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). In this tutorial, you will not only learn how to harness the power of Llama-2, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [trn1/inf2](https://aws.amazon.com/machine-learning/neuron/) (powered by AWS Trainium and Inferentia) instances, such as `inf2.24xlarge` and `inf2.48xlarge`, which are optimized for deploying and scaling large language models. 
@@ -158,7 +158,7 @@ aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia **Deploy RayServe Cluster** ```bash -cd gen-ai/inference/llama2-13b-chat-rayserve-inf2 +cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2 kubectl apply -f ray-service-llama2.yaml ``` @@ -282,7 +282,7 @@ The following YAML script (`gen-ai/inference/llama2-13b-chat-rayserve-inf2/gradi To deploy this, execute: ```bash -cd gen-ai/inference/llama2-13b-chat-rayserve-inf2/ +cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2/ kubectl apply -f gradio-ui.yaml ``` @@ -330,7 +330,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou **Step1:** Delete Gradio App and Llama2 Inference deployment ```bash -cd gen-ai/inference/llama2-13b-chat-rayserve-inf2 +cd data-on-eks/gen-ai/inference/llama2-13b-chat-rayserve-inf2 kubectl delete -f gradio-ui.yaml kubectl delete -f ray-service-llama2.yaml ``` @@ -339,6 +339,6 @@ kubectl delete -f ray-service-llama2.yaml This script will cleanup the environment using `-target` option to ensure all the resources are deleted in correct order. ```bash -cd ai-ml/trainium-inferentia +cd data-on-eks/ai-ml/trainium-inferentia ./cleanup.sh ``` diff --git a/website/docs/gen-ai/inference/Neuron/llama3-inf2.md b/website/docs/gen-ai/inference/Neuron/llama3-inf2.md index daef298c7..8fbdbe7ee 100644 --- a/website/docs/gen-ai/inference/Neuron/llama3-inf2.md +++ b/website/docs/gen-ai/inference/Neuron/llama3-inf2.md @@ -1,7 +1,7 @@ --- title: Llama-3-8B on Inferentia2 sidebar_position: 3 -description: Deploy Llama-3 models on AWS Inferentia accelerators for efficient inference. +description: Serve Llama-3 models on AWS Inferentia accelerators for efficient inference. --- import CollapsibleContent from '../../../../src/components/CollapsibleContent'; @@ -23,7 +23,7 @@ We are actively enhancing this blueprint to incorporate improvements in observab ::: -# Deploying Llama-3-8B Instruct Model with Inferentia, Ray Serve and Gradio +# Serving Llama-3-8B Instruct Model with Inferentia, Ray Serve and Gradio Welcome to the comprehensive guide on deploying the [Meta Llama-3-8B Instruct](https://ai.meta.com/llama/#inside-the-model) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). @@ -158,7 +158,7 @@ To deploy the llama3-8B-Instruct model, it's essential to configure your Hugging export HUGGING_FACE_HUB_TOKEN= -cd ./../gen-ai/inference/llama3-8b-rayserve-inf2 +cd data-on-eks/gen-ai/inference/llama3-8b-rayserve-inf2 envsubst < ray-service-llama3.yaml| kubectl apply -f - ``` @@ -244,7 +244,7 @@ Let's move forward with setting up the Gradio app as a Docker container running First, lets build the docker container for the client app. ```bash -cd ../gradio-ui +cd data-on-eks/gen-ai/inference/gradio-ui docker build --platform=linux/amd64 \ -t gradio-app:llama \ --build-arg GRADIO_APP="gradio-app-llama.py" \ @@ -298,7 +298,7 @@ docker rmi gradio-app:llama **Step2:** Delete Ray Cluster ```bash -cd ../llama3-8b-instruct-rayserve-inf2 +cd data-on-eks/gen-ai/inference/llama3-8b-instruct-rayserve-inf2 kubectl delete -f ray-service-llama3.yaml ``` @@ -306,6 +306,6 @@ kubectl delete -f ray-service-llama3.yaml This script will cleanup the environment using `-target` option to ensure all the resources are deleted in correct order. 
```bash -cd ../../../ai-ml/trainium-inferentia/ +cd data-on-eks/ai-ml/trainium-inferentia/ ./cleanup.sh ``` diff --git a/website/docs/gen-ai/inference/Neuron/rayserve-ha.md b/website/docs/gen-ai/inference/Neuron/rayserve-ha.md index 963b93064..3c81a61bc 100644 --- a/website/docs/gen-ai/inference/Neuron/rayserve-ha.md +++ b/website/docs/gen-ai/inference/Neuron/rayserve-ha.md @@ -66,7 +66,7 @@ export TF_VAR_enable_rayserve_ha_elastic_cache_redis=true Then, run the `install.sh` script to install the EKS cluster with KubeRay operator and other add-ons. ```bash -cd ai-ml/trainimum-inferentia +cd data-on-eks/ai-ml/trainimum-inferentia ./install.sh ``` @@ -135,7 +135,7 @@ With the above `RayService` configuration, we have enabled GCS fault tolerance f Let's apply the above `RayService` configuration and check the behavior. ```bash -cd ../../gen-ai/inference/ +cd data-on-eks/gen-ai/inference/ envsubst < mistral-7b-rayserve-inf2/ray-service-mistral-ft.yaml| kubectl apply -f - ``` @@ -202,7 +202,7 @@ Finally, we'll provide instructions for cleaning up and deprovisioning the resou **Step1:** Delete Gradio App and mistral Inference deployment ```bash -cd gen-ai/inference/mistral-7b-rayserve-inf2 +cd data-on-eks/gen-ai/inference/mistral-7b-rayserve-inf2 kubectl delete -f gradio-ui.yaml kubectl delete -f ray-service-mistral-ft.yaml ``` @@ -211,6 +211,6 @@ kubectl delete -f ray-service-mistral-ft.yaml This script will cleanup the environment using `-target` option to ensure all the resources are deleted in correct order. ```bash -cd ../../../ai-ml/trainium-inferentia/ +cd data-on-eks/ai-ml/trainium-inferentia/ ./cleanup.sh ``` diff --git a/website/docs/gen-ai/inference/Neuron/stablediffusion-inf2.md b/website/docs/gen-ai/inference/Neuron/stablediffusion-inf2.md index b3f5ac32c..b09b6b3b5 100644 --- a/website/docs/gen-ai/inference/Neuron/stablediffusion-inf2.md +++ b/website/docs/gen-ai/inference/Neuron/stablediffusion-inf2.md @@ -14,7 +14,7 @@ This example blueprint deploys a `stable-diffusion-xl-base-1-0` model on Inferen ::: -# Deploying Stable Diffusion XL Base Model with Inferentia, Ray Serve and Gradio +# Serving Stable Diffusion XL Base Model with Inferentia, Ray Serve and Gradio Welcome to the comprehensive guide on deploying the [Stable Diffusion XL Base](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) model on Amazon Elastic Kubernetes Service (EKS) using [Ray Serve](https://docs.ray.io/en/latest/serve/index.html). In this tutorial, you will not only learn how to harness the power of Stable Diffusion models, but also gain insights into the intricacies of deploying large language models (LLMs) efficiently, particularly on [trn1/inf2](https://aws.amazon.com/machine-learning/neuron/) (powered by AWS Trainium and Inferentia) instances, such as `inf2.24xlarge` and `inf2.48xlarge`, which are optimized for deploying and scaling large language models. @@ -135,7 +135,7 @@ aws eks --region us-west-2 update-kubeconfig --name trainium-inferentia **Deploy RayServe Cluster** ```bash -cd ../../gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2 +cd data-on-eks/gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2 kubectl apply -f ray-service-stablediffusion.yaml ``` @@ -217,7 +217,7 @@ Let's move forward with setting up the Gradio app as a Docker container running First, lets build the docker container for the client app. 
```bash -cd ../gradio-ui +cd data-on-eks/gen-ai/inference/gradio-ui docker build --platform=linux/amd64 \ -t gradio-app:sd \ --build-arg GRADIO_APP="gradio-app-stable-diffusion.py" \ @@ -276,7 +276,7 @@ docker rmi gradio-app:sd **Step2:** Delete Ray Cluster ```bash -cd ../stable-diffusion-xl-base-rayserve-inf2 +cd data-on-eks/gen-ai/inference/stable-diffusion-xl-base-rayserve-inf2 kubectl delete -f ray-service-stablediffusion.yaml ``` @@ -284,6 +284,6 @@ kubectl delete -f ray-service-stablediffusion.yaml This script will cleanup the environment using `-target` option to ensure all the resources are deleted in correct order. ```bash -cd ../../../ai-ml/trainium-inferentia/ +cd data-on-eks/ai-ml/trainium-inferentia/ ./cleanup.sh ``` diff --git a/website/docs/gen-ai/inference/Neuron/vllm-ray-inf2.md b/website/docs/gen-ai/inference/Neuron/vllm-ray-inf2.md index c2ca2c815..9e9254001 100644 --- a/website/docs/gen-ai/inference/Neuron/vllm-ray-inf2.md +++ b/website/docs/gen-ai/inference/Neuron/vllm-ray-inf2.md @@ -163,7 +163,7 @@ Having deployed the EKS cluster with all the necessary components, we can now pr This will apply the RayService configuration and deploy the cluster on your EKS setup. ```bash -cd ../../gen-ai/inference/vllm-rayserve-inf2 +cd data-on-eks/gen-ai/inference/vllm-rayserve-inf2 kubectl apply -f vllm-rayserve-deployment.yaml ``` @@ -258,7 +258,7 @@ kubectl -n vllm port-forward svc/vllm-llama3-inf2-serve-svc 8000:8000 To run the Python client application in a virtual environment, follow these steps: ```bash -cd gen-ai/inference/vllm-rayserve-inf2 +cd data-on-eks/gen-ai/inference/vllm-rayserve-inf2 python3 -m venv .venv source .venv/bin/activate pip3 install openai @@ -588,13 +588,6 @@ Each of these files contain the following Performance Benchmarking Metrics: ```results_number_output_tokens_*```: Number of output tokens in the requests (Output length) -## Cleanup - -To remove all resources created by this deployment, run: - -```bash -./cleanup.sh -``` ## Conclusion In summary, when it comes to deploying and scaling Llama-3, AWS Trn1/Inf2 instances offer a compelling advantage. @@ -615,7 +608,7 @@ kubectl delete -f vllm-rayserve-deployment.yaml Destroy the EKS Cluster and resources ```bash -cd ../../../ai-ml/trainium-inferentia/ +cd data-on-eks/ai-ml/trainium-inferentia/ ./cleanup.sh ``` diff --git a/website/docs/resources/binpacking-custom-scheduler-eks.md b/website/docs/resources/binpacking-custom-scheduler-eks.md index 85b5319ec..a3d649460 100644 --- a/website/docs/resources/binpacking-custom-scheduler-eks.md +++ b/website/docs/resources/binpacking-custom-scheduler-eks.md @@ -10,9 +10,9 @@ sidebar_label: Bin packing for Amazon EKS In this post, we will show you how to enable a custom scheduler with Amazon EKS when running DoEKS especially for Spark on EKS, including OSS Spark and EMR on EKS. The custom scheduler is a custom Kubernetes scheduler with ```MostAllocated``` strategy running in data plane. ### Why bin packing -By default, the [scheduling-plugin](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins) NodeResourcesFit use the ```LeastAllocated``` for score strategies. For the long running workloads, that is good because of high availability. But for batch jobs, like Spark workloads, this would lead high cost. By changing the from ```LeastAllocated``` to ```MostAllocated```, it avoids spreading pods across all running nodes, leading to higher resource utilization and better cost efficiency. 
+By default, the [scheduling-plugin](https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins) NodeResourcesFit uses the ```LeastAllocated``` scoring strategy. That is a good fit for long-running workloads because it favors high availability, but for batch jobs such as Spark workloads it leads to higher cost. Changing the strategy from ```LeastAllocated``` to ```MostAllocated``` avoids spreading pods across all running nodes, leading to higher resource utilization and better cost efficiency. -Batch jobs like Spark are running on demand with limited or predicted time. With ```MostAllocated``` strategy, Spark executors are always bin packing into one node util the node can not host any pods. You can see the following picture shows the +Batch jobs like Spark run on demand for a limited or predictable time. With the ```MostAllocated``` strategy, Spark executors are bin packed onto one node until that node cannot host any more pods. The following picture shows the ```MostAllocated``` in EMR on EKS. @@ -71,12 +71,12 @@ spec: volumes: - name: spark-local-dir-1 hostPath: - path: /local1 - initContainers: + path: /local1 + initContainers: - name: volume-permission image: public.ecr.aws/docker/library/busybox # grant volume access to hadoop user - command: ['sh', '-c', 'if [ ! -d /data1 ]; then mkdir /data1;fi; chown -R 999:1000 /data1'] + command: ['sh', '-c', 'if [ ! -d /data1 ]; then mkdir /data1;fi; chown -R 999:1000 /data1'] volumeMounts: - name: spark-local-dir-1 mountPath: /data1
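Reviewer note: the ```MostAllocated``` behavior discussed in the bin-packing doc is configured through the scheduler's ```NodeResourcesFit``` plugin arguments. Below is a minimal sketch of a `KubeSchedulerConfiguration` for a secondary scheduler running in the data plane; the scheduler name `bin-packing-scheduler` and the resource weights are illustrative assumptions, not values taken from this repository.

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
profiles:
  - schedulerName: bin-packing-scheduler   # illustrative name; pods opt in via spec.schedulerName
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated            # score nodes higher the fuller they already are
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```

Spark driver and executor pods would then opt into this scheduler by setting `schedulerName: bin-packing-scheduler` in their pod spec (or, for recent Spark versions, via `spark.kubernetes.scheduler.name`).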