This repository contains all the files and instructions needed to deploy a streaming Automatic Speech Recognition (ASR) service on Amazon Elastic Kubernetes Service (EKS).
Make sure the tools used by the commands below are installed: the AWS CLI (aws), eksctl, kubectl, Helm, Docker, and Python.
Ensure the following policies are attached to your IAM user:
AmazonEBSCSIDriverPolicy
AmazonEKS_CNI_Policy
AmazonEKSBlockStoragePolicy
AmazonEKSClusterPolicy
AmazonEKSComputePolicy
AmazonEKSFargatePodExecutionRolePolicy
AmazonEKSLoadBalancingPolicy
AmazonEKSLocalOutpostClusterPolicy
AmazonEKSNetworkingPolicy
AmazonEKSServicePolicy
AmazonEKSVPCResourceController
AmazonEKSWorkerNodeMinimalPolicy
AmazonEKSWorkerNodePolicy
AmazonSSMManagedInstanceCore
AutoScalingFullAccess
AWSFaultInjectionSimulatorEKSAccess
| File/Folder | Description |
|---|---|
| asr.py | ASR model class. Modify this file to use your own ASR model. Implement predict(self, messages) method. |
| asr_server.py | FastAPI application endpoint, manages request queue and client pool. |
| audios/ | Sample audios for testing. |
| client_logs/ | Logging for client call results. |
| client.py | Simple script for sending streaming audios and receiving ASR transcription. |
| client.sh | Bash script for simulating multiple client calls. |
| config.py | Configuration for the ASR server and clients. |
| cluster.yaml | Defines EKS cluster parameters. |
| deployment.yaml | ASR pod deployment configuration. |
| Dockerfile | For building the ASR Docker image. |
| requirements.txt | Python dependencies for Docker image. |
| ebs-csi-policy.json | EBS CSI driver policy. |
| gp2-immediate.yaml | Custom persistent volume configuration. |
| hpa.yaml | KEDA-based HPA configuration. |
| iam_policy.json | IAM policy for load balancing. |
| ingress.yaml | ALB Ingress configuration. |
| pvc.yaml | Persistent Volume Claim configuration. |
| service-monitor.yaml | Prometheus service monitor config. |
| service.yaml | ASR service definition. |
| time-slicing-config-all.yaml | NVIDIA GPU time-sharing config. |
CLUSTER_NAME="asr-cluster"
REGION="eu-west-1"
VPC_ID="YOUR_VPC"
ACCOUNT_ID="YOUR_ACCOUNT_ID"
POLICY_NAME="AWSLoadBalancerControllerIAMPolicy"
SA_NAME="aws-load-balancer-controller"
Edit and adjust parameters in cluster.yaml as needed.
eksctl create cluster -f cluster.yaml
aws eks --region $REGION update-kubeconfig --name $CLUSTER_NAME
helm repo add eks https://aws.github.io/eks-charts
helm repo update
curl -o iam_policy.json https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json
aws iam create-policy \
--policy-name $POLICY_NAME \
--policy-document file://iam_policy.json
eksctl utils associate-iam-oidc-provider --region $REGION --cluster $CLUSTER_NAME --approve
eksctl create iamserviceaccount \
--cluster $CLUSTER_NAME \
--namespace kube-system \
--name $SA_NAME \
--attach-policy-arn arn:aws:iam::$ACCOUNT_ID:policy/$POLICY_NAME \
--approve
helm install $SA_NAME eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=$CLUSTER_NAME \
--set serviceAccount.create=false \
--set serviceAccount.name=$SA_NAME \
--set region=$REGION \
--set vpcId=$VPC_ID
The server and client code is already implemented. Modify these files only if needed:
- asr_server.py
- asr.py
- config.py
- client.py
- client.sh
We serve the FastConformer model from the NeMo ASR hub. To serve a different ASR model, modify asr.py.
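A replacement model only needs to expose the predict(self, messages) method named in the file table. The sketch below is hypothetical: the class name EchoASR, the assumption that messages is a list of raw PCM byte chunks, and the list-of-strings return value are illustrative guesses, not the repository's actual interface. Check asr.py for the real input and output types before adapting it.

```python
from typing import List


class EchoASR:
    """Hypothetical stand-in for the model class in asr.py."""

    def __init__(self, sample_rate: int = 16000, sample_width: int = 2) -> None:
        # Assumed audio format: 16 kHz, 16-bit (2-byte) mono PCM.
        self.sample_rate = sample_rate
        self.sample_width = sample_width

    def predict(self, messages: List[bytes]) -> List[str]:
        """Return one placeholder transcript per audio chunk.

        A real implementation would run the batch through an ASR model here.
        """
        transcripts = []
        for chunk in messages:
            seconds = len(chunk) / (self.sample_rate * self.sample_width)
            transcripts.append(f"<{seconds:.2f}s of audio>")
        return transcripts


if __name__ == "__main__":
    model = EchoASR()
    # 32000 bytes is one second of 16 kHz, 16-bit mono audio.
    print(model.predict([b"\x00" * 32000, b"\x00" * 16000]))
```

Keeping the model behind a single batch-oriented method means the server's queueing code does not need to change when the model is swapped.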
Set DOCKER_REPO to your own Docker repository if needed. To use the prebuilt image (mailong25/asr-server), skip this step.
DOCKER_REPO="mailong25/asr-server"
docker build -t $DOCKER_REPO .
docker push $DOCKER_REPO
Edit deployment.yaml, service.yaml, and ingress.yaml if needed.
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
kubectl get pods -A
- Locate the load balancer DNS name in the AWS Console.
- In config.py, set SERVER_URI to the load balancer DNS name.
- Run a test:
python client.py
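As an alternative to the AWS Console, the load balancer DNS name can usually be read from the Ingress status once the ALB has been provisioned. The sketch below parses the JSON printed by `kubectl get ingress -o json`. The helper name ingress_hostnames and the sample hostname are made up for illustration, and the status may be empty until the load balancer is ready.

```python
import json
from typing import List


def ingress_hostnames(ingress_list_json: str) -> List[str]:
    """Collect load balancer hostnames from `kubectl get ingress -o json` output."""
    doc = json.loads(ingress_list_json)
    hostnames = []
    for item in doc.get("items", []):
        load_balancer = item.get("status", {}).get("loadBalancer", {})
        # The "ingress" list is absent until the load balancer is provisioned.
        for entry in load_balancer.get("ingress") or []:
            if "hostname" in entry:
                hostnames.append(entry["hostname"])
    return hostnames


if __name__ == "__main__":
    # Made-up sample in the shape kubectl returns; not real output.
    sample = json.dumps({
        "items": [
            {"status": {"loadBalancer": {"ingress": [
                {"hostname": "k8s-example-0000.eu-west-1.elb.amazonaws.com"}
            ]}}},
            {"status": {"loadBalancer": {}}},
        ]
    })
    print(ingress_hostnames(sample))
```

In practice the function would be fed the captured output of the kubectl command, and the resulting hostname copied into SERVER_URI.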
Time-slicing lets multiple pods share a single NVIDIA GPU.
kubectl create -n kube-system -f time-slicing-config-all.yaml
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.14.1 \
--namespace kube-system \
--create-namespace \
--set config.name=time-slicing-config-all
kubectl scale deployment asr-deployment --replicas=2
kubectl get pods -A
eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$CLUSTER_NAME --approve
eksctl create iamserviceaccount \
--region $REGION \
--name ebs-csi-controller-sa \
--namespace kube-system \
--cluster $CLUSTER_NAME \
--attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--approve \
--role-only \
--role-name AmazonEKS_EBS_CSI_DriverRole
eksctl create addon --name aws-ebs-csi-driver --cluster $CLUSTER_NAME \
--service-account-role-arn arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/AmazonEKS_EBS_CSI_DriverRole --force
kubectl apply -f gp2-immediate.yaml
kubectl apply -f pvc.yaml
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack
Create a ServiceMonitor:
kubectl apply -f service-monitor.yaml
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090
To monitor the service, open http://localhost:9090 in a browser and run this query to check the average response latency:
avg(rate(response_latency_seconds_sum[2m]) / clamp_min(rate(response_latency_seconds_count[2m]), 1e-9))
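To make the query's arithmetic concrete, the sketch below reproduces it on hand-made counter samples. It is a simplification: rate() is approximated as (last - first) / window, ignoring Prometheus's extrapolation and counter-reset handling, and the sample values and function names are invented for illustration.

```python
from typing import List, Tuple

# Per pod: (latency_sum_start, latency_sum_end, count_start, count_end)
Sample = Tuple[float, float, float, float]


def clamp_min(value: float, minimum: float) -> float:
    """Mirror PromQL clamp_min: never return less than `minimum`."""
    return max(value, minimum)


def average_latency(samples: List[Sample], window_seconds: float) -> float:
    """Approximate avg(rate(sum[w]) / clamp_min(rate(count[w]), 1e-9))."""
    if not samples:
        raise ValueError("at least one pod sample is required")
    per_pod = []
    for sum_start, sum_end, count_start, count_end in samples:
        sum_rate = (sum_end - sum_start) / window_seconds
        count_rate = (count_end - count_start) / window_seconds
        # clamp_min avoids dividing by zero when a pod served no requests.
        per_pod.append(sum_rate / clamp_min(count_rate, 1e-9))
    return sum(per_pod) / len(per_pod)


if __name__ == "__main__":
    busy_pod = (10.0, 22.0, 100.0, 160.0)  # 12 s of latency over 60 requests
    idle_pod = (5.0, 5.0, 50.0, 50.0)      # no requests in the window
    print(average_latency([busy_pod], window_seconds=120))
    print(average_latency([busy_pod, idle_pod], window_seconds=120))
```

Note that an idle pod contributes a latency of 0 to the average, so the value reported across pods can be lower than the latency that busy pods actually observe.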
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda
kubectl apply -f hpa.yaml
Now spawn multiple clients to test whether the HPA works:
sh client.sh 5 5
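When watching the HPA react to this load, it helps to know how a replica count follows from a metric. The sketch below implements the basic formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The metric values and the min/max bounds are illustrative assumptions, not the settings in hpa.yaml, and the HPA's tolerance band and stabilization windows are left out.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Basic HPA scaling rule, clamped to the configured replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Latency at three times the target: scale from 2 to 6 replicas.
    print(desired_replicas(2, current_metric=0.75, target_metric=0.25))
    # Latency at half the target: scale from 6 down to 3 replicas.
    print(desired_replicas(6, current_metric=0.125, target_metric=0.25))
    # A large spike is capped at max_replicas.
    print(desired_replicas(4, current_metric=2.0, target_metric=0.25))
```

If pods stay Pending after the HPA raises the replica count, the cluster has run out of node capacity, which is the gap the Cluster Autoscaler setup that follows is meant to cover.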
aws iam create-policy \
--policy-name AmazonEKSClusterAutoscalerPolicy \
--policy-document file://<(cat <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:DescribeLaunchConfigurations",
"autoscaling:DescribeTags",
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"ec2:DescribeLaunchTemplateVersions"
],
"Resource": "*"
}
]
}
EOF
)
eksctl utils associate-iam-oidc-provider --region=$REGION --cluster=$CLUSTER_NAME --approve
eksctl create iamserviceaccount \
--cluster $CLUSTER_NAME \
--namespace kube-system \
--name cluster-autoscaler \
--attach-policy-arn arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKSClusterAutoscalerPolicy \
--approve \
--override-existing-serviceaccounts
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
--set autoDiscovery.clusterName=$CLUSTER_NAME \
--set awsRegion=$REGION \
--set rbac.serviceAccount.name=cluster-autoscaler \
--set rbac.serviceAccount.create=false \
--set fullnameOverride=cluster-autoscaler \
--set extraArgs.balance-similar-node-groups=true \
--set extraArgs.skip-nodes-with-system-pods=false \
--set extraArgs.scale-down-enabled=true \
--set extraArgs.scale-down-unneeded-time=2m \
--set extraArgs.scale-down-delay-after-delete=2m \
--set extraArgs.scale-down-delay-after-add=2m \
--set extraArgs.max-node-provision-time=15m \
--set extraArgs.scan-interval=1m
This will gradually spawn 40 clients to call the ASR server.
sh client.sh 40 30
The cluster should automatically scale up the number of pods and nodes during peak traffic, then gradually scale them down as the number of requests decreases. You can use Prometheus to monitor the request rate and latency reported by the pods:
- Total request rate (requests per second, averaged over the last minute; multiply by 60 for requests per minute):
sum(rate(response_latency_seconds_count[1m]))
- Average Response Latency:
avg(rate(response_latency_seconds_sum[1m]) / clamp_min(rate(response_latency_seconds_count[1m]), 1e-9))