ElasticDL is a Kubernetes-native machine learning framework. This document explains how to run an ElasticDL job on a public cloud, namely, Google Kubernetes Engine (GKE).
First, we create a new project for ElasticDL in the web console and a new Kubernetes cluster under this project. We will use the project ID and cluster name in the next steps.
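If you prefer the command line to the web console, a cluster can also be created with gcloud once the Google Cloud SDK (introduced below) is available. The following is only a minimal sketch; the cluster name, zone, node count, and machine type are example values to adjust.
# Placeholders: substitute your own project ID, cluster name, zone, and node shape.
gcloud config set project ${your_project_id}
gcloud container clusters create cluster-1 \
    --zone us-central1-c \
    --num-nodes 3 \
    --machine-type n1-standard-4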
To access GKE from a local computer, we need to install the Google Cloud SDK, which includes command-line tools such as gcloud.
Luckily, Google Cloud also provides Cloud Shell, which comes with gcloud pre-installed.
In this tutorial, we use Cloud Shell to access the Kubernetes cluster.
We run the following commands in Cloud Shell:
export PROJECT_ID=${your_project_id}
gcloud container clusters get-credentials cluster-1 --zone us-central1-c --project ${PROJECT_ID}
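To confirm that kubectl now points at the cluster, a quick sanity check is to list its nodes:
# Should list the nodes of cluster-1.
kubectl get nodes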
ElasticDL jobs need permission to create and delete pods. Make sure these permissions have been granted to the default service account, or to whichever service account the job uses.
export CODE_PATH=${your_code_dir}
cd ${CODE_PATH} && git clone https://github.com/sql-machine-learning/elasticdl.git
cd ${CODE_PATH}/elasticdl
kubectl apply -f elasticdl/manifests/elasticdl-rbac.yaml
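To double-check that the permissions are in place, kubectl auth can-i is handy. The commands below assume the job's pods run under the default service account in the default namespace; adjust the service-account name if your RBAC setup binds a different one.
# Impersonate the service account and ask whether it may create and delete pods.
kubectl auth can-i create pods --as=system:serviceaccount:default:default
kubectl auth can-i delete pods --as=system:serviceaccount:default:default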
ElasticDL supports elastic scheduling and works well with the priority-based scheduling of Kubernetes. We create two customized PriorityClasses in the cluster, high and low.
high.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 1000000
globalDefault: false
low.yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 1000
globalDefault: false
kubectl create -f high.yaml
kubectl create -f low.yaml
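We can verify that the two classes were created:
# Should list "high" and "low" along with the built-in system priority classes.
kubectl get priorityclasses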
First, we create a Cloud Filestore instance in the web console.
Then we follow the GKE documentation to access file shares from the Kubernetes cluster.
In this example, we create a persistent volume claim named fileserver-claim.
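For reference, below is a minimal sketch of the PersistentVolume and PersistentVolumeClaim described in that documentation. The share name (vol1), server IP address, and capacity are placeholders; use the values shown on your Filestore instance page.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fileserver
spec:
  capacity:
    storage: 1Ti
  accessModes:
  - ReadWriteMany
  nfs:
    # Placeholders: the file share name and IP address of the Filestore instance.
    path: /vol1
    server: 10.0.0.2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fileserver-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""
  volumeName: fileserver
  resources:
    requests:
      storage: 1Ti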
Step 1: We generate MNIST training and evaluation data in RecordIO format.
docker run --rm -it \
-v $HOME/.keras:/root/.keras \
-v $PWD:/work \
-w /work \
elasticdl/elasticdl:dev bash -c "/scripts/gen_dataset.sh data"
The dataset in RecordIO format will be generated in the data directory.
Step 2: We launch a pod that mounts the volume, and use the kubectl cp command to copy the MNIST dataset from the local directory to the volume.
kubectl create -f my-pod.yaml
kubectl cp data/mnist my-pod:/data
my-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: test-pod
    image: nginx:1.7.9
    volumeMounts:
    - mountPath: /data
      name: mypvc
  volumes:
  - name: mypvc
    persistentVolumeClaim:
      claimName: fileserver-claim
      readOnly: false
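Optionally, we can check that the dataset is on the shared volume, and delete the helper pod once the copy is done:
# List the copied dataset on the volume, then remove the helper pod.
kubectl exec my-pod -- ls /data/mnist
kubectl delete pod my-pod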
Please refer to the elasticdl_local tutorial for more details. The difference is that we have to push the image to the Google Container Registry (gcr.io).
pip install elasticdl-client
cd ${CODE_PATH}/elasticdl/model_zoo
elasticdl zoo init
elasticdl zoo build --image=gcr.io/${PROJECT_ID}/elasticdl:mnist .
elasticdl zoo push gcr.io/${PROJECT_ID}/elasticdl:mnist
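If the push fails with an authentication error, Docker may first need to be configured to use gcloud credentials for gcr.io:
# Register gcloud as the Docker credential helper for gcr.io registries.
gcloud auth configure-docker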
We launch a training job with 2 parameter server (PS) pods and 4 worker pods. The master and PS pods are assigned the high priority class, while the worker pods are assigned the low priority class.
elasticdl train \
--image_name=gcr.io/${PROJECT_ID}/elasticdl:mnist \
--model_zoo=model_zoo \
--model_def=mnist.mnist_functional_api.custom_model \
--training_data=/data/mnist/train \
--validation_data=/data/mnist/test \
--num_epochs=5 \
--master_resource_request="cpu=2,memory=2048Mi" \
--master_resource_limit="cpu=2,memory=2048Mi" \
--master_pod_priority=high \
--worker_resource_request="cpu=2,memory=2048Mi" \
--worker_resource_limit="cpu=2,memory=2048Mi" \
--worker_pod_priority=low \
--ps_resource_request="cpu=2,memory=2048Mi" \
--ps_resource_limit="cpu=2,memory=2048Mi" \
--ps_pod_priority=high \
--minibatch_size=64 \
--num_minibatches_per_task=64 \
--num_ps_pods=2 \
--num_workers=4 \
--evaluation_steps=200 \
--grads_to_wait=1 \
--job_name=test-mnist \
--log_level=INFO \
--image_pull_policy=Always \
--volume="mount_path=/data,claim_name=fileserver-claim" \
--distribution_strategy=ParameterServerStrategy
To see the status of each pod:
kubectl get pods
To see the loss in a worker pod:
kubectl logs elasticdl-test-mnist-worker-0 | grep "Loss"
To see the evaluation metrics in the master pod:
kubectl logs elasticdl-test-mnist-master | grep "Evaluation"
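To follow the master pod's log continuously instead of grepping a one-off snapshot, kubectl logs can also stream:
# Stream the master pod's log until interrupted.
kubectl logs -f elasticdl-test-mnist-master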
ElasticDL supports fault tolerance in distributed training. If a worker pod is killed, the training job does not crash; the master pod launches a replacement worker pod.
At first, all pods are running:
elasticdl-test-mnist-master 1/1 Running 0 35s
elasticdl-test-mnist-ps-0 1/1 Running 0 29s
elasticdl-test-mnist-ps-1 1/1 Running 0 28s
elasticdl-test-mnist-worker-0 1/1 Running 0 28s
elasticdl-test-mnist-worker-1 1/1 Running 0 28s
elasticdl-test-mnist-worker-2 1/1 Running 0 28s
elasticdl-test-mnist-worker-3 1/1 Running 0 28s
Then, we delete a worker pod:
kubectl delete pod elasticdl-test-mnist-worker-0
The master pod immediately creates a new worker pod, elasticdl-test-mnist-worker-4:
NAME READY STATUS RESTARTS AGE
elasticdl-test-mnist-master 1/1 Running 0 51s
elasticdl-test-mnist-ps-0 1/1 Running 0 45s
elasticdl-test-mnist-ps-1 1/1 Running 0 44s
elasticdl-test-mnist-worker-1 1/1 Running 0 44s
elasticdl-test-mnist-worker-2 1/1 Running 0 44s
elasticdl-test-mnist-worker-3 1/1 Running 0 44s
elasticdl-test-mnist-worker-4 1/1 Running 0 6s
After launching the MNIST training job, we launch an nginx deployment with high priority in the same cluster.
nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-nginx
  labels:
    app: nginx
spec:
  replicas: 5
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        imagePullPolicy: Always
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: 2
            memory: 2048Mi
            ephemeral-storage: 1024Mi
          requests:
            cpu: 2
            memory: 2048Mi
            ephemeral-storage: 1024Mi
      priorityClassName: high
      restartPolicy: Always
kubectl create -f nginx.yaml
We will find that some worker pods with low priority are preempted by nginx pods with high priority.
NAME READY STATUS RESTARTS AGE
elasticdl-test-mnist-master 1/1 Running 0 34s
elasticdl-test-mnist-ps-0 1/1 Running 0 27s
elasticdl-test-mnist-ps-1 1/1 Running 0 27s
elasticdl-test-mnist-worker-0 1/1 Running 0 27s
elasticdl-test-mnist-worker-1 1/1 Terminating 0 27s
elasticdl-test-mnist-worker-2 1/1 Terminating 0 27s
elasticdl-test-mnist-worker-3 1/1 Terminating 0 26s
test-nginx-7585fc5976-5hs7h 1/1 Running 0 2s
test-nginx-7585fc5976-9s4nx 1/1 Running 0 2s
test-nginx-7585fc5976-bf2th 0/1 Pending 0 2s
test-nginx-7585fc5976-ckd94 0/1 Pending 0 2s
test-nginx-7585fc5976-ss8pk 0/1 Pending 0 2s
After preemption, the training job still goes on with one worker pod.
elasticdl-test-mnist-master 1/1 Running 0 61s
elasticdl-test-mnist-ps-0 1/1 Running 0 54s
elasticdl-test-mnist-ps-1 1/1 Running 0 54s
elasticdl-test-mnist-worker-0 1/1 Running 0 54s
elasticdl-test-mnist-worker-4 0/1 Pending 0 26s
elasticdl-test-mnist-worker-5 0/1 Pending 0 26s
elasticdl-test-mnist-worker-6 0/1 Pending 0 26s
test-nginx-7585fc5976-5hs7h 1/1 Running 0 29s
test-nginx-7585fc5976-9s4nx 1/1 Running 0 29s
test-nginx-7585fc5976-bf2th 1/1 Running 0 29s
test-nginx-7585fc5976-ckd94 1/1 Running 0 29s
test-nginx-7585fc5976-ss8pk 1/1 Running 0 29s
Then, we scale the nginx deployment down to 1 replica. Some cluster resources are freed.
kubectl scale deployment.v1.apps/test-nginx --replicas=1
We find that the training job takes over the freed resources and continues with 4 worker pods.
NAME READY STATUS RESTARTS AGE
elasticdl-test-mnist-master 1/1 Running 0 2m3s
elasticdl-test-mnist-ps-0 1/1 Running 0 116s
elasticdl-test-mnist-ps-1 1/1 Running 0 116s
elasticdl-test-mnist-worker-0 1/1 Running 0 116s
elasticdl-test-mnist-worker-4 1/1 Running 0 88s
elasticdl-test-mnist-worker-5 1/1 Running 0 88s
elasticdl-test-mnist-worker-6 1/1 Running 0 88s
test-nginx-7585fc5976-5hs7h 1/1 Running 0 91s