Please refer to NTHU-LSALAB/KubeShare
Make GPUs shareable in Kubernetes
- Compatible with native K8s resource management
- Fine-grained resource definition
- Support GPU compute and memory limits
- Support multiple GPUs in a node
- Avoid the GPU fragmentation problem
- Support GPU namespace
- pkg: golang packages for K8s generated by code-generator release-1.14 (6c2a4329ac29)
- Only supports the Nvidia GPU device plugin with nvidia-docker2 in K8s; not compatible with Docker (version >= 19) using the newer GPU resource API.
- Currently only supports CUDA 9.0.
- Requires Kubernetes version >= 1.10 (device plugin & CRD).
- A K8s cluster with Nvidia GPU device plugin.
- kubectl with admin permissions.
kubectl create -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/crd.yaml
kubectl create -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/controller.yaml
kubectl create -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/daemonset.yaml
Some environment variables and volume mounts needed by a shareable GPU are immutable after a Pod has been created. In order to inject this extra information into Pods created by users, we define a new CustomResourceDefinition (CRD) named MtgpuPod (Multi-tenant GPU Pod) as the basic execution unit, taking the role originally played by Pod.
An example of an MtgpuPod spec:
apiVersion: lsalab.nthu/v1
kind: MtgpuPod
metadata:
  name: pod1
  annotations:
    "lsalab.nthu/gpu_request": "0.5"
    "lsalab.nthu/gpu_limit": "1.0"
    "lsalab.nthu/gpu_mem": "1073741824" # 1Gi, in bytes
    "lsalab.nthu/GPUID": "abc"
spec:
  nodeName: node1 # must be assigned
  containers:
  - name: sleep
    image: nvidia/cuda:9.0-base
    command: ["sh", "-c"]
    args:
    - 'nvidia-smi -L'
    resources:
      requests:
        cpu: "1"
        memory: "500Mi"
      limits:
        cpu: "1"
        memory: "500Mi"
Because K8s forbids floating-point values in custom device requests, we move the GPU resource usage definitions into annotations (a parsing sketch follows the list below).
- lsalab.nthu/gpu_request: guaranteed GPU usage of the Pod, gpu_request <= "1.0".
- lsalab.nthu/gpu_limit: maximum extra usage if the GPU still has free resources, gpu_request <= gpu_limit <= "1.0".
- lsalab.nthu/gpu_mem: maximum GPU memory usage of the Pod, in bytes.
- lsalab.nthu/GPUID: described in the section "Controlling everything of shareable GPU".
- spec is a normal PodSpec definition to be deployed.
- spec.nodeName must be assigned (a deployed MtgpuPod must already be scheduled). More information is given in the section "Cluster resources accounting".
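A minimal sketch of how a controller might parse and validate these annotations; the gpushare package name, the GPUSpec type, and the parseGPUAnnotations helper are illustrative assumptions, not this project's actual API.

package gpushare

import (
	"fmt"
	"strconv"
)

// GPUSpec mirrors the annotation-based GPU resource definition of an MtgpuPod.
type GPUSpec struct {
	Request float64 // guaranteed GPU usage, <= 1.0
	Limit   float64 // maximum usage, Request <= Limit <= 1.0
	Memory  int64   // maximum GPU memory, in bytes
	GPUID   string  // temporary identifier of a physical GPU card
}

// parseGPUAnnotations extracts and validates the lsalab.nthu/* annotations.
func parseGPUAnnotations(ann map[string]string) (*GPUSpec, error) {
	req, err := strconv.ParseFloat(ann["lsalab.nthu/gpu_request"], 64)
	if err != nil {
		return nil, fmt.Errorf("bad gpu_request: %v", err)
	}
	lim, err := strconv.ParseFloat(ann["lsalab.nthu/gpu_limit"], 64)
	if err != nil {
		return nil, fmt.Errorf("bad gpu_limit: %v", err)
	}
	mem, err := strconv.ParseInt(ann["lsalab.nthu/gpu_mem"], 10, 64)
	if err != nil {
		return nil, fmt.Errorf("bad gpu_mem: %v", err)
	}
	// Enforce gpu_request <= gpu_limit <= 1.0 as described above.
	if req < 0 || req > lim || lim > 1.0 {
		return nil, fmt.Errorf("require 0 <= gpu_request <= gpu_limit <= 1.0")
	}
	return &GPUSpec{Request: req, Limit: lim, Memory: mem, GPUID: ann["lsalab.nthu/GPUID"]}, nil
}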
A lsalab.nthu/GPUID (abbrev. GPUID) value is a randomly generated string of length 5 that temporarily represents a physical GPU card. A GPUID value is valid only while at least one Pod on the Node is using it. The same GPUID can be assigned to multiple Pods simultaneously if those Pods want to use the same physical GPU card.
A GPUID must be unique within Node scope.
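A minimal sketch of the randomString(5) generator used in the walkthrough below; the character set is an assumption based on the example IDs ("zxcvb", "qwert").

package gpushare

import (
	"math/rand"
	"time"
)

// Assumed character set; the examples in this README suggest lowercase letters.
const gpuIDLetters = "abcdefghijklmnopqrstuvwxyz"

var rng = rand.New(rand.NewSource(time.Now().UnixNano()))

// randomString returns a random identifier of length n, used as a GPUID,
// e.g. randomString(5) => "zxcvb".
func randomString(n int) string {
	b := make([]byte, n)
	for i := range b {
		b[i] = gpuIDLetters[rng.Intn(len(gpuIDLetters))]
	}
	return string(b)
}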
An example of controlling GPU sharing (one node with two physical GPUs):
GPU1 GPU2
+--------------+ +--------------+
| | | |
| | | |
| | | |
| | | |
| | | |
+--------------+ +--------------+
randomString(5): zxcvb (wants a new physical GPU)
Create Pod1 gpu_request:0.2 GPUID:zxcvb
GPU1 GPU2(zxcvb)
+--------------+ +--------------+
| | | Pod1:0.2 |
| | | |
| | | |
| | | |
| | | |
+--------------+ +--------------+
randomString(5): qwert (doesn't want to share with Pod1; wants a new physical GPU)
Create Pod2 gpu_request:0.3 GPUID:qwert
GPU1(qwert) GPU2(zxcvb)
+--------------+ +--------------+
| Pod2:0.3 | | Pod1:0.2 |
| | | |
| | | |
| | | |
| | | |
+--------------+ +--------------+
Create Pod3 gpu_request:0.4 GPUID:zxcvb (wants to share with Pod1)
GPU1(qwert) GPU2(zxcvb)
+--------------+ +--------------+
| Pod2:0.3 | | Pod1:0.2 |
| | | Pod3:0.4 |
| | | |
| | | |
| | | |
+--------------+ +--------------+
Delete Pod2 (GPUID qwert is no longer available)
GPU1 GPU2(zxcvb)
+--------------+ +--------------+
| | | Pod1:0.2 |
| | | Pod3:0.4 |
| | | |
| | | |
| | | |
+--------------+ +--------------+
randomString(5): asdfg (doesn't want to share with Pod1 and Pod3; wants a new physical GPU)
Create Pod4 gpu_request:0.5 GPUID:asdfg
GPU1(asdfg) GPU2(zxcvb)
+--------------+ +--------------+
| Pod4:0.5 | | Pod1:0.2 |
| | | Pod3:0.4 |
| | | |
| | | |
| | | |
+--------------+ +--------------+
The K8s default-scheduler does not recognize MtgpuPods, so it may schedule ordinary Pods onto a Node whose physical GPUs are already occupied by MtgpuPods. To prevent this, whenever a new GPUID is generated we run an Occupy-Pod with one Nvidia device plugin GPU request (spec.containers[0].resources.requests: "nvidia.com/gpu": 1), telling the default-scheduler that the GPU is claimed by the MtgpuPod system.
Occupy-Pods are created in the kube-system namespace, with the naming format mtgpupod-occupypod-{NodeName}-{GPUID}.
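A minimal sketch, assuming the standard k8s.io/api and k8s.io/apimachinery packages, of how a controller might build such an Occupy-Pod; the container name and pause image are illustrative assumptions, not the project's actual choices.

package gpushare

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newOccupyPod builds a Pod that requests one whole nvidia.com/gpu on the
// given node, so the default-scheduler sees the physical GPU behind gpuID
// as taken.
func newOccupyPod(nodeName, gpuID string) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("mtgpupod-occupypod-%s-%s", nodeName, gpuID),
			Namespace: "kube-system",
		},
		Spec: corev1.PodSpec{
			NodeName: nodeName, // pin to the node that owns the GPU
			Containers: []corev1.Container{{
				Name:  "occupy",               // illustrative name
				Image: "k8s.gcr.io/pause:3.1", // assumed placeholder image
				Resources: corev1.ResourceRequirements{
					// Extended resources are specified in limits;
					// requests then default to the same value.
					Limits: corev1.ResourceList{
						"nvidia.com/gpu": resource.MustParse("1"),
					},
				},
			}},
		},
	}
}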
The K8s default-scheduler does not currently support shareable custom devices, so a dedicated MtgpuPod-scheduler is necessary to deploy MtgpuPods automatically. Although we may provide a basic MtgpuPod-scheduler in the future that only supports resource request scheduling, we describe here the method (pseudo-code) for accounting shareable custom device resources, followed by a Go sketch.
GPUResources := empty list  // free GPU resources of every Node
for each Node:
    availableGPU := total number of physical GPUs on Node
    allocatedGPUMap := empty map of GPUID => remaining GPU fraction
    for each Pod on Node:
        if not Pod.Name.Contains("mtgpupod-occupypod"):  // skip Occupy-Pods; their GPUs are accounted through MtgpuPods below
            availableGPU -= sum of "nvidia.com/gpu" requests over containers in Pod
    for each MtgpuPod on Node:
        if allocatedGPUMap contains the GPUID of MtgpuPod:
            allocatedGPUMap[GPUID] -= GPU request of MtgpuPod
        else:
            availableGPU -= 1
            allocatedGPUMap[GPUID] = 1.0 - GPU request of MtgpuPod
    GPUResources.Add(Node, availableGPU, allocatedGPUMap)
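The same accounting as runnable Go; the Node, Pod, and MtgpuPod structs below are simplified stand-ins for the real API types, not this project's actual definitions.

package gpushare

import "strings"

type SimplePod struct {
	Name        string
	GPURequests int64 // sum of "nvidia.com/gpu" requests over containers
}

type SimpleMtgpuPod struct {
	GPUID      string
	GPURequest float64 // fraction of one GPU, <= 1.0
}

type SimpleNode struct {
	Name      string
	TotalGPUs float64
	Pods      []SimplePod
	MtgpuPods []SimpleMtgpuPod
}

type NodeGPUResources struct {
	AvailableGPU float64            // whole free GPUs on the node
	Allocated    map[string]float64 // GPUID => remaining fraction
}

// accountGPUResources implements the pseudo-code above for every node.
func accountGPUResources(nodes []SimpleNode) map[string]NodeGPUResources {
	result := make(map[string]NodeGPUResources)
	for _, node := range nodes {
		available := node.TotalGPUs
		allocated := make(map[string]float64)
		for _, pod := range node.Pods {
			// Skip Occupy-Pods: the GPUs they hold are accounted
			// through the MtgpuPods below.
			if !strings.Contains(pod.Name, "mtgpupod-occupypod") {
				available -= float64(pod.GPURequests)
			}
		}
		for _, mp := range node.MtgpuPods {
			if _, ok := allocated[mp.GPUID]; ok {
				allocated[mp.GPUID] -= mp.GPURequest
			} else {
				available -= 1
				allocated[mp.GPUID] = 1.0 - mp.GPURequest
			}
		}
		result[node.Name] = NodeGPUResources{AvailableGPU: available, Allocated: allocated}
	}
	return result
}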
kubectl delete -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/crd.yaml
kubectl delete -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/controller.yaml
kubectl delete -f https://lsalab.cs.nthu.edu.tw/~ericyeh/gpusharing/daemonset.yaml
Currently, GPU memory usage control may not work properly.