- Create cluster
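- A minimal sketch, assuming a k3s cluster (the "agent 1" naming below follows k3s conventions); adjust for your distribution:
curl -sfL https://get.k3s.io | sh -
- Join each agent to the server (K3S_URL and K3S_TOKEN are the standard k3s join variables; <server-ip> and <node-token> are placeholders):
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -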
- Install ArgoCD
cd install
kubectl create namespace argocd
- Apply secrets
kubectl apply -f ../secrets/repo-secret.yaml
kubectl apply -f ../secrets/sops-age-secret.yaml
helm install argo-cd argo-cd/ --namespace argocd --values values-override.yaml
- Expose ArgoCD via a LoadBalancer service
kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
- Login to ArgoCD
kubectl port-forward svc/argocd-server -n argocd 8080:443
argocd admin initial-password -n argocd
- Log in with username admin and the initial password from the previous step
- Change the password
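- One way, using the argocd CLI against the port-forward above (--insecure is needed for the self-signed certificate):
argocd login localhost:8080 --username admin --insecure
argocd account update-password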
- Allow ArgoCD to manage itself
cd install
helm upgrade argo-cd ./argo-cd --namespace argocd -f values-override.yaml
- Install nvidia driver on agent 1
sudo apt install gcc build-essential -y && wget http://international.download.nvidia.com/XFree86/Linux-x86_64/545.29.06/NVIDIA-Linux-x86_64-545.29.06.run && sudo chmod +x NVIDIA-Linux-x86_64-545.29.06.run && sudo ./NVIDIA-Linux-x86_64-545.29.06.run
- Let the installer disable nouveau and patch the initramfs, then reboot
- After the reboot, run the installer again to install the driver
sudo ./NVIDIA-Linux-x86_64-545.29.06.run
- Reboot
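- Verify the driver loaded after the reboot:
nvidia-smi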
- Apply nvidia driver patches (keylase/nvidia-patch lifts the NVENC concurrent-session limit; patch-fbc.sh does the same for NvFBC)
sudo mkdir /opt/nvidia && sudo chown -R $USER /opt/nvidia && cd /opt/nvidia && git clone https://github.com/keylase/nvidia-patch && cd nvidia-patch && sudo bash patch.sh && sudo bash patch-fbc.sh
- Install nvidia container toolkit
- Follow https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#installing-with-apt
- Or just run this:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list \
  | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' \
  | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && sudo apt update && sudo apt install -y nvidia-container-toolkit
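- If the node runs stock containerd rather than k3s's bundled containerd (which picks up the toolkit automatically), register the runtime and restart containerd:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd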
- Install nvidia plugin
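- If the plugin isn't managed as an ArgoCD app in this repo, one way is the upstream static manifest (check the k8s-device-plugin README for the current version and path):
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml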
- Test cuda
- Pull image
sudo ctr image pull docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04
- Run image (ctr run takes the image, then a container ID, then the command to execute)
sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.3.1-base-ubuntu20.04 cuda-test nvidia-smi
- Reboot agent 1
- Test cuda on kubernetes
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "all"
      # Try to avoid using resource limits, so we can use the gpu on multiple containers
      #resources:
      #  limits:
      #    nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
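- Once the pod is Running, verify GPU access from inside it (nvidia-smi is injected by the nvidia runtime since NVIDIA_DRIVER_CAPABILITIES is "all"):
kubectl exec -it gpu-pod -- nvidia-smi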
| Service | Namespace | Description | URL |
|---|---|---|---|
| ArgoCD | argocd | GitOps | https://localhost:8080 (Proxy) |
| Ksops | N/A | SOPS wrapper for Kubernetes secrets | N/A |
| Plex (Published) | media | Media server | https://app.plex.tv |
| Plex (Local) | media | Media server | https://app.plex.tv |
| Longhorn | longhorn-system | Replicated local storage | N/A |
| local-storage-provisioner | local-path-storage | Local storage provisioner | N/A |
| Accomplice-V2 | discord | Home-grown Discord leaderboard bot | N/A |
| MetalLB | metallb-system | Load balancer | N/A |
| Nvidia Device Plugin | system | Nvidia GPU support | N/A |
| Traefik | traefik | Ingress controller | N/A |