This guideline describes how to configure fiab on Amazon Linux 2 with GPU-supported instance types (e.g., p2).
This section is specifically for an AWS EC2 instance running the Amazon Linux 2 image with a GPU. For other Linux distributions without a GPU (whether VM or bare-metal machine), refer to the Ubuntu guideline; it can be followed easily by substituting the respective package manager.
For the Amazon Linux 2 image (amzn2), the following tools are necessary: `minikube`, `kubectl`, `helm`, `cri-dockerd`, `crictl`, `docker`, and `jq`.
The setup was tested on an EC2 instance with a GPU (e.g., p2 instances).
To set up fiab, run `install.sh` under the `fiab` folder:
cd fiab
./install.sh amzn2
Note: If `install.sh` is executed, the steps below in the prerequisites section must be skipped. Go to the starting minikube part.
These prerequisite steps should be executed only once.
The following shows the steps incorporated in the `install.sh` script, which can be followed manually to understand what the script does.
Install Docker as per this document.
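As a minimal sketch (assuming the standard Amazon Linux 2 extras repository; the linked document is authoritative), Docker can be installed as follows:

```bash
# install Docker from the Amazon Linux 2 extras repository
sudo amazon-linux-extras install -y docker
# start Docker now and enable it at boot
sudo systemctl enable --now docker
# allow the current user to run docker without sudo (re-login required)
sudo usermod -aG docker $USER
```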
# set up golang compilation env
wget https://storage.googleapis.com/golang/getgo/installer_linux
chmod +x ./installer_linux
./installer_linux
source ~/.bash_profile
# download cri-docker
git clone https://github.com/Mirantis/cri-dockerd.git
cd cri-dockerd
mkdir bin
go build -o bin/cri-dockerd
# install cri-docker
sudo install -o root -g root -m 0755 bin/cri-dockerd /usr/bin/cri-dockerd
sudo cp -a packaging/systemd/* /etc/systemd/system
sudo systemctl daemon-reload
sudo systemctl enable cri-docker.service
sudo systemctl enable --now cri-docker.socket
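As an optional verification that cri-dockerd is up, the service and socket status can be checked:

```bash
# verify the cri-docker service and socket are active
systemctl is-active cri-docker.service
systemctl is-active cri-docker.socket
# print the installed cri-dockerd version
cri-dockerd --version
```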
# install crictl
VERSION="v1.25.0"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/$VERSION/crictl-$VERSION-linux-amd64.tar.gz
sudo tar zxvf crictl-$VERSION-linux-amd64.tar.gz -C /usr/local/bin
rm -f crictl-$VERSION-linux-amd64.tar.gz
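Optionally, `crictl` can be pointed at the cri-dockerd socket to confirm that the runtime responds; the socket path below assumes the default cri-dockerd packaging:

```bash
# query the container runtime through the cri-dockerd socket
sudo crictl --runtime-endpoint unix:///var/run/cri-dockerd.sock version
```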
# install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-latest.x86_64.rpm
sudo rpm -Uvh minikube-latest.x86_64.rpm
# install kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# install helm
HELM_VERSION=v3.10.2
curl -LO https://get.helm.sh/helm-$HELM_VERSION-linux-amd64.tar.gz
tar -zxvf helm-$HELM_VERSION-linux-amd64.tar.gz
sudo mv linux-amd64/helm /usr/local/bin/helm
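A quick sanity check confirms that all three tools are installed and on the PATH:

```bash
# print client versions to confirm installation
minikube version
kubectl version --client
helm version
```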
A minikube environment is reusable until it is deleted by executing `minikube delete`.
If the minikube environment is destroyed, this step needs to be executed again.
If it is stopped by running `sudo minikube stop`, one can simply restart it by running `sudo minikube start` without needing to follow the steps below.
sudo minikube start --driver=none --apiserver-ips 127.0.0.1 --apiserver-name localhost --cni=bridge
Note: If an `Exiting due to HOST_JUJU_LOCK_PERMISSION` error occurs, run the following command:
sudo sysctl fs.protected_regular=0
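This setting does not survive a reboot; one way to make it persistent (a sketch, assuming a standard sysctl.d layout) is:

```bash
# persist the workaround across reboots
echo 'fs.protected_regular = 0' | sudo tee /etc/sysctl.d/99-minikube.conf
sudo sysctl --system
```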
Run the following commands to ensure that `kubectl` can be executed without `sudo`:
# remove any old config
rm -rf $HOME/.kube $HOME/.minikube
# transfer config to a normal user so that kubectl commands can be executed without sudo
sudo cp -rf /root/.kube /root/.minikube $HOME
sudo chown -R $USER $HOME/.kube $HOME/.minikube
# update the cert file's location correctly
sed -i 's@/root@'"$HOME"'@' $HOME/.kube/config
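After the transfer, a simple query should succeed without `sudo`:

```bash
# should list the single minikube node without requiring sudo
kubectl get nodes
```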
- If an NVIDIA GPU is available in the machine, run the following command to install the NVIDIA device plugin (a verification sketch follows the output below):
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
- To check if GPUs are enabled, run the following command:
kubectl get nodes -ojson | jq .items[].status.capacity
An output should look similar to:
{
"cpu": "4",
"ephemeral-storage": "524275692Ki",
"hugepages-1Gi": "0",
"hugepages-2Mi": "0",
"memory": "62766704Ki",
"nvidia.com/gpu": "1",
"pods": "110"
}
More details are found here.
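As a verification sketch for the device plugin installed above (the daemonset name is taken from the NVIDIA device plugin manifest and may differ across versions):

```bash
# wait for the NVIDIA device plugin daemonset to become ready
kubectl -n kube-system rollout status ds/nvidia-device-plugin-daemonset
```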
Deploy Node Feature Discovery (NFD) as a daemonset.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/nfd.yaml
Deploy NVIDIA GPU Feature Discovery (GFD) as a daemonset.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/gpu-feature-discovery/v0.7.0/deployments/static/gpu-feature-discovery-daemonset.yaml
To verify the labels added by NFD and GFD, run the following command:
kubectl get nodes -o yaml
The above command will output something similar to the following:
apiVersion: v1
items:
- apiVersion: v1
kind: Node
metadata:
...
labels:
...
nvidia.com/cuda.driver.major: "470"
nvidia.com/cuda.driver.minor: "57"
nvidia.com/cuda.driver.rev: "02"
nvidia.com/cuda.runtime.major: "11"
nvidia.com/cuda.runtime.minor: "4"
nvidia.com/gfd.timestamp: "1672792567"
nvidia.com/gpu.compute.major: "3"
nvidia.com/gpu.compute.minor: "7"
nvidia.com/gpu.count: "1"
nvidia.com/gpu.family: kepler
nvidia.com/gpu.machine: HVM-domU
nvidia.com/gpu.memory: "11441"
nvidia.com/gpu.product: Tesla-K80
nvidia.com/gpu.replicas: "1"
nvidia.com/mig.capable: "false"
...
...
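To inspect just the NVIDIA labels instead of the full node YAML, a `jq` filter such as the following can be used (a sketch; the label prefix matches the output above):

```bash
# show only the nvidia.com/* labels on each node
kubectl get nodes -o json \
  | jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia.com/")))'
```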
Next, the `ingress` and `ingress-dns` addons need to be enabled with the following commands:
sudo minikube addons enable ingress
sudo minikube addons enable ingress-dns
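To confirm the addons are running, check the ingress controller pods; recent minikube versions place them in the `ingress-nginx` namespace (an assumption that may vary by minikube version):

```bash
# list enabled addons and the ingress controller pods
sudo minikube addons list | grep ingress
kubectl get pods -n ingress-nginx
```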
As a final step, a cert manager is needed to enable TLS. The `setup-cert-manager.sh` script installs and configures a cert manager for self-signed certificate creation. Run the following command:
./setup-cert-manager.sh
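To verify the installation, check that the cert manager's pods reach the `Running` state; the `cert-manager` namespace is the usual default (an assumption, since the script's internals are not shown here):

```bash
# cert-manager, cainjector, and webhook pods should be Running
kubectl get pods -n cert-manager
```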
To simply use flame, skip this step and go to the starting flame step. Building the flame container image is only needed if a developer makes changes to the source code and wants to test the local changes.
To build the flame container image, run the following:
./build-image.sh
To check the flame image built, run `docker images`. The output is similar to:
REPOSITORY TAG IMAGE ID CREATED SIZE
flame latest e3bf47cdfa66 22 seconds ago 3.96GB
k8s.gcr.io/kube-apiserver v1.22.3 53224b502ea4 7 weeks ago 128MB
k8s.gcr.io/kube-scheduler v1.22.3 0aa9c7e31d30 7 weeks ago 52.7MB
k8s.gcr.io/kube-controller-manager v1.22.3 05c905cef780 7 weeks ago 122MB
k8s.gcr.io/kube-proxy v1.22.3 6120bd723dce 7 weeks ago 104MB
kubernetesui/dashboard v2.3.1 e1482a24335a 6 months ago 220MB
k8s.gcr.io/etcd 3.5.0-0 004811815584 6 months ago 295MB
kubernetesui/metrics-scraper v1.0.7 7801cfc6d5c0 6 months ago 34.4MB
k8s.gcr.io/coredns/coredns v1.8.4 8d147537fb7d 6 months ago 47.6MB
gcr.io/k8s-minikube/storage-provisioner v5 6e38f40d628d 8 months ago 31.5MB
k8s.gcr.io/pause 3.5 ed210e3e4a5b 9 months ago 683kB
Open a new terminal window and start the minikube tunnel with the following command:
sudo minikube tunnel
The tunnel creates a routable IP for deployment.
To bring up flame and its dependent applications, `helm` is used.
A shell script (`flame.sh`) that wraps helm is provided.
Run the following command:
./flame.sh start
The above command ensures that the latest official flame image from Docker Hub is used.
To use a locally developed image, add `--local-img` to the above command.
Note: The following error may occur during the start.
Error: INSTALLATION FAILED: failed post-install: timed out waiting for the condition
This issue may occur because the container images are large or the Internet connection is slow.
The issue has been reported in the minikube github repo.
The latest minikube still doesn't contain the patched component (cri-dockerd 0.2.6).
A workaround is to pull images manually (e.g., `minikube ssh docker pull ciscoresearch/flame:v0.2.2`).
The command `kubectl get pods -n flame` gives a list of pods and their status.
Pods with `ErrImagePull` or `ImagePullBackOff` status are the ones that might be affected by the issue.
The required image can be identified by running a `kubectl describe` command (e.g., `kubectl describe pod -n flame flame-apiserver-5df5fb6bc4-22z6l`); the command's output shows details about the pod, including the image name and its tag.
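Putting the workaround together, a sketch looks like the following (the pod name and image tag are illustrative; substitute the values from your own cluster):

```bash
# find pods stuck pulling images
kubectl get pods -n flame | grep -E 'ErrImagePull|ImagePullBackOff'
# look up the image required by an affected pod (name is illustrative)
kubectl describe pod -n flame flame-apiserver-5df5fb6bc4-22z6l | grep 'Image:'
# pull the image manually inside the minikube node
minikube ssh docker pull ciscoresearch/flame:v0.2.2
```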
To check deployment status, run the following command:
kubectl get pods -n flame
An example output looks like the following:
NAME READY STATUS RESTARTS AGE
flame-apiserver-5df5fb6bc4-22z6l 1/1 Running 0 7m5s
flame-controller-566684676b-g4pwr 1/1 Running 6 (4m4s ago) 7m5s
flame-mlflow-965c86b47-vd8th 1/1 Running 0 7m5s
flame-mongodb-0 1/1 Running 0 3m41s
flame-mongodb-1 1/1 Running 0 4m3s
flame-mongodb-arbiter-0 1/1 Running 0 7m5s
flame-mosquitto-6754567c88-rfmk7 1/1 Running 0 7m5s
flame-mosquitto2-676596996b-d5dzj 1/1 Running 0 7m5s
flame-notifier-cf4854cd9-g27wj 1/1 Running 0 7m5s
postgres-7fd96c847c-6qdpv 1/1 Running 0 7m5s
In Amazon EC2, the `flame.test` domain needs to be added to Route 53 with the minikube IP address, which can be obtained by running `minikube ip`. Without the Route 53 configuration, the ping test below will fail.
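As a hedged sketch of the Route 53 step (the hosted zone ID is a placeholder; it assumes a private hosted zone for `flame.test` associated with the instance's VPC):

```bash
# create/update a wildcard record pointing *.flame.test to the minikube IP
MINIKUBE_IP=$(minikube ip)
ZONE_ID=Z0123456789EXAMPLE   # placeholder hosted zone ID; replace with yours
aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "*.flame.test",
        "Type": "A",
        "TTL": 300,
        "ResourceRecords": [{"Value": "'"$MINIKUBE_IP"'"}]
      }
    }]
  }'
```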
To test that routing and DNS are configured successfully, run the following commands:
ping -c 1 apiserver.flame.test
ping -c 1 notifier.flame.test
ping -c 1 mlflow.flame.test
These ping commands should run successfully without any error. Alternatively, open a browser and go to `mlflow.flame.test`, which should return mlflow's web page.
Once done with flame, one can stop it by running the following command:
./flame.sh stop
Before starting flame again, make sure that all the pods in the flame namespace are deleted.
To check that, use the `kubectl get pods -n flame` command.
In Kubernetes, a pod is the smallest, most basic deployable object; a pod consists of at least one container instance.
Using the pod's name (e.g., `flame-apiserver-65d8c7fcf4-z8x5b`), one can log into the running pod as follows:
kubectl exec -it -n flame flame-apiserver-65d8c7fcf4-z8x5b -- bash
Logs of flame components are found at `/var/log/flame` in the instance.
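For a quick look without opening an interactive shell, `kubectl exec` can run a one-off command; this sketch assumes the logs live inside the pod's filesystem at that path (the pod name is illustrative, and the log file names are not listed here, so list the directory first):

```bash
# list available flame log files inside the pod
kubectl exec -n flame flame-apiserver-65d8c7fcf4-z8x5b -- ls /var/log/flame
```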
The following command creates `config.yaml` under `$HOME/.flame`:
./build-config.sh
The flame CLI tool, `flamectl`, uses the configuration file (`config.yaml`) to interact with the flame system.
To build `flamectl`, run `make install` from the top-level folder (i.e., `flame`).
This command compiles the source code and installs the `flamectl` binary as well as other binaries into `$HOME/.flame/bin`.
You may want to add `export PATH="$HOME/.flame/bin:$PATH"` to your shell config (e.g., `~/.zshrc`, `~/.bashrc`) and then reload your shell config (e.g., `source ~/.bashrc`).
The examples here assume that `flamectl` is in `$HOME/.flame/bin` and that this path is exported.
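For example, with bash (the config file path is an example):

```bash
# add flamectl's directory to PATH and reload the shell config
echo 'export PATH="$HOME/.flame/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc
# confirm flamectl is now resolvable
which flamectl
```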
To terminate the fiab environment, run the following:
sudo minikube delete
Note: Executing the above command also deletes any downloaded or locally-built images along with the VM.
Unless a fresh minikube instance is needed, simply stopping the minikube instance (i.e., `sudo minikube stop`) is useful to save time during development and testing.
To run a sample MNIST job, refer to the instructions at the mnist example.