GPUs and Kubernetes for deep learning — Part 1/3

Ubuntu & Kubernetes

A few weeks ago I shared a side project about Building a DYI GPU cluster for k8s to play with Kubernetes with a proper ROI vs. AWS g2 instances.

This was spectacularly interesting when AWS was lagging behind with old nVidia K20s cards (which are not supported anymore on the latest drivers). But with the addition of the P series (p2.xlarge, 8xlarge and 16xlarge) the new cards are K80s with 12GB RAM, outrageously more powerful than the previous ones.

Baidu just released a post on the Kubernetes blog about the PaddlePaddle setup, but they only focused on CPUs. I thought it would be interesting looking at a setup of Kubernetes on AWS adding some GPU nodes, then exercise a Deep Learning framework on it. The docs say it is possible…

This post is the first of a sequence of 3: Setup the GPU cluster (this blog), Adding Storage to a Kubernetes Cluster (right afterwards), and finally run a Deep Learning training on the cluster (working on it, coming up post MWC…).

The plan

In this blog, we will:

  1. Deploy k8s on AWS in a development mode (no HA, colocating etcd, the control plane and PKI)
  2. Deploy 2x nodes with GPUs (p2.xlarge and p2.8xlarge instances)
  3. Deploy 3x nodes with CPU only (m4.xlarge)
  4. Validate GPU availability

Requirements

For what follows, it is important that:

  • You understand Kubernetes 101
  • You have admin credentials for AWS
  • If you followed the other posts, you know we’ll be using the Canonical Distribution of Kubernetes, hence some knowledge about Ubuntu, Juju and the rest of Canonical’s ecosystem will help.

Foreplay

  • Make sure you have Juju installed.

On Ubuntu,

sudo apt-add-repository ppa:juju/stable
sudo apt update
sudo apt install -yqq juju

for other OSes, lookup the official docs

Then to connect to the AWS cloud with your credentials, read this page

  • Finally copy this repo to have access to all the sources
git clone https://github.com/madeden/blogposts ./
cd blogposts/k8s-gpu-cloud

OK! Let’s start GPU-izing the world!

Deploying the cluster

Bootstrap

As usual start with the bootstrap sequence. Just be careful that p2 instances are only available in us-west-2, us-east-1 and eu-west-2 as well as the us-gov regions. I have experienced issues running p2 instances on the EU side hence I recommend using a US region.

juju bootstrap aws/us-east-1 — credential canonical — constraints “cores=4 mem=16G root-disk=64G” 
# Creating Juju controller “aws-us-east-1” on aws/us-east-1
# Looking for packaged Juju agent version 2.1-rc1 for amd64
# Launching controller instance(s) on aws/us-east-1…
# — i-0d48b2c872d579818 (arch=amd64 mem=16G cores=4)
# Fetching Juju GUI 2.3.0
# Waiting for address
# Attempting to connect to 54.174.129.155:22
# Attempting to connect to 172.31.15.3:22
# Logging to /var/log/cloud-init-output.log on the bootstrap machine
# Running apt-get update
# Running apt-get upgrade
# Installing curl, cpu-checker, bridge-utils, cloud-utils, tmux
# Fetching Juju agent version 2.1-rc1 for amd64
# Installing Juju machine agent
# Starting Juju machine agent (service jujud-machine-0)
# Bootstrap agent now started
# Contacting Juju controller at 172.31.15.3 to verify accessibility…
# Bootstrap complete, “aws-us-east-1” controller now available.
# Controller machines are in the “controller” model.
# Initial model “default” added.

Deploying instances

Once the controller is ready we can start deploying services. In my previous posts, I used bundles which are shortcuts to deploy complex apps.

If you are already familiar with Juju you can run juju deploy src/k8s-gpu.yaml and jump to the end of this section. For the others interested in getting into the details, this time we will deploy manually, and go through the logic of the deployment.

Kubernetes is made up of 5 individual applications: Master, Worker, Flannel (network), etcd (cluster state storage DB) and easyRSA (PKI to encrypt communication and provide x509 certs).

In Juju, each app is modeled by a charm, which is a recipe for how to deploy it.

At deployment time, you can give constraints to Juju, either very specific (instance type) or laxist (# of cores). With the latter, Juju will elect the cheapest instance matching your constraints on the target cloud.

First thing to do, is deploy the applications:

juju deploy cs:~containers/kubernetes-master-11 --constraints "cores=4 mem=8G root-disk=32G"
# Located charm "cs:~containers/kubernetes-master-11".
# Deploying charm "cs:~containers/kubernetes-master-11".
juju deploy cs:~containers/etcd-23 --to 0
# Located charm "cs:~containers/etcd-23".
# Deploying charm "cs:~containers/etcd-23".
juju deploy cs:~containers/easyrsa-6 --to lxd:0
# Located charm "cs:~containers/easyrsa-6".
# Deploying charm "cs:~containers/easyrsa-6".
juju deploy cs:~containers/flannel-10
# Located charm "cs:~containers/flannel-10".
# Deploying charm "cs:~containers/flannel-10".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=p2.xlarge" kubernetes-worker-gpu
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=p2.8xlarge" kubernetes-worker-gpu8
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".
juju deploy cs:~containers/kubernetes-worker-13 --constraints "instance-type=m4.2xlarge" -n3 kubernetes-worker-cpu
# Located charm "cs:~containers/kubernetes-worker-13".
# Deploying charm "cs:~containers/kubernetes-worker-13".

Here you can see an interesting property in Juju that we never approached before: naming the services you deploy. We deployed the same kubernetes-worker charm twice, but twice with GPUs and the other without. This gives us a way to group instances of a certain type, at the cost of duplicating some commands.

Also note the revision numbers in the charms we deploy. Revisions are not directly tight to versions of the software they deploy. If you omit them, Juju will elect the latest revision, like Docker would do on images.

Adding the relations & exposing software

Now that the applications are deployed, we need to tell Juju how they are related. For example, the Kubernetes master needs certificates to secure its API. Therefore, there is a relation between the kubernetes-master:certificates and easyrsa:client.

This relation means that once the 2 applications are connected, some scripts will run to query the EasyRSA API to create the required certificates, then copy them in the right location on the k8s master.

These relations then create statuses in the cluster, to which charms can react.

Essentially, very high level, think Juju as a pub-sub implementation of application deployment. Every action inside or outside of the cluster posts a message to a common bus, and charms can react to these and perform additional actions, modifying the overall state… and so on and so on until equilibrium is reached.

Let’s add the relations:

juju add-relation kubernetes-master:certificates easyrsa:client
juju add-relation etcd:certificates easyrsa:client
juju add-relation kubernetes-master:etcd etcd:db
juju add-relation flannel:etcd etcd:db
juju add-relation flannel:cni kubernetes-master:cni
for TYPE in cpu gpu gpu8
do 
 juju add-relation kubernetes-worker-${TYPE}:kube-api-endpoint kubernetes-master:kube-api-endpoint
 juju add-relation kubernetes-master:cluster-dns kubernetes-worker-${TYPE}:kube-dns
 juju add-relation kubernetes-worker-${TYPE}:certificates easyrsa:client
 juju add-relation flannel:cni kubernetes-worker-${TYPE}:cni
 juju expose kubernetes-worker-${TYPE}
done
juju expose kubernetes-master

Note at the end the expose commands.

These are instructions for Juju to open up a firewall in the cloud for specific ports of the instances. Some are predefined in charms (Kubernetes Master API is 6443, Workers open up 80 and 443 for ingresses) but you can also force them if you need (for example, when you manually add stuff in the instances post deployment).

Adding CUDA

CUDA does not have an official charm yet (coming up very soon!!), but there is my demoware implementation which you can find on GitHub. It has been updated for this post to CUDA 8.0.61 and drivers 375.26.

Make sure you have the charm tools available, clone and build the CUDA charm:

sudo apt install charm charm-tools
# Exporting the ENV
mkdir -p ~/charms ~/charms/layers ~/charms/interfaces
export JUJU_REPOSITORY=${HOME}/charms
export LAYER_PATH=${JUJU_REPOSITORY}/layers
export INTERFACE_PATH=${JUJU_REPOSITORY}/interfaces
# Build the charm
cd ${LAYER_PATH}
git clone https://github.com/SaMnCo/layer-nvidia-cuda cuda
charm build cuda

This will create a new folder called builds in JUJU_REPOSITORY, and another called cuda in there.

Now you can deploy the charm

juju deploy --series xenial $HOME/charms/builds/cuda
juju add-relation cuda kubernetes-worker-gpu
juju add-relation cuda kubernetes-worker-gpu8

This will take a fair amount of time as CUDA is very long to install (CDK takes about 10min and just CUDA probably 15min).
Nevertheless, at the end the status should show:

juju status
Model    Controller     Cloud/Region   Version
default  aws-us-east-1  aws/us-east-1  2.1-rc1
App                     Version  Status       Scale  Charm              Store       Rev  OS      Notes
cuda                             active           2  cuda               local         2  ubuntu  
easyrsa                 3.0.1    active           1  easyrsa            jujucharms    6  ubuntu  
etcd                    2.2.5    active           1  etcd               jujucharms   23  ubuntu  
flannel                 0.7.0    active           6  flannel            jujucharms   10  ubuntu  
kubernetes-master       1.5.2    active           1  kubernetes-master  jujucharms   11  ubuntu  exposed
kubernetes-worker-cpu   1.5.2    active           3  kubernetes-worker  jujucharms   13  ubuntu  exposed
kubernetes-worker-gpu   1.5.2    active           1  kubernetes-worker  jujucharms   13  ubuntu  exposed
kubernetes-worker-gpu8  1.5.2    active           1  kubernetes-worker  jujucharms   13  ubuntu  exposed
Unit                       Workload     Agent      Machine  Public address  Ports           Message
easyrsa/0*                 active       idle       0/lxd/0  10.0.0.122                      Certificate Authority connected.
etcd/0*                    active       idle       0        54.242.44.224   2379/tcp        Healthy with 1 known peers.
kubernetes-master/0*       active       idle       0        54.242.44.224   6443/tcp        Kubernetes master running.
  flannel/0*               active       idle                54.242.44.224                   Flannel subnet 10.1.76.1/24
kubernetes-worker-cpu/0    active       idle       4        52.86.161.22    80/tcp,443/tcp  Kubernetes worker running.
  flannel/4                active       idle                52.86.161.22                    Flannel subnet 10.1.79.1/24
kubernetes-worker-cpu/1*   active       idle       5        52.70.5.49      80/tcp,443/tcp  Kubernetes worker running.
  flannel/2                active       idle                52.70.5.49                      Flannel subnet 10.1.63.1/24
kubernetes-worker-cpu/2    active       idle       6        174.129.164.95  80/tcp,443/tcp  Kubernetes worker running.
  flannel/3                active       idle                174.129.164.95                  Flannel subnet 10.1.22.1/24
kubernetes-worker-gpu8/0*  active       idle       3        52.90.163.167   80/tcp,443/tcp  Kubernetes worker running.
  cuda/1                   active       idle                52.90.163.167                   CUDA installed and available
  flannel/5                active       idle                52.90.163.167                   Flannel subnet 10.1.35.1/24
kubernetes-worker-gpu/0*   active       idle       1        52.90.29.98     80/tcp,443/tcp  Kubernetes worker running.
  cuda/0*                  active       idle                52.90.29.98                     CUDA installed and available
  flannel/1                active       idle                52.90.29.98                     Flannel subnet 10.1.58.1/24
Machine  State    DNS             Inst id              Series  AZ
0        started  54.242.44.224   i-09ea4f951f651687f  xenial  us-east-1a
0/lxd/0  started  10.0.0.122      juju-65a910-0-lxd-0  xenial  
1        started  52.90.29.98     i-03c3e35c2e8595491  xenial  us-east-1c
3        started  52.90.163.167   i-0ca0716985645d3f2  xenial  us-east-1d
4        started  52.86.161.22    i-02de3aa8efcd52366  xenial  us-east-1e
5        started  52.70.5.49      i-092ac5367e31188bb  xenial  us-east-1a
6        started  174.129.164.95  i-0a0718343068a5c94  xenial  us-east-1c
Relation      Provides                Consumes                Type
juju-info     cuda                    kubernetes-worker-gpu   regular
juju-info     cuda                    kubernetes-worker-gpu8  regular
certificates  easyrsa                 etcd                    regular
certificates  easyrsa                 kubernetes-master       regular
certificates  easyrsa                 kubernetes-worker-cpu   regular
certificates  easyrsa                 kubernetes-worker-gpu   regular
certificates  easyrsa                 kubernetes-worker-gpu8  regular
cluster       etcd                    etcd                    peer
etcd          etcd                    flannel                 regular
etcd          etcd                    kubernetes-master       regular
cni           flannel                 kubernetes-master       regular
cni           flannel                 kubernetes-worker-cpu   regular
cni           flannel                 kubernetes-worker-gpu   regular
cni           flannel                 kubernetes-worker-gpu8  regular
cni           kubernetes-master       flannel                 subordinate
kube-dns      kubernetes-master       kubernetes-worker-cpu   regular
kube-dns      kubernetes-master       kubernetes-worker-gpu   regular
kube-dns      kubernetes-master       kubernetes-worker-gpu8  regular
cni           kubernetes-worker-cpu   flannel                 subordinate
juju-info     kubernetes-worker-gpu   cuda                    subordinate
cni           kubernetes-worker-gpu   flannel                 subordinate
juju-info     kubernetes-worker-gpu8  cuda                    subordinate
cni           kubernetes-worker-gpu8  flannel                 subordinate

Let us see what nvidia-smi gives us:

juju ssh kubernetes-worker-gpu/0 sudo nvidia-smi
Tue Feb 14 13:28:42 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   33C    P0    81W / 149W |      0MiB / 11439MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

On the more powerful 8xlarge,

juju ssh kubernetes-worker-gpu8/0 sudo nvidia-smi
Tue Feb 14 13:59:24 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:00:17.0     Off |                    0 |
| N/A   41C    P8    31W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:00:18.0     Off |                    0 |
| N/A   36C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:00:19.0     Off |                    0 |
| N/A   44C    P0    57W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:00:1A.0     Off |                    0 |
| N/A   38C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           On   | 0000:00:1B.0     Off |                    0 |
| N/A   43C    P0    57W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           On   | 0000:00:1C.0     Off |                    0 |
| N/A   38C    P0    69W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           On   | 0000:00:1D.0     Off |                    0 |
| N/A   44C    P0    58W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           On   | 0000:00:1E.0     Off |                    0 |
| N/A   38C    P0    71W / 149W |      0MiB / 11439MiB |     39%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Aaaand yes!! We have our 8 GPUs as expected so 8x 12GB = 96GB Video RAM!

At this stage, we only have them enabled on the hosts. Now let us add GPU support in Kubernetes.

Adding GPU support in Kubernetes

By default, CDK will not activate GPUs when starting the API server and the Kubelets. We need to do that manually (for now).

Master update

On the master node, update /etc/default/kube-apiserver to add:

# Security Context
KUBE_ALLOW_PRIV="--allow-privileged=true"
before restarting the API Server. This can be done programmatically with:
juju show-status kubernetes-master --format json | \
    jq --raw-output '.applications."kubernetes-master".units | keys[]' | \
    xargs -I UNIT juju ssh UNIT "echo -e '\n# Security Context \nKUBE_ALLOW_PRIV=\"--allow-privileged=true\"' | sudo tee -a /etc/default/kube-apiserver && sudo systemctl restart kube-apiserver.service"

So now the Kube API will accept requests to run privileged containers, which are required for GPU workloads.

Worker nodes

On every worker, /etc/default/kubelet to add the GPU tag, so it looks like:

# Security Context
KUBE_ALLOW_PRIV="--allow-privileged=true"
# Add your own!
KUBELET_ARGS="--experimental-nvidia-gpus=1 --require-kubeconfig --kubeconfig=/srv/kubernetes/config --cluster-dns=10.1.0.10 --cluster-domain=cluster.local"

before restarting the service.

This can be done with

for WORKER_TYPE in gpu gpu8
do
    juju show-status kubernetes-worker-${WORKER_TYPE} --format json | \
        jq --raw-output '.applications."kubernetes-worker-'${WORKER_TYPE}'".units | keys[]' | \
        xargs -I UNIT juju ssh UNIT "echo -e '\n# Security Context \nKUBE_ALLOW_PRIV=\"--allow-privileged=true\"' | sudo tee -a /etc/default/kubelet"
juju show-status kubernetes-worker-${WORKER_TYPE} --format json | \
    jq --raw-output '.applications."kubernetes-worker-'${WORKER_TYPE}'".units | keys[]' | \
    xargs -I UNIT juju ssh UNIT "sudo sed -i 's/KUBELET_ARGS=\"/KUBELET_ARGS=\"--experimental-nvidia-gpus=1\ /' /etc/default/kubelet && sudo systemctl restart kubelet.service"

done

Testing our setup

Now we want to know if the cluster actually has GPU enabled. To validate, run a job with an nvidia-smi pod:

kubectl create -f src/nvidia-smi.yaml
Then wait a little bit and run the log command:
kubectl logs $(kubectl get pods -l name=nvidia-smi -o=name -a)
Tue Feb 14 14:14:57 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:17.0     Off |                    0 |
| N/A   47C    P0    56W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 0000:00:18.0     Off |                    0 |
| N/A   39C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 0000:00:19.0     Off |                    0 |
| N/A   48C    P0    57W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 0000:00:1A.0     Off |                    0 |
| N/A   41C    P0    70W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   47C    P0    58W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   40C    P0    69W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   48C    P0    59W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   41C    P0    72W / 149W |      0MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Ẁhat is intersting here is that the pod sees all the cards, even if we only shared the /dev/nvidia0 char device. At runtime, we would have problems.

If you want to run multi GPU containers, you need to share all char devices like we do in the second yaml file (nvidia-smi-8.yaml)

Conclusion

We reached the first milestone of our 3 part journey: the cluster is up & running, GPUs are activated, and Kubernetes will now welcome GPU workloads.

If you are a data scientist or running Kubernetes workloads that could benefit of GPUs, this already gives you an elegant and very fast way of managing your setups. But usually in this context, you also need to have storage available between the instances, whether it is to share the dataset or to exchange results.

Kubernetes offers many options to connect storage. In the second part of the blog, we will see how to automate adding EFS storage to our instances, then put it to good use with some datasets!

In the meantime, feel free to contact me if you have a specific use case in the cloud for this to discuss operational details. I would be happy to help you setup you own GPU cluster and get you started for the science!

Tearing down

Whenever you feel like it, you can tear down this cluster. These instances can be pricey, hence powering them down when you do not use them is not a bad idea.

juju kill-controller aws/us-east-1

This will ask for confirmation then destroy everything… But now, you are just a few commands and a coffee away from rebuilding it, so that is not a problem.

Original article

About the author

Samuel's photo

Samuel Cozannet is a Strategic Program Manager at Canonical, the company behind Ubuntu. He has a passion for DevOps and believes technology can make the world a better place. In his free time you will find Samuel hanging out in Barcelona or maybe just happily hacking his new Ubuntu Snappy devices.

More articles by Samuel

Posted in: