Add support for GPUs on AWS #604

Merged: 3 commits, May 26, 2021
71 changes: 58 additions & 13 deletions docs/source/04_how_to_guides/7_qhub_gpu.md
@@ -12,19 +12,19 @@ By default, the quota to spin up GPUs on GCP is 0. Make sure you have asked
GCP Support to increase the GPU quota for your billing account to the
number of GPUs you need access to.

See the [GCP prerequisites here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#requirements).

Here are the changes needed in your `qhub-config.yml` file to get GPUs working with GCP:

#### 1. Add GPU node group

Add a node group for the GPU instance in the `node_groups` section of the `google_cloud_platform` section,
and under the `guest_accelerators` section add the name of the GPU. A comprehensive list of GPU
types can be found in the official GCP docs: https://cloud.google.com/compute/docs/gpus

An example of getting GPUs on GCP:

```yaml
google_cloud_platform:
project: project-name
region: us-central1
@@ -52,15 +52,14 @@ except A100 GPUs, which are only supported on *[a2 machine types](https://cloud.

- If you are not using the GCP provider in QHub but are still deploying on GCP (say, on
  an existing GCP cluster), you will need to manually install NVIDIA drivers on the
  cluster - see [documentation here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers).

- See [general limitations of GPUs on Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#limitations).
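
Since the GCP node-group example above is partially collapsed in this diff, here is a minimal
self-contained sketch of what a GPU node group with `guest_accelerators` might look like. The
machine type and accelerator count are illustrative assumptions, not taken from the original file;
the node group name matches the `gpu-tesla-t4` selector used in the JupyterLab profile below:

```yaml
google_cloud_platform:
  project: project-name
  region: us-central1
  node_groups:
    # ... your existing general/user/worker node groups ...
    gpu-tesla-t4:
      instance: n1-standard-8        # assumed machine type; most N1 types can attach GPUs
      min_nodes: 0
      max_nodes: 2
      guest_accelerators:
        - name: nvidia-tesla-t4      # GPU type, see the GCP GPU list linked above
          count: 1
```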


#### 2. Add GPU instance in the JupyterLab profiles

```yaml
profiles:
jupyterlab:
# ....
@@ -71,19 +70,65 @@ profiles:
cpu_guarantee: 7.25
mem_limit: 32G
mem_guarantee: 24G
image: quansight/qhub-jupyterlab:v||QHUB_VERSION||
extra_resource_limits:
nvidia.com/gpu: 1
node_selector:
"cloud.google.com/gke-nodepool": "gpu-tesla-t4"
```

### Amazon Web Services

Here are the changes needed in your `qhub-config.yml` file to get GPUs working with AWS:

#### 1. Add GPU node group

```yaml
amazon_web_services:
region: us-west-2
# ...
gpu-g4:
instance: g4dn.2xlarge # NVIDIA Tesla T4
min_nodes: 1
max_nodes: 5
gpu: true # This marks the given instance type as GPU enabled.
```
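
The `g4dn` family carries NVIDIA T4 GPUs; other EC2 GPU families can be added the same way.
A hypothetical second node group is sketched below - the instance choice and node limits are
illustrative only, and whether you can launch them depends on your AWS service quotas:

```yaml
amazon_web_services:
  # ...
  gpu-p3:
    instance: p3.2xlarge   # 1x NVIDIA Tesla V100
    min_nodes: 0
    max_nodes: 2
    gpu: true              # required so the GPU-enabled EKS AMI is used for these nodes
```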

#### 2. Add GPU instance in the JupyterLab profiles

```yaml
profiles:
jupyterlab:
# ....
- display_name: GPU Instance
description: Stable environment with 8 CPUs / 32 GB RAM and 1 NVIDIA Tesla T4
kubespawner_override:
cpu_limit: 8
cpu_guarantee: 7.25
mem_limit: 32G
mem_guarantee: 24G
image: quansight/qhub-jupyterlab:v||QHUB_VERSION||
extra_resource_limits:
nvidia.com/gpu: 1
node_selector:
"eks.amazonaws.com/nodegroup": "gpu-g4"
```

Notes:

- If you are not using the AWS provider in QHub but are still deploying on AWS (say, on
  an existing AWS cluster), you will need to manually install NVIDIA drivers on the
  cluster. See [documentation here](https://github.com/NVIDIA/k8s-device-plugin).

### DigitalOcean

DigitalOcean does not support GPUs at the time of writing.

### Azure

Azure does support GPUs in Kubernetes, but QHub does not currently have official support for them.

## Create conda environment to take advantage of GPUs

First you need to know the version of the NVIDIA driver being
used. This can easily be checked via the command `nvidia-smi`.
@@ -101,7 +146,7 @@ Thu May 20 18:05:14 2021
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 32C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
@@ -118,7 +163,7 @@ including minor version. Also, in the near future CUDA should have
better [ABI compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html).

Below is an example GPU environment:

```yaml
name: gpu-environment
@@ -132,6 +177,6 @@ dependencies:
- numba
```
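
The environment block above is partially collapsed in this diff. A fuller sketch of a GPU
environment is shown below, assuming a driver that supports CUDA 11.0 - the exact `cudatoolkit`
pin and the extra packages are illustrative, so match them to the driver version reported by
`nvidia-smi`:

```yaml
name: gpu-environment
channels:
  - conda-forge
dependencies:
  - python=3.8
  - cudatoolkit=11.0   # must be a version supported by the installed NVIDIA driver
  - numba
  - cupy
```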

We are working hard to make the GPU experience on QHub as streamlined
as possible. There are many small gotchas when working with GPUs and
getting all the drivers installed properly.
1 change: 1 addition & 0 deletions qhub/schema.py
@@ -158,6 +158,7 @@ class NodeGroup(Base):
instance: str
min_nodes: int
max_nodes: int
gpu: typing.Optional[bool] = False

class Config:
extra = "allow"
9 changes: 6 additions & 3 deletions qhub/template/cookiecutter.json
@@ -122,17 +122,20 @@
"general": {
"instance": null,
"min_nodes": null,
"max_nodes": null
"max_nodes": null,
"gpu": null
},
"user": {
"instance": null,
"min_nodes": null,
"max_nodes": null
"max_nodes": null,
"gpu": null
},
"worker": {
"instance": null,
"min_nodes": null,
"max_nodes": null
"max_nodes": null,
"gpu": null
}
}
},
@@ -86,6 +86,7 @@ module "kubernetes" {
min_size = {{ nodegroup_config.min_nodes }}
desired_size = {{ nodegroup_config.min_nodes }}
max_size = {{ nodegroup_config.max_nodes }}
gpu = {{ "true" if nodegroup_config.gpu is defined and nodegroup_config.gpu else "false"}}
},
{% endfor %}
]
@@ -9,4 +9,6 @@ locals {
"arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
aws_iam_policy.worker_autoscaling.arn
], var.node_group_additional_policies)

# Node groups flagged with gpu = true; used to target the NVIDIA device plugin DaemonSet
gpu_node_group_names = [for node_group in var.node_groups : node_group.name if node_group.gpu == true]
}
@@ -24,6 +24,7 @@ resource "aws_eks_node_group" "main" {
subnet_ids = var.cluster_subnets

instance_types = [var.node_groups[count.index].instance_type]
# GPU node groups need the GPU-enabled Amazon Linux 2 EKS-optimized AMI
ami_type = var.node_groups[count.index].gpu == true ? "AL2_x86_64_GPU" : "AL2_x86_64"

scaling_config {
min_size = var.node_groups[count.index].min_size
@@ -0,0 +1,79 @@
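# Installs the NVIDIA Kubernetes device plugin as a DaemonSet; only created when at least one node group has gpu = true.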
resource "kubernetes_daemonset" "nvidia_installer" {
count = length(local.gpu_node_group_names) == 0 ? 0 : 1
metadata {
name = "nvidia-device-plugin-daemonset-1.12"
namespace = "kube-system"
}

spec {
selector {
match_labels = {
name = "nvidia-device-plugin-ds"
}
}

template {
metadata {
labels = {
name = "nvidia-device-plugin-ds"
}
}

spec {
affinity {
node_affinity {
required_during_scheduling_ignored_during_execution {
node_selector_term {
match_expressions {
key = "eks.amazonaws.com/nodegroup"
operator = "In"
values = local.gpu_node_group_names
}
}
}
}
}

volume {
name = "device-plugin"

host_path {
path = "/var/lib/kubelet/device-plugins"
}
}

container {
name = "nvidia-device-plugin-ctr"
image = "nvidia/k8s-device-plugin:1.11"

volume_mount {
name = "device-plugin"
mount_path = "/var/lib/kubelet/device-plugins"
}

security_context {
capabilities {
drop = ["ALL"]
}
}
}

toleration {
key = "CriticalAddonsOnly"
operator = "Exists"
}

toleration {
key = "nvidia.com/gpu"
operator = "Exists"
effect = "NoSchedule"
}
}
}

strategy {
type = "RollingUpdate"
}
}
}

@@ -36,6 +36,7 @@ variable "node_groups" {
type = list(object({
name = string
instance_type = string
gpu = bool
min_size = number
desired_size = number
max_size = number