Add support for GPUs on AWS #604

Merged: 3 commits, May 26, 2021
71 changes: 58 additions & 13 deletions docs/source/04_how_to_guides/7_qhub_gpu.md
@@ -12,19 +12,19 @@ By default, the quota to spin up GPUs on GCP is 0. Make sure you have asked
GCP Support to increase the GPU quota for your billing account to the
number of GPUs you need access to.

See the [GCP prerequisites here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#requirements).

Here are the changes needed in your `qhub-config.yml` file to get GPUs working with GCP:

#### 1. Add GPU node group

Add a node group for the GPU instance in the `node_groups` section of the `google_cloud_platform` section,
and under the `guest_accelerators` section add the name of the GPU. A comprehensive list of GPU
types can be found in the official GCP docs: https://cloud.google.com/compute/docs/gpus

An example of getting GPUs on GCP:

```yaml
google_cloud_platform:
project: project-name
region: us-central1
@@ -52,15 +52,14 @@ except A100 GPUs, which are only supported on *[a2 machine types](https://cloud.

- If you are not using the GCP provider in QHub but are still deploying on GCP (say, on
  an existing GCP cluster), you will need to manually install NVIDIA drivers on the
  cluster - see [documentation here](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers).

- See [general limitations of GPUs on Google Kubernetes Engine](https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#limitations).
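
Since the GCP node-group example above is partially collapsed in this diff, here is a minimal
self-contained sketch of what a GPU node group with `guest_accelerators` might look like. The
machine type and accelerator count are illustrative assumptions, not taken from the original file;
the node group name matches the `gpu-tesla-t4` selector used in the JupyterLab profile below:

```yaml
google_cloud_platform:
  project: project-name
  region: us-central1
  node_groups:
    # ... your existing general/user/worker node groups ...
    gpu-tesla-t4:
      instance: n1-standard-8        # assumed machine type; most N1 types can attach GPUs
      min_nodes: 0
      max_nodes: 2
      guest_accelerators:
        - name: nvidia-tesla-t4      # GPU type, see the GCP GPU list linked above
          count: 1
```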


#### 2. Add GPU instance in the JupyterLab profiles

```yaml
profiles:
jupyterlab:
# ....
@@ -71,19 +70,65 @@ profiles:
cpu_guarantee: 7.25
mem_limit: 32G
mem_guarantee: 24G
image: quansight/qhub-jupyterlab:v||QHUB_VERSION||
extra_resource_limits:
nvidia.com/gpu: 1
node_selector:
"cloud.google.com/gke-nodepool": "gpu-tesla-t4"
```

### Amazon Web Services

Here are the changes needed in your `qhub-config.yml` file to get GPUs working with AWS:

#### 1. Add GPU node group

```yaml
amazon_web_services:
region: us-west-2
# ...
gpu-g4:
instance: g4dn.2xlarge # NVIDIA Tesla T4
min_nodes: 1
max_nodes: 5
gpu: true # This marks the given instance type as GPU enabled.
```
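
The `g4dn` family carries NVIDIA T4 GPUs; other EC2 GPU families can be added the same way.
A hypothetical second node group is sketched below - the instance choice and node limits are
illustrative only, and whether you can launch them depends on your AWS service quotas:

```yaml
amazon_web_services:
  # ...
  gpu-p3:
    instance: p3.2xlarge   # 1x NVIDIA Tesla V100
    min_nodes: 0
    max_nodes: 2
    gpu: true              # required so the GPU-enabled EKS AMI is used for these nodes
```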

#### 2. Add GPU instance in the JupyterLab profiles

```yaml
profiles:
jupyterlab:
# ....
- display_name: GPU Instance
description: Stable environment with 8 CPUs / 32 GB RAM and 1 NVIDIA Tesla T4
kubespawner_override:
cpu_limit: 8
cpu_guarantee: 7.25
mem_limit: 32G
mem_guarantee: 24G
image: quansight/qhub-jupyterlab:v||QHUB_VERSION||
extra_resource_limits:
nvidia.com/gpu: 1
node_selector:
"eks.amazonaws.com/nodegroup": "gpu-g4"
```

Notes:

- If you are not using the AWS provider in QHub but are still deploying on AWS (say, on
  an existing AWS cluster), you will need to manually install NVIDIA drivers on the
  cluster. See [documentation here](https://github.com/NVIDIA/k8s-device-plugin).

### DigitalOcean

DigitalOcean does not support GPUs at the time of writing.

### Azure

Azure does support GPUs in Kubernetes, but QHub does not currently have official support for them.

## Create conda environment to take advantage of GPUs

First you need to know the version of the NVIDIA driver being
used. This can easily be checked via the command `nvidia-smi`.
@@ -101,7 +146,7 @@ Thu May 20 18:05:14 2021
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 32C P8 29W / 149W | 0MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
@@ -118,7 +163,7 @@ including minor version. Also, in the near future CUDA should have
better [ABI compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html).

Below is an example GPU environment:

```yaml
name: gpu-environment
@@ -132,6 +177,6 @@ dependencies:
- numba
```
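
The environment block above is partially collapsed in this diff. A fuller sketch of a GPU
environment is shown below, assuming a driver that supports CUDA 11.0 - the exact `cudatoolkit`
pin and the extra packages are illustrative, so match them to the driver version reported by
`nvidia-smi`:

```yaml
name: gpu-environment
channels:
  - conda-forge
dependencies:
  - python=3.8
  - cudatoolkit=11.0   # must be a version supported by the installed NVIDIA driver
  - numba
  - cupy
```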

We are working hard to make the GPU experience on QHub as streamlined
as possible. There are many small gotchas when working with GPUs and
getting all the drivers installed properly.
1 change: 1 addition & 0 deletions qhub/schema.py
@@ -158,6 +158,7 @@ class NodeGroup(Base):
instance: str
min_nodes: int
max_nodes: int
gpu: typing.Optional[bool] = False

class Config:
extra = "allow"
9 changes: 6 additions & 3 deletions qhub/template/cookiecutter.json
@@ -122,17 +122,20 @@
"general": {
"instance": null,
"min_nodes": null,
"max_nodes": null
"max_nodes": null,
"gpu": null
},
"user": {
"instance": null,
"min_nodes": null,
"max_nodes": null
"max_nodes": null,
"gpu": null
},
"worker": {
"instance": null,
"min_nodes": null,
"max_nodes": null
"max_nodes": null,
"gpu": null
}
}
},
@@ -86,6 +86,7 @@ module "kubernetes" {
min_size = {{ nodegroup_config.min_nodes }}
desired_size = {{ nodegroup_config.min_nodes }}
max_size = {{ nodegroup_config.max_nodes }}
gpu = {{ "true" if nodegroup_config.gpu is defined and nodegroup_config.gpu else "false"}}
},
{% endfor %}
]
@@ -9,4 +9,6 @@ locals {
"arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy",
aws_iam_policy.worker_autoscaling.arn
], var.node_group_additional_policies)

# Node groups flagged with gpu = true; used to target the NVIDIA device plugin DaemonSet
gpu_node_group_names = [for node_group in var.node_groups : node_group.name if node_group.gpu == true]
}
@@ -24,6 +24,7 @@ resource "aws_eks_node_group" "main" {
subnet_ids = var.cluster_subnets

instance_types = [var.node_groups[count.index].instance_type]
# GPU node groups need the GPU-enabled Amazon Linux 2 EKS-optimized AMI
ami_type = var.node_groups[count.index].gpu == true ? "AL2_x86_64_GPU" : "AL2_x86_64"

scaling_config {
min_size = var.node_groups[count.index].min_size
@@ -0,0 +1,79 @@
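# Installs the NVIDIA Kubernetes device plugin as a DaemonSet; only created when at least one node group has gpu = true.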
resource "kubernetes_daemonset" "nvidia_installer" {
count = length(local.gpu_node_group_names) == 0 ? 0 : 1
metadata {
name = "nvidia-device-plugin-daemonset-1.12"
namespace = "kube-system"
}

spec {
selector {
match_labels = {
name = "nvidia-device-plugin-ds"
}
}

template {
metadata {
labels = {
name = "nvidia-device-plugin-ds"
}
}

spec {
affinity {
node_affinity {
required_during_scheduling_ignored_during_execution {
node_selector_term {
match_expressions {
key = "eks.amazonaws.com/nodegroup"
operator = "In"
values = local.gpu_node_group_names
}
}
}
}
}

volume {
name = "device-plugin"

host_path {
path = "/var/lib/kubelet/device-plugins"
}
}

container {
name = "nvidia-device-plugin-ctr"
image = "nvidia/k8s-device-plugin:1.11"

volume_mount {
name = "device-plugin"
mount_path = "/var/lib/kubelet/device-plugins"
}

security_context {
capabilities {
drop = ["ALL"]
}
}
}

toleration {
key = "CriticalAddonsOnly"
operator = "Exists"
}

toleration {
key = "nvidia.com/gpu"
operator = "Exists"
effect = "NoSchedule"
}
}
}

strategy {
type = "RollingUpdate"
}
}
}

@@ -36,6 +36,7 @@ variable "node_groups" {
type = list(object({
name = string
instance_type = string
gpu = bool
min_size = number
desired_size = number
max_size = number