Kubernetes Cluster Autoscaler n+1 nodes
This is a follow-up to the first two parts of this series, KEDA & Windows and Kubernetes & Long Running Batch Jobs, where I talk about various ways to autoscale jobs, as well as how to speed up node scale-out by using deallocate mode.
This post covers another technique for improving node scale-out time: an “n + 1” approach, i.e., always having an extra node ready and waiting for work (sometimes called a surge node).
By always having an extra node around, you can quickly scale pods onto it without waiting for it to boot up and initialize.
Of course, there is a trade-off for this, namely cost: you’re going to be paying for an extra node to sit there and “wait” for work. However, if scale-out time is important to you, this could be a good tool in your toolkit.
Let’s walk through how to achieve this: (Note: I’m basing much of this post on the excellent guidance available at https://pixelrobots.co.uk/2021/03/aks-how-to-over-provision-node-pools/)
Bootstrap Cluster
Let’s start by creating a cluster:
RG=nplusonetest
LOCATION=westus2
CLUSTER_NAME="nplusone"
WORKPOOL_NAME="workpool"
az group create -l $LOCATION -n $RG
az aks create \
-g $RG \
-n $CLUSTER_NAME \
--node-count 1 \
--node-vm-size Standard_DS3_v2 \
--generate-ssh-keys \
--node-osdisk-type Managed \
--network-plugin azure
az aks nodepool add --name $WORKPOOL_NAME \
-g $RG --cluster-name $CLUSTER_NAME \
--mode user \
--enable-cluster-autoscaler \
--min-count 0 \
--max-count 10 \
--node-count 1 \
--node-osdisk-type Managed \
--scale-down-mode Deallocate \
--node-vm-size Standard_DS3_v2
az aks get-credentials -g $RG -n $CLUSTER_NAME
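With the credentials in place, it's worth a quick sanity check that both the system pool and the workpool node are registered. (A minimal check; the agentpool label is what AKS stamps on its nodes, and your node names will differ.)
# list nodes along with the node pool each one belongs to
kubectl get nodes -L agentpool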
Create Priority Class
Next, we’re going to create a new priority class called overprovisioning
and give it a priority of -1.
We’re also going to create a deployment that uses this priority class. Its resource requests need to be large enough that its pod can only be scheduled onto an otherwise empty node.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Priority class used by overprovisioning."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      nodeSelector:
        agentpool: workpool
      priorityClassName: overprovisioning
      containers:
        - name: reserve-resources
          image: k8s.gcr.io/pause
          resources:
            requests:
              cpu: "200m"
              memory: "10Gi" # set this appropriately for your node's SKU
kubectl create namespace overprovisioning
kubectl apply -f overprovision.yaml
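At this point, the overprovisioning pod should be the only workload sitting on one of the workpool nodes. A quick way to confirm (assuming the manifest above was saved as overprovision.yaml):
# confirm the placeholder pod is Running and note which node it landed on
kubectl get pods -n overprovisioning -o wide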
Sample application
Now let’s deploy a sample application:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        agentpool: workpool
      containers:
        - image: nginx
          name: nginx
          resources:
            requests:
              memory: 10Gi # this is set abnormally high so that we end up with one pod per node
kubectl apply -f nginx.yaml
You’ll see that when you deploy this sample application, the nginx pod lands on the first node in the node pool; because the overprovisioning pod has a lower priority, it gets preempted and rescheduled onto a new node that the cluster autoscaler brings up.
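You can watch this happen as the pods get scheduled (pod and node names below will vary in your cluster):
# watch the nginx pod take over the warm node and the overprovisioning pod get rescheduled
kubectl get pods --all-namespaces -o wide --watch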
And if you scale out the nginx deployment via kubectl scale deployment nginx --replicas=2
, you’ll see the new nginx pod run on the 2nd node, and the overprovisioning pod forced to a 3rd node.
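The same pattern repeats every time you scale: the new workload pod takes over the warm node, and the displaced overprovisioning pod triggers the autoscaler to bring up the next spare. One way to observe it end to end (a sketch; node names and timing will vary):
# scale the sample app, then watch a fresh workpool node appear for the displaced overprovisioning pod
kubectl scale deployment nginx --replicas=2
kubectl get nodes -L agentpool --watch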