Kubernetes Cluster Autoscaler n+1 nodes

This is a follow-up to the first two parts of this series, KEDA & Windows and Kubernetes & Long Running Batch Jobs, where I talk about various ways to autoscale jobs as well as how to speed up node scale-out by using deallocate mode.

This post covers another technique for improving node scale-out time: an “n + 1” approach, i.e., always having an extra node (sometimes called a surge node) ready and waiting for work. Because that extra node is already booted and initialized, pods can be scheduled onto it immediately.
Of course, there is a trade-off for this, namely cost: you're paying for an extra node to sit there and “wait” for work. However, if scale-out time is important to you, this can be a good tool in your toolkit.

Let’s walk through how to achieve this: (Note: I’m basing much of this post on the excellent guidance available at https://pixelrobots.co.uk/2021/03/aks-how-to-over-provision-node-pools/)

Bootstrap Cluster

Let’s start by creating a cluster:

RG=nplusonetest
LOCATION=westus2
CLUSTER_NAME="nplusone"
WORKPOOL_NAME="workpool"

az group create -l $LOCATION -n $RG

az aks create \
    -g $RG \
    -n $CLUSTER_NAME \
    --node-count 1 \
    --node-vm-size Standard_DS3_v2 \
    --generate-ssh-keys \
    --node-osdisk-type Managed \
    --network-plugin azure

az aks nodepool add --name $WORKPOOL_NAME \
    -g $RG --cluster-name $CLUSTER_NAME \
    --mode user \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 10 \
    --node-count 1 \
    --node-osdisk-type Managed \
    --scale-down-mode Deallocate \
    --node-vm-size Standard_DS3_v2

az aks get-credentials -g $RG -n $CLUSTER_NAME
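
At this point you should have two node pools: the default system pool and the workpool. If you want to confirm both exist and that the workpool node has registered before moving on, you can list the node pools and the nodes with their agentpool labels (node names will differ in your cluster):

az aks nodepool list -g $RG --cluster-name $CLUSTER_NAME -o table
kubectl get nodes -L agentpool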

Create Priority Class

Next, we're going to create a new priority class called overprovisioning and give it a priority of -1.

We're also going to create a deployment that uses this priority class. This deployment's resource request needs to be large enough that its pod can only be scheduled onto an otherwise empty node.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Priority class used by overprovisioning."

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: overprovisioning
spec:
  replicas: 1
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      nodeSelector:
        agentpool: workpool
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: "200m"
            memory: "10Gi"  # set this appropriately for your node's sku

Save this manifest as overprovision.yaml, then create the namespace and apply it:

kubectl create namespace overprovisioning
kubectl apply -f overprovision.yaml
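
Once the apply finishes, the overprovisioning pod should end up running on the workpool node (this may take a minute or two if the autoscaler has to allocate a node). As a quick check, you can see which node it landed on:

kubectl get pods -n overprovisioning -o wide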

Sample application

Now let's deploy a sample application:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        agentpool: workpool
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            memory: 10Gi  # this is set abnormally high so that we end up with one pod per node

Save this as nginx.yaml and apply it:

kubectl apply -f nginx.yaml

You'll see that when you deploy this sample application, it gets scheduled onto the first node in the node pool, forcing the overprovisioning pod off of that node. Because the nginx pod runs at the default priority of 0, which is higher than the overprovisioning pod's priority of -1, the scheduler preempts the overprovisioning pod; it goes Pending, which triggers the cluster autoscaler to bring up a new node for it.
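
To see this for yourself, list the pods along with the nodes they're scheduled on; the nginx pod should be on the original workpool node, and the overprovisioning pod should show as Pending until the autoscaler brings up the replacement node (pod and node names will differ in your cluster):

kubectl get pods -o wide
kubectl get pods -n overprovisioning -o wide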

And if you scale out the nginx deployment via kubectl scale deployment nginx --replicas=2, you'll see the new nginx pod run on the 2nd node, and the overprovisioning pod forced to a 3rd node.
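
As a final sanity check (a rough sketch; node and pod names will vary), you can confirm that the workpool now has three nodes, that each nginx pod sits on its own node, and that the overprovisioning pod is on the newest one:

kubectl scale deployment nginx --replicas=2
kubectl get nodes -l agentpool=workpool
kubectl get pods -l app=nginx -o wide
kubectl get pods -n overprovisioning -o wide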