Kubernetes & long-running batch Jobs

This is a follow-up to the first part of this series, KEDA & Windows, where I wrote about using KEDA to schedule jobs based on work placed into an Azure Storage queue.

There are some nuances to running long-running jobs on Kubernetes. If a job runs for a long time, you don’t want the horizontal pod autoscaler or the cluster autoscaler to terminate the job’s pod as part of a rebalancing or scale-down operation. To prevent this, there are a few configuration settings you can adjust. In this post, I’ll describe two of them. The first (a Pod Disruption Budget) is the most general, but requires the most configuration. The second (the cluster autoscaler’s safe-to-evict annotation) is simpler, but only protects against cluster-autoscaler evictions.

Setting Pod Disruption Budget

The first way to keep your pods from being terminated is to set a Pod Disruption Budget (PDB). By setting the PDB’s minAvailable to the maximum number of job pods you expect to run, you prevent any voluntary eviction of those pods, including evictions triggered by the cluster autoscaler during scale-down. This is a bit confusing, so let me walk through an example:

First, recall the ScaledJob manifest from the previous blog post. I’ve added a label app: winworker to the pod template so that the PDB can select it. Additionally, I’ve set activeDeadlineSeconds to 2400 seconds (30 minutes for my job plus 10 minutes of extra ‘buffer’ time). If a job runs longer than this, Kubernetes will assume something is wrong and terminate it.

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: azure-queue-scaledobject-jobs-win
  namespace: default
spec:
  pollingInterval: 30
  maxReplicaCount: 50
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 2400  # set to expected max runtime + some buffer
    backoffLimit: 6
    template:
      metadata:
        labels:
          app: winworker
      spec:
        nodeSelector:
          kubernetes.io/os: windows
        containers:
        - name: consumer-job
          image: $ACR.azurecr.io/queue-consumer-windows
          resources:
            requests:
              cpu: 100m
              memory: 2000Mi # intentionally set high in order to trigger cluster autoscaler
            limits:
              cpu: 100m
              memory: 2000Mi
          env:
          - name: AzureWebJobsStorage
            valueFrom:
              secretKeyRef:
                name: secrets
                key: AzureWebJobsStorage
          - name: QUEUE_NAME
            value: keda-queue
  triggers:
  - type: azure-queue
    metadata:
      queueName: keda-queue
      queueLength: '1'
      connectionFromEnv: AzureWebJobsStorage
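
With the ScaledJob in place, you can apply it and watch KEDA create a Job for each message in the queue. A quick sketch (scaledjob-win.yaml is just the file name I’m assuming you saved the manifest as):

# Apply the ScaledJob, then watch the Jobs that KEDA creates as messages arrive
kubectl apply -f scaledjob-win.yaml
kubectl get scaledjob azure-queue-scaledobject-jobs-win
kubectl get jobs -w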

Next, I’ll specify the PDB. Notice that I set minAvailable to 50, which is the same number as maxReplicaCount in the ScaledJob:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ww-pdb
spec:
  minAvailable: 50
  selector:
    matchLabels:
      app: winworker

And voilà! Your jobs can now run undisturbed for up to activeDeadlineSeconds.
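
If you want to confirm the budget is in place, a couple of kubectl queries will do (a minimal check; pdb is the built-in short name for PodDisruptionBudget):

# Show the PDB and how many voluntary disruptions are currently allowed
kubectl get pdb ww-pdb
kubectl describe pdb ww-pdb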

Preventing the CA from evicting pods

The Pod Disruption Budget is a great general-purpose solution that works with any workload: jobs, deployments, etc. However, if all you want to prevent is pods being evicted when a node is drained (e.g., to stop the cluster autoscaler from scaling down a node that is running a long-running job), there is a simpler method you can use.

To achieve this, simply set the following annotation on your pod: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"

Specifically:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: azure-queue-scaledobject-jobs-win
  namespace: default
spec:
  pollingInterval: 30
  maxReplicaCount: 50
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTargetRef:
    parallelism: 1
    completions: 1
    activeDeadlineSeconds: 2400  # set to expected max runtime + some buffer
    backoffLimit: 6
    template:
      metadata:
        annotations:
          "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
        labels:
          app: winworker
      spec:
        nodeSelector:
          kubernetes.io/os: windows
        containers:
        - name: consumer-job
          image: $ACR.azurecr.io/queue-consumer-windows
          resources:
            requests:
              cpu: 100m
              memory: 2000Mi # intentionally set high in order to trigger cluster autoscaler
            limits:
              cpu: 100m
              memory: 2000Mi
          env:
          - name: AzureWebJobsStorage
            valueFrom:
              secretKeyRef:
                name: secrets
                key: AzureWebJobsStorage
          - name: QUEUE_NAME
            value: keda-queue
  triggers:
  - type: azure-queue
    metadata:
      queueName: keda-queue
      queueLength: '1'
      connectionFromEnv: AzureWebJobsStorage

And that’s it: no PDB or other configuration needed. (Do keep activeDeadlineSeconds in mind, though; the job will still be terminated once it exceeds that deadline, regardless of the annotation.)
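
If you want to double-check that the annotation actually landed on the running pods, something like this works (assuming your shell has grep; on Windows you could use findstr instead):

# Look for the safe-to-evict annotation on the consumer pods
kubectl get pods -l app=winworker -o yaml | grep safe-to-evict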