Sheogorath's Blog

Using Kubernetes spare capacity for Pods

In Kubernetes you can do a lot of fun little things. One of them is playing with the cluster-autoscaler and creating workloads that only run when there is spare capacity left in your cluster.

In order to do that, you need a PriorityClass with a value lower than -10. The cluster-autoscaler doesn't trigger a scale-up when the pending Pod's priority is below its `--expendable-pods-priority-cutoff`, which defaults to -10.

This allows you to keep Pods in the Pending state in the background, just waiting for capacity to become available. As soon as there is capacity, they'll fill up space on the cluster, delaying a scale-down while getting some work done without causing additional scale-ups.

To make sure these Pods don't interfere with your existing workloads, use identical requests and limits to force them into the Guaranteed QoS class. This way they always "reserve" the space they can use for themselves.

Keep in mind that if you don't define any resources, these Pods will be scheduled on any node instead of sticking around in the background.

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: non-autoscaling
value: -20
globalDefault: false
description: "This priority class will not cause the autoscaler to trigger. Pods will only run on spare capacity."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: no-autoscale
  name: no-autoscale
spec:
  replicas: 3
  selector:
    matchLabels:
      app: no-autoscale
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: no-autoscale
    spec:
      priorityClassName: non-autoscaling
      containers:
      - image: registry.k8s.io/pause:3.10@sha256:ee6521f290b2168b6e0935a181d4cff9be1ac3f505666ef0e3c98fae8199917a
        name: pause
        resources: 
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "1Gi"

After deploying the manifests above, it'll look like this: with no spare capacity, they all just hang around Pending.

$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
no-autoscale-559d54dc6-9m2f7   0/1     Pending   0          75s
no-autoscale-559d54dc6-nvdkh   0/1     Pending   0          75s
no-autoscale-559d54dc6-vrwhm   0/1     Pending   0          75s

When scaling up Pods that do trigger the cluster-autoscaler, due to using the default Pod priority of 0, the no-autoscale Pods keep pending while the new ones are scheduled as expected:

# autoscaling.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: autoscale
  name: autoscale
spec:
  replicas: 3
  selector:
    matchLabels:
      app: autoscale
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: autoscale
    spec:
      containers:
      - image: registry.k8s.io/pause:3.10@sha256:ee6521f290b2168b6e0935a181d4cff9be1ac3f505666ef0e3c98fae8199917a
        name: pause
        resources: 
          requests:
            cpu: "1"
            memory: "1Gi"

$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
autoscale-59945cd576-9nhgh     1/1     Running   0          30s
autoscale-59945cd576-b5l9f     1/1     Running   0          30s
autoscale-59945cd576-vc2nf     1/1     Running   0          30s
no-autoscale-559d54dc6-bpv49   0/1     Pending   0          110s
no-autoscale-559d54dc6-gdlfz   0/1     Pending   0          110s
no-autoscale-559d54dc6-zbptd   0/1     Pending   0          110s

But after scaling the autoscale deployment down to 2, one no-autoscale Pod will be scheduled and executed, as the removed autoscale Pod frees up capacity:

$ kubectl scale deployment autoscale --replicas 2
deployment.apps/autoscale scaled
$ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
autoscale-59945cd576-9nhgh     1/1     Running   0          4m8s
autoscale-59945cd576-b5l9f     1/1     Running   0          4m8s
no-autoscale-559d54dc6-bpv49   0/1     Pending   0          4m8s
no-autoscale-559d54dc6-gdlfz   0/1     Pending   0          4m8s
no-autoscale-559d54dc6-zbptd   1/1     Running   0          4m8s

This Pod will run now, but it also prevents a scale-down, so usually you will want to set spec.activeDeadlineSeconds on these spare-capacity "no-autoscale" Pods. This ensures that they are terminated at some point. Also, you usually don't want to use a Deployment for this, but rather something like a Job.
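A Job version of the workload above could look like the following sketch. The name and the one-hour deadline are arbitrary choices for illustration; it reuses the non-autoscaling priority class from the first manifest:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: no-autoscale-job
spec:
  # Terminate the Job's Pods after one hour at the latest,
  # so they don't block a scale-down forever.
  activeDeadlineSeconds: 3600
  template:
    spec:
      priorityClassName: non-autoscaling
      restartPolicy: Never
      containers:
      - image: registry.k8s.io/pause:3.10@sha256:ee6521f290b2168b6e0935a181d4cff9be1ac3f505666ef0e3c98fae8199917a
        name: pause
        resources:
          requests:
            cpu: "1"
            memory: "1Gi"
          limits:
            cpu: "1"
            memory: "1Gi"
```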

Another thing to consider is that the low Pod priority makes these Pods targets for preemption. This means that these "spare capacity" Pods will be killed immediately when their resources are needed for a regular Pod with default priority.

$ kubectl scale deployment autoscale --replicas 3
deployment.apps/autoscale scaled
$ kubectl get pods
NAME                           READY   STATUS              RESTARTS   AGE
autoscale-59945cd576-9nhgh     1/1     Running             0          7m23s
autoscale-59945cd576-b5l9f     1/1     Running             0          7m23s
autoscale-59945cd576-lrv4x     0/1     ContainerCreating   0          2s
no-autoscale-559d54dc6-bpv49   0/1     Pending             0          7m23s
no-autoscale-559d54dc6-gdlfz   0/1     Pending             0          7m23s
no-autoscale-559d54dc6-xmqpg   0/1     Pending             0          2s

You could prevent this by using non-preempting Pod priorities (`preemptionPolicy: Never`) for your regular workloads.
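Note that `preemptionPolicy` applies to the Pod doing the preempting: to stop regular workloads from evicting the spare-capacity Pods, the regular workloads' own priority class needs `preemptionPolicy: Never`. A sketch, with an assumed class name:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-non-preempting
value: 0
# Pods with this class wait in the scheduling queue instead of
# preempting lower-priority Pods to make room for themselves.
preemptionPolicy: Never
globalDefault: false
description: "Default priority, but never preempts other Pods."
```

The trade-off: with `Never`, your regular workloads may themselves end up Pending until the autoscaler adds capacity, instead of instantly reclaiming the space from the spare-capacity Pods.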

Conclusion

With all this done, what can you actually do with it? I don't really know. Fiddling around with Pod priorities and the scheduler is fun, and I enjoy doing it.

And maybe you have workloads that benefit from sticking around Pending indefinitely until there is some spare capacity. Even if you don't autoscale your cluster, it might be useful to fiddle around with Pod priorities to help stabilise your workloads.

A lot is possible, make something fun out of it.