Autoscaling with Keda

Klaus Hofrichter
7 min read · Jan 8, 2022

This article adds the Keda Autoscaler to a complete K3D Kubernetes environment for you to run your own experiments. It comes with a Grafana dashboard to show how autoscaling is doing its thing in real-time.


This series of articles covers a K3D-based Kubernetes setup with a lot of instrumentation. Building on what we have so far, we are now adding Keda for autoscaling of our NodeJS application in the local environment.

Update January 10, 2022: there is a second article about the same source code covering the use of health endpoints with NodeJS/Express.

Update February 13, 2022: There is a newer version of the software here with several updates.

What you need to bring, how to install everything

If you are using Windows, you can use the Windows Subsystem for Linux (WSL) to get a clean installation. A recent Windows 10 or 11 system will do fine, as will an Ubuntu-like Linux machine, ideally with 8 GB of RAM or more. A basic understanding of bash, Docker, Kubernetes, Grafana, and NodeJS is expected. For the full setup you will need a Slack account for receiving alert messages and a Grafana Cloud account, but both are optional and disabled by default.

All source code is available on GitHub at https://github.com/klaushofrichter/keda-scaling. It includes bash scripts and NodeJS code. As always, you should inspect code like this before you execute it on your machine to make sure that nothing bad happens.

The Windows Subsystem for Linux setup process is described in a previous article. You should do the same steps for this setup, but use the newer repository. There is one new configuration option in config.sh, enabled by default:

export KEDA_ENABLE="yes"  # or "no"

With KEDA_ENABLE set to yes, Keda is deployed as part of the start.sh script. You can also deploy Keda at any time later with ./keda-deploy.sh. We are using a Helm chart for Keda; you can see the details in the keda-deploy.sh script.
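
In case you prefer to install Keda by hand instead of through the script, a minimal Helm-based installation looks roughly like the sketch below. The release name and namespace are assumptions and may differ from what keda-deploy.sh actually uses:

$ helm repo add kedacore https://kedacore.github.io/charts
$ helm repo update
$ helm install keda kedacore/keda --namespace keda --create-namespace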

TL;DR

You may need to change config.sh to enable or disable what you are looking for, but you can also leave everything as it is in the repository to get started. You will need Docker Desktop and WSL on the Windows side (if you use Windows), then run ./setup.sh in a WSL terminal to install the necessary tools on the Linux side. Check out the related article for details. Then call ./start.sh and let it build the cluster, which takes about 10 minutes. After that, you can check out the Grafana dashboard to see scaling in action.
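
Condensed into commands, and assuming the default settings in config.sh, the quick start looks roughly like this:

$ git clone https://github.com/klaushofrichter/keda-scaling.git
$ cd keda-scaling
$ ./setup.sh    # installs the necessary tools on the Linux side (one-time)
$ ./start.sh    # builds the K3D cluster, takes about 10 minutes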

Dashboard showing scaling up and average CPU load

In this chart, we generated traffic for a few minutes with ./app-traffic 0 0.01, which produces ongoing API calls with a little bit of delay between calls. There is a preconfigured manifest in app-scaledobject.yaml.template that calls for scaling once 80% CPU load is reached relative to the application's CPU request. The specification is formatted according to a Custom Resource Definition that comes with Keda; see below:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: myapp-cpu-utilization
  namespace: myapp
spec:
  scaleTargetRef:
    name: myapp-deploy
  triggers:
    - type: cpu
      metadata:
        type: Utilization
        value: "80"
  minReplicaCount: 2
  maxReplicaCount: 20

This object specifies that we are scaling up when there is more than 80% CPU load compared to the application's CPU request. We are placing the ScaledObject definition in the namespace of the application (myapp) and targeting a specific deployment (myapp-deploy). The Utilization type needs to be used in order to work with a percentage instead of an absolute value. We are asking for a minimum of two replicas and a maximum of 20. More details about the parameters can be found in Keda's documentation.

Keda actually uses the Horizontal Pod Autoscaler behind the scenes. You can verify that by checking out the HPA:

$ kubectl get hpa -n myapp --watch

In the case above, it produces output like this during the watch period:

Output of kubectl get hpa -n myapp --watch
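
The exact values depend on your machine and the traffic you generate, but the columns follow the usual kubectl get hpa format, with Keda naming the HPA after the ScaledObject. An illustrative (made-up) sequence during scale-up could look like this:

NAME                             REFERENCE                 TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
keda-hpa-myapp-cpu-utilization   Deployment/myapp-deploy   10%/80%    2         20        2          10m
keda-hpa-myapp-cpu-utilization   Deployment/myapp-deploy   260%/80%   2         20        2          12m
keda-hpa-myapp-cpu-utilization   Deployment/myapp-deploy   130%/80%   2         20        4          13m
keda-hpa-myapp-cpu-utilization   Deployment/myapp-deploy   78%/80%    2         20        7          15m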

It is interesting to see how the pod scaling lags the CPU consumption: the HPA has a delay when creating new instances, causing the CPU load to spike to 260% before the new replicas reduce the average load. It shows even more delay when scaling down. That is why we see very low average CPU utilization once the load goes away but pods are still active. After a while, a step-by-step scale-down happens and average load levels come up again.

Lag of scaling down after load reduction

That certainly needs to be understood when you expect very sudden changes in load. The impact of this slower adaptation is that some quality-of-service reduction may happen when the load rises sharply, e.g. response times increase. After the load is gone, there are more instances than needed for a while, which in the real world potentially causes higher operating costs than necessary.
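
The scale-down lag can be tuned: Keda can pass HPA behavior settings through the ScaledObject's advanced section. The repository's manifest does not do this; the snippet below is only a sketch of how the spec could be extended, with example values, to shorten the default five-minute scale-down stabilization window:

spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 60   # example: wait one minute instead of the default 300 seconds
          policies:
            - type: Percent
              value: 50                    # remove at most 50% of the current replicas per period
              periodSeconds: 15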

You can also look at the HPA's events with the call below to see a bit more of what the HPA is doing:

$ kubectl describe hpa -n myapp
Part of the output of kubectl describe hpa -n myapp

Coming back to the dashboard shot from the beginning of the article: it can be observed that the application does not require a lot more memory when the load is applied, but the CPU use is quite correlated with the load. This type of behavior is application-dependent: in our case, the application is not doing much with memory and just consumes more CPU cycles when called.

Please note that your mileage may vary: you are likely using a different CPU, so you may need to adjust the parameters when creating traffic and/or change the resource requests in app.yaml.template. Also note that the HPA and the dashboard use different metrics sources, so there is a difference in the values and the timing, but in practice this difference does not change the principle of the scaling.

Why Keda?

As shown before, in order to scale with the CPU consumption, Keda uses the Kubernetes Horizontal Pod Autoscaler. You can use the HPA directly without Keda, so why add this extra component? The call below does the same trick:

$ ./keda-undeploy.sh    # to remove keda
$ kubectl autoscale deployment myapp-deploy -n myapp --cpu-percent=80 --min=2 --max=20

But Keda offers more options than CPU/memory-triggered scaling via the HPA by supporting many more "scalers". You can use Prometheus queries, InfluxDB queries, or other external sources to scale applications with the same infrastructure. Examples are available, e.g. here using Prometheus or here using RabbitMQ and other queues. For example, one could scale based on API response times instead of memory consumption, assuming that scaling up would shorten response times; a sketch of such a trigger follows below.
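
As an illustration, a ScaledObject trigger based on a Prometheus query could look roughly like this. It is not part of the repository; the server address, metric name, query, and threshold are placeholders that would need to match your Prometheus installation and the metrics your application exposes:

triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring.svc:9090   # placeholder Prometheus address
      metricName: http_request_duration_seconds                     # placeholder metric name
      query: avg(rate(http_request_duration_seconds_sum[2m]) / rate(http_request_duration_seconds_count[2m]))
      threshold: "0.5"   # example: scale up when the average response time exceeds 0.5 seconds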

So while Keda may seem like too much for a simple purpose, it provides more options and consistency when different triggers are used to control scaling.

Alerts

If you have Slack enabled, you will receive some alerts when scaling the application. The alerts are discussed in this article, and the definitions are in am-values.yaml.template, so you can easily take them out or silence them using the Alertmanager if they become annoying.

Also, there is a new Keda-specific alert in keda-values.yaml.template. Keda itself produces metrics, so if things went well, you should see Keda in the Prometheus target list at http://localhost:8080/prom/targets and the alert specifications at http://localhost:8080/prom/alerts. You can customize the alerts in the Keda Helm values file.

Where to go from here?

This setup is supposed to give you some tools to do your own experiments. If you change the app-scaledobject.yaml.template file, you will need to install it with this call:

$ export APP=myapp
$ cat app-scaledobject.yaml.template | envsubst | kubectl apply -f -

or redeploy Keda with ./keda-deploy.sh. You can explore other scalers, or experiment more with the resource settings of the app to try out other behavior. The resource settings are in app.yaml.template, and one way to apply them is a (brute-force) application redeployment with ./app-deploy.sh; alternatively, use kubectl patch with this lengthy call to roughly double the CPU/memory settings:

$ kubectl patch deployment myapp-deploy -n myapp -p '{"spec":{"template":{"spec":{"containers":[{"name":"myapp-container", "resources":{"limits":{"cpu":"50m","memory":"100Mi"}, "requests":{"cpu":"25m","memory":"50Mi"}}}]}}}}'

Our environment is a single-node setup on a single machine, which is pretty limited. But it is quite possible to set up a multi-node system with K3D, also impose quotas, and then see how the system behaves when applying load. More to explore by experiment…
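
The start.sh script creates the single-node cluster for you; for a manual multi-node experiment, K3D can create additional agent nodes roughly like this (the cluster name and node count are arbitrary examples, not part of the repository's scripts):

$ k3d cluster create multi-node-test --servers 1 --agents 3
$ kubectl get nodes    # should list one server and three agent nodes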
