Why We’re Not Using Kubernetes to Scale Our GPU Workloads

Eli Mernit
August 19 2023
Engineering

Beam is a function-as-a-service platform that lets developers run their AI apps quickly on the cloud. Users run primarily AI and data workloads on our platform, and we currently expose two autoscaling strategies in our Python SDK.

This article explains how we set up the autoscaling strategies in our serverless system and some of the trade-offs we had to make.

Control Systems 101

At its core, autoscaling is a controls problem. If you’ve worked with physical systems before, you may have built a controller to manage some part of the system.

For example, maybe you built a temperature controller. The controller runs a loop that compares the current temperature reported by a sensor (a thermometer) to the desired temperature (the set point) and makes adjustments accordingly. You might use a PID (proportional-integral-derivative) controller, which is just one type of controller. It takes the error between the set point and the sensor reading, performs some operations on it (the transfer function), and produces an output. For a temperature controller, that output might be the signal applied to an HVAC system. The resulting temperature change is picked up by the sensor and fed back into the next iteration of the loop.
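
To make the loop concrete, here is a minimal sketch of a textbook discrete PID controller driving a toy room model. The gains, the room model, and the 21 degree set point are all made-up values for illustration, not tuned for any real HVAC system.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (gains here are illustrative)."""
    kp: float
    ki: float
    kd: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, set_point: float, measured: float, dt: float) -> float:
        error = set_point - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 200) -> float:
    """Drive a toy room model toward 21 degrees and return the final temperature."""
    pid = PID(kp=0.6, ki=0.1, kd=0.05)
    temperature, outside, dt = 15.0, 10.0, 1.0
    for _ in range(steps):
        # The heater can only add heat, so negative outputs are clamped to zero.
        heater_power = max(0.0, pid.update(21.0, temperature, dt))
        # Toy plant: the heater warms the room while heat leaks to the outside.
        temperature += dt * (0.1 * heater_power - 0.05 * (temperature - outside))
    return temperature
```

Each call to `update` is one pass through the loop: read the sensor, compute the error, and emit a new control signal.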

An autoscaler is also a controller. In the world of autoscaling serverless workloads, we can define a transfer function which makes adjustments to the system based on a vector of sensor data. In this context that vector is essentially a list of metrics we’re collecting about the workloads. This includes things like average and peak task duration, queue depth, current number of replicas, max replicas, etc.
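
As a rough sketch of that framing, the metrics vector can be modeled as a small record and the transfer function as any callable that maps it to a desired replica count. The names `WorkloadMetrics`, `TransferFunction`, and `control_step` are hypothetical and are not part of Beam's SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class WorkloadMetrics:
    """The 'sensor vector' for one workload (field names are illustrative)."""
    queue_depth: int
    avg_task_duration_s: float
    peak_task_duration_s: float
    current_replicas: int
    max_replicas: int


# A transfer function maps the latest metrics to a desired replica count.
TransferFunction = Callable[[WorkloadMetrics], int]


def control_step(metrics: WorkloadMetrics, transfer: TransferFunction) -> int:
    """Run one loop iteration and return the replica delta to apply."""
    desired = transfer(metrics)
    # Clamp to the allowed range; zero is permitted so idle workloads can scale to zero.
    desired = max(0, min(desired, metrics.max_replicas))
    return desired - metrics.current_replicas
```

A positive return value means replicas should be added, and a negative one means they should be removed.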

K8s Pod Autoscaling Strategies

When we first set up our system, we tried various Kubernetes pod autoscaling strategies. In practice, pod autoscaling also needs node autoscaling so that new pods have somewhere to run, using a tool like Karpenter or Cluster Autoscaler.

Pod autoscaling can happen vertically, horizontally, or based on the number of requests.

  • Horizontal pod autoscaling. You set a target CPU or memory utilization, and pods are added or removed to hold that target. It’s very batteries-included, and it’s easy to set up because it’s just an HPA resource. The main downside is that you need to run Kubernetes. You also need monitoring in place so you know when memory crosses the threshold that triggers autoscaling.
  • Vertical pod autoscaling. This works by evaluating the CPU and memory requirements of each pod and resizing the pods dynamically. But it’s optimized for homogeneous workloads, and it is experimental, so we didn’t end up using it.
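
For reference, the Kubernetes documentation describes the core HPA calculation as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below reproduces only that ratio plus the min/max clamp. It leaves out the tolerance band and stabilization behavior of the real controller, and the function name and default bounds are illustrative.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified version of the documented HPA scaling ratio."""
    desired = math.ceil(current_replicas * current_metric / target_metric)
    # The result is clamped to the configured bounds; the lower bound defaults to 1.
    return max(min_replicas, min(desired, max_replicas))
```

For example, 4 replicas averaging 90% CPU against a 60% target would scale to 6, while a completely idle workload stays pinned at the minimum of 1.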

We quickly realized neither of these approaches would work for us, since our system is serverless by default, which means our workloads need to scale to zero. With traditional Kubernetes-based autoscaling, scale to zero isn’t possible because the minimum number of replicas is 1. [1] You can work around this by setting the number of replicas in your deployment to zero, but that’s not an ideal solution.

We then explored Knative, which implements another form of autoscaling called request-based autoscaling:

  • Request-based autoscaling. Autoscaling is based on how many requests are in flight. This data is captured in a moving window, and the number of replicas is adjusted accordingly. It’s a form of HPA, but it supports scale to zero. It lets you say how many concurrent requests can be in flight before another replica is added. But this requires that you know how many requests each replica can handle at a given time.
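
The idea can be sketched as a moving window of in-flight request counts divided by a per-replica concurrency target. This is a simplified illustration of the concept, not Knative's actual implementation, and the class and parameter names are hypothetical.

```python
import math
from collections import deque


class ConcurrencyAutoscaler:
    """Scale on the average number of in-flight requests over a moving window."""

    def __init__(self, target_per_replica: int, window_size: int, max_replicas: int):
        self.target_per_replica = target_per_replica
        self.max_replicas = max_replicas
        # Old samples fall out automatically once the window is full.
        self.samples = deque(maxlen=window_size)

    def observe(self, in_flight: int) -> None:
        self.samples.append(in_flight)

    def desired_replicas(self) -> int:
        if not self.samples:
            return 0
        average = sum(self.samples) / len(self.samples)
        # An average of zero yields zero replicas, which is what allows scale to zero.
        return min(self.max_replicas, math.ceil(average / self.target_per_replica))
```

The hard part in practice is choosing `target_per_replica`, since it encodes how many concurrent requests one replica can really handle.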

The challenge of scaling GPU workloads

The problem with the Kubernetes autoscaling approaches is that CPU and memory consumption are only indirect metrics of how an application is performing. If you’re scaling a regular backend API or internal service, where CPU and memory are good proxies for how your application is performing, the above might work well for you.

CPU workloads are relatively easy to scale. You can scale them up by either adding more workers (processes) to the web server hosting your app, or by adding more replicas and scaling them out horizontally.

However, it’s much harder to do the same thing for GPU workloads. There are ways to share a single GPU across multiple workloads, but I’ll leave those out of scope for this article. The safest option for scaling a GPU workload is just adding another GPU.

Consider an ML model. Let’s assume a single GPU can only handle X requests/minute, and we pass that threshold. We then need to tell our autoscaler to add another machine – and once that machine is up, and our container is started, we’re going to have to load model weights from disk, load those weights into RAM, and finally onto the GPU.
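
A back-of-the-envelope sketch shows why that startup cost matters: while the new replica is still booting, every request above the existing capacity piles up in the queue. The function name and the numbers in the usage note are hypothetical.

```python
def backlog_during_cold_start(
    arrival_per_min: float, capacity_per_min: float, cold_start_s: float
) -> float:
    """Requests that queue up while a new replica is still starting."""
    # Only traffic above what the existing replicas can serve accumulates.
    excess_per_min = max(0.0, arrival_per_min - capacity_per_min)
    return excess_per_min * cold_start_s / 60.0
```

With a single GPU handling 60 requests/minute, traffic at 90 requests/minute, and a 120 second cold start, roughly 60 requests are already waiting by the time the second replica is ready.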

We have a few tricks for efficiently scaling a workload that has this kind of startup cost:

  • Analyze historical traffic and attempt to anticipate when to add replicas before spikes in traffic occur.
  • Optimize the startup costs of loading a new workload. This involves low-level optimizations for how we load container images and spin up new machines.
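
As one illustration of the first idea, a very simple forecaster could average historical request rates per hour of day and pre-warm replicas for the upcoming hour. This is only a sketch of the general approach under that assumption, not a description of how Beam's system does it, and all names are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean


def hourly_profile(history):
    """Average requests/minute per hour of day from (hour, rate) samples."""
    buckets = defaultdict(list)
    for hour, rate in history:
        buckets[hour].append(rate)
    return {hour: mean(rates) for hour, rates in buckets.items()}


def replicas_to_prewarm(profile, current_hour, capacity_per_replica, max_replicas):
    """Replicas needed for the traffic expected in the next hour."""
    next_hour = (current_hour + 1) % 24
    # Hours with no history are treated as having no expected traffic.
    expected = profile.get(next_hour, 0.0)
    return min(max_replicas, math.ceil(expected / capacity_per_replica))
```

Starting replicas ahead of the expected spike hides the cold start from end users, at the cost of paying for capacity slightly early.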

With that said, we decided to implement two autoscaling strategies at Beam: one based on queue size, and the other based on request latency. Behind the scenes, we’re doing some magic to retrieve cached images quickly using a content-addressable storage system we built, which makes it fast to load a container on a fresh replica.
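
For readers unfamiliar with the term, content-addressable storage keys each blob by a hash of its contents, so identical data is stored once and can be verified on read. The in-memory toy below only illustrates that principle. It says nothing about how Beam's own system is built, and the class name is hypothetical.

```python
import hashlib


class ContentAddressableStore:
    """Toy in-memory store that keys each blob by the SHA-256 of its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Identical content always hashes to the same key, so it is stored once.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # The key doubles as an integrity check on the stored content.
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("stored content does not match its digest")
        return data

    def __len__(self) -> int:
        return len(self._blobs)
```

Because many container images share the same base layers and files, deduplicating by content means a fresh replica often needs to fetch far less data than the full image size suggests.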

But from an end user perspective, there are two simple levers we’re providing:

Autoscaling based on queue size

The first strategy we implemented is based on queue depth.

A user can define how many tasks they’d like to run on a single replica. For example, if a user specifies a limit of 5 tasks per replica and there are 5 requests in the queue, we only need 1 replica.

This is fairly simple to implement. We divide the queue depth by the tasks-per-replica number, round the result to an integer, and take the minimum of that number and the max replicas the user wants to run.
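
The calculation above can be sketched in a few lines. The post doesn't say which way the division is rounded, so rounding up is an assumption here (it keeps every queued task within the per-replica limit), and the function and parameter names are hypothetical.

```python
import math


def replicas_for_queue(queue_depth: int, tasks_per_replica: int, max_replicas: int) -> int:
    """Desired replicas for a queue, capped at the user's maximum."""
    if tasks_per_replica <= 0:
        raise ValueError("tasks_per_replica must be positive")
    # Assumption: round up so no replica is asked to hold more than its limit.
    desired = math.ceil(queue_depth / tasks_per_replica)
    return min(desired, max_replicas)
```

With a limit of 5 tasks per replica, 5 queued requests need 1 replica, 6 need 2, and an empty queue needs none.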

We let users control this based on what they know about their application.

Autoscaling based on request latency

This is similar to scaling by queue depth, but more tailored to an individual use case.

Instead of specifying a maximum number of tasks per replica, users specify the maximum amount of time they want a request to take.

Users set a desired latency (for example, 30s) and the maximum number of replicas to run.
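
The post doesn't spell out the formula behind this strategy, so the following is only one plausible sketch. It assumes the time to drain the queue is roughly queue depth times average task duration divided by the number of replicas, and picks the smallest replica count that keeps that estimate under the desired latency. All names are hypothetical.

```python
import math


def replicas_for_latency(
    queue_depth: int,
    avg_task_duration_s: float,
    desired_latency_s: float,
    max_replicas: int,
) -> int:
    """Smallest replica count whose estimated drain time meets the latency target."""
    if desired_latency_s <= 0:
        raise ValueError("desired_latency_s must be positive")
    # Assumption: total outstanding work is spread evenly across replicas.
    total_work_s = queue_depth * avg_task_duration_s
    desired = math.ceil(total_work_s / desired_latency_s)
    return min(desired, max_replicas)
```

For instance, 12 queued tasks averaging 10 seconds each against a 30 second target would call for 4 replicas under this model.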

Conclusion

All workloads are different, and there’s no one-size-fits-all strategy for autoscaling.

While we initially tried Kubernetes-based autoscaling for our system, we realized that CPU and memory-based autoscaling strategies didn’t take into account the actual behavior of an application.

So far, request latency autoscaling has worked fairly well across different use cases. At the end of the day, our users aren’t paying attention to how much CPU or memory their application is using. Instead, they’re thinking about whether requests are getting dropped, and also how long their end-users are waiting for a response from our APIs.

Request latency autoscaling makes it easy to tie autoscaling behavior very closely to the end-user experience.

That said, we’ve only scratched the surface of possible autoscaling strategies. We plan to expose more in the future.

If anyone has ideas, we’d love to hear about them. You can email us at founders [at] beam [dot] cloud.

Appendix

[1] Technically it is possible, but typically not with managed services like EKS or GKE. It may be possible with GKE, as of k8s 1.22.

There is also an experimental feature gate called HPAScaleToZero that's been available since k8s 1.16. When you enable it, you also have to scale the deployment on external metrics that aren't tied to the deployment's own pods.
