The Magic of Serverless GPU: A Behind the Scenes Look

Eli Mernit

April 4, 2024 · 5 min read

With the growth of generative AI over the past year, developers have flocked to serverless GPU providers for affordable, on-demand GPU compute. Our company is one such provider, and we’ve decided to share what we’ve learned.

What are serverless GPUs?
Why don’t public clouds offer serverless GPUs?
The architecture behind serverless GPU

What are serverless GPUs?

Serverless has a few requirements:

The service turns off when you’re not using it
Consumption-based pricing: you only pay for what you use
It is opinionated, such that you don’t need to configure much infrastructure yourself

Over the past year, more than a dozen companies have emerged to offer serverless GPUs. The appeal is easy to see: developers are flocking to build AI startups, their products don’t have traction yet, and they want a fast way to prototype their apps without paying $4,000 a month for an A100 on AWS.
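To make that trade-off concrete, here is a rough break-even sketch. The $4,000 monthly figure comes from the paragraph above; the per-hour serverless rate is a made-up placeholder for illustration, not a quote from any provider.

```python
HOURS_PER_MONTH = 730  # average hours in a month (8760 / 12)

DEDICATED_MONTHLY = 4000.0  # from the text: an A100 on AWS, in $/month
HYPOTHETICAL_RATE = 10.0    # assumed serverless price in $/hour, illustration only


def monthly_serverless_cost(busy_hours: float, rate_per_hour: float) -> float:
    """Cost when you are billed only for the hours the GPU is actually busy."""
    return busy_hours * rate_per_hour


def break_even_hours(dedicated_monthly: float, rate_per_hour: float) -> float:
    """Busy hours per month above which a dedicated GPU becomes cheaper."""
    return dedicated_monthly / rate_per_hour


if __name__ == "__main__":
    hours = break_even_hours(DEDICATED_MONTHLY, HYPOTHETICAL_RATE)
    print(f"break-even at {hours:.0f} busy hours/month "
          f"({hours / HOURS_PER_MONTH:.0%} utilization)")
    print(f"20 busy hours cost ${monthly_serverless_cost(20, HYPOTHETICAL_RATE):.2f}")
```

Under these assumed numbers, a prototype that keeps a GPU busy for 20 hours a month pays a small fraction of the dedicated price, and the dedicated machine only wins once usage passes 400 busy hours a month.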

That said, if the demand for serverless GPUs is real, why aren’t the major cloud providers offering this service themselves? Why is this niche full of startups?

Why don’t public clouds offer serverless GPU?

In our view, there are two main reasons the major cloud providers don’t offer serverless GPUs: (1) lack of availability and (2) technical constraints.

Probably the most popular serverless products are AWS Lambda and GCP Cloud Run. These products are successful because they sell fractional compute and do a nice job of locking developers into their own ecosystems. Previously, developers needed to buy 100% of a virtual machine even if they were running only 5 minutes of compute. Lambda lets developers pay for just those 5 minutes, and it bundles additional services like API gateways, load balancers, and routing tables.

However, while CPUs are abundant, GPUs are scarce. Companies have been renting GPUs from hedge funds.

From the beginning, AWS has served as a way for Amazon to offload its excess compute to customers. But AWS is investing heavily in AI and reserving the best GPUs for its own internal projects, like book recommendations and Amazon Bedrock. This leaves few GPUs available to satisfy the fractional, on-demand use case of serverless.

Secondly, it’s a matter of technical constraints. AWS Lambda is built on the open-source Firecracker VMs, which currently lack GPU support, and there is no public plan to add it.

Another reason cloud providers may not offer serverless GPUs is that CPU virtualization is old, well established, and safe, whereas GPU virtualization is a technically complex and relatively new endeavor. There are open-source projects to create the GPU equivalent of a vCPU (vGPU), but none of them provide true, isolated virtualization.

Any of these factors might change in the coming months. But regardless, these trends have exacerbated the shortage of GPUs, prevented clouds from offering GPUs in their serverless offerings, and encouraged startups to fill this gap.

Ok, but how does serverless GPU actually work?

We began building serverless tools for machine learning before the rise of generative AI, but the past year has taken things to a new level. Our platform is used by hundreds of developers in production, including companies like Stratum AI, Shippabo, and Frase.

In the past year, the term serverless GPU became mainstream. Developers frequently search for serverless GPU and compare cold start times across the different providers.

There are two UX challenges to offering a serverless GPU service: making sure the apps start very quickly from cold, and building a good DX that abstracts away painful parts of DevOps.

Building a reliable serverless provider requires three things:

A cluster of highly available GPU servers
A way of scheduling workloads onto them
A way of loading container images and models quickly

Access to servers

Despite the name, serverless code still runs on servers; the management of those servers is simply hidden from developers.

These servers can be rented from a variety of vendors: public clouds (AWS/GCP), bespoke GPU vendors (Lambda Labs, Runpod), and distributed marketplaces (Vast.ai, Brokker).

Once you’ve gotten hold of some machines, you need a way of connecting them to each other and scheduling workloads on them. Most likely, you'll use Kubernetes to schedule workloads across a pool of GPU nodes, and you might use a framework like KEDA or Karpenter to set up autoscaling.
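As a rough mental model of what the scheduler and autoscaler decide, here is a toy sketch in plain Python. It is not how Kubernetes, KEDA, or Karpenter are implemented; the `Node` class, the first-fit policy, and the `requests_per_replica` parameter are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical GPU server in the pool."""
    name: str
    free_gpus: int
    workloads: list = field(default_factory=list)


def schedule(nodes, workload, gpus_needed=1):
    """First-fit: place the workload on the first node with enough free GPUs."""
    for node in nodes:
        if node.free_gpus >= gpus_needed:
            node.free_gpus -= gpus_needed
            node.workloads.append(workload)
            return node.name
    return None  # nothing fits; a node autoscaler would add capacity here


def desired_replicas(pending_requests, requests_per_replica, max_replicas):
    """Scale with queue depth, and all the way to zero when there is no traffic."""
    if pending_requests <= 0:
        return 0
    return min(max_replicas, math.ceil(pending_requests / requests_per_replica))


if __name__ == "__main__":
    pool = [Node("gpu-node-a", free_gpus=1), Node("gpu-node-b", free_gpus=2)]
    for job in ["job-1", "job-2", "job-3", "job-4"]:
        print(job, "->", schedule(pool, job))
    print("replicas when idle:", desired_replicas(0, 4, 10))
    print("replicas for 9 queued requests:", desired_replicas(9, 4, 10))
```

The two functions map onto the two halves of the problem: placement decides where a workload runs, and the replica count decides how many copies run at all, including none when the app is idle.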

The next thing you’ll need is a way of loading container images quickly. Since users can run arbitrary Python on our GPUs, there is a wide variety of images that we need to load onto machines.

Images can either be cached directly on nodes or lazy-loaded by layer using something like Stargz.

Loading images

When we first built Beam, we stored images on NFS and downloaded each image in full. Over the summer, we built a custom image format, as well as a simple content-addressable storage system, which lets us pull down images much faster.

This system allows us to retrieve images based on a hash from a read-only file system. In only 150 lines of code, we’re able to support hundreds of users concurrently pulling images from the cache, while using less than 200Gi of RAM.

For any given image running on Beam, 80% of its contents are pulled from the cache, which reduces the number of image layers we need to download over the network.
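The idea behind content-addressable storage can be sketched in a few lines. This is a minimal illustration, not Beam’s actual image format or cache: the `ContentStore` class, the fixed-size chunking, and the tiny 4-byte chunk size are all assumptions chosen to keep the example readable.

```python
import hashlib


class ContentStore:
    """Maps the SHA-256 digest of a chunk to its bytes."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # identical chunks are stored once
        return digest

    def get(self, digest: str):
        return self._blobs.get(digest)


def build_manifest(store, image: bytes, chunk_size: int = 4):
    """Split an image into fixed-size chunks and record their digests in order."""
    return [store.put(image[i:i + chunk_size])
            for i in range(0, len(image), chunk_size)]


def pull(manifest, cache, origin):
    """Rebuild an image, preferring the local cache over the origin store."""
    parts, hits = [], 0
    for digest in manifest:
        data = cache.get(digest)
        if data is None:
            data = origin.get(digest)  # stands in for a network fetch
            if data is None:
                raise KeyError(f"chunk {digest} not found in origin")
            cache.put(data)
        else:
            hits += 1
        parts.append(data)
    return b"".join(parts), hits / len(manifest)


if __name__ == "__main__":
    origin, cache = ContentStore(), ContentStore()
    image_a = build_manifest(origin, b"baseLIBSappA")
    image_b = build_manifest(origin, b"baseLIBSappB")
    print(pull(image_a, cache, origin))  # cold cache: every chunk is fetched
    print(pull(image_b, cache, origin))  # shared chunks now come from the cache
```

Because chunks are keyed by their hash, two images that share a base layer share the same cached chunks, which is why the second pull only needs to fetch the part that differs.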

Loading model weights

The last step is loading model weights, which can be optimized by storing weights on the nodes themselves or by loading them from a distributed file system (EFS and JuiceFS are two of many options here). In our case, users can upload their weights to persistent volumes backed by NFS, which are mounted directly to the containers running their workloads.
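The two options above can also be combined: keep a node-local copy in front of the shared volume. The sketch below shows that pattern under stated assumptions; `resolve_weights`, the directory layout, and the copy-on-first-use policy are hypothetical and are not a description of how Beam mounts volumes.

```python
import shutil
import tempfile
from pathlib import Path


def resolve_weights(filename: str, volume_dir: Path, node_cache_dir: Path) -> Path:
    """Return a node-local path for the weights, copying from the volume on first use."""
    cached = node_cache_dir / filename
    if cached.exists():
        return cached  # warm path: no read from the shared volume
    node_cache_dir.mkdir(parents=True, exist_ok=True)
    partial = cached.with_name(cached.name + ".partial")
    shutil.copyfile(volume_dir / filename, partial)
    partial.replace(cached)  # rename so readers never see a half-copied file
    return cached


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        volume = Path(root) / "volume"      # stands in for the NFS-backed volume
        node_cache = Path(root) / "cache"   # stands in for fast local disk
        volume.mkdir()
        (volume / "model.bin").write_bytes(b"fake weights")
        first = resolve_weights("model.bin", volume, node_cache)   # copies
        second = resolve_weights("model.bin", volume, node_cache)  # cache hit
        print(first == second, second.read_bytes())
```

The trade-off is disk space on each node in exchange for skipping the network file system on every start after the first.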

So, those are the key ingredients.

Using this system, it takes us 7-10s to launch a container on a GPU, pull a 10 GB image from our cache, and load that image into GPU memory.

We think this is pretty good. To improve this further, you’re mostly looking at optimizing things like network bandwidth or disk I/O.
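A quick back-of-the-envelope calculation shows why bandwidth and disk I/O dominate at this point. It assumes the image size above means 10 gigabytes and that the whole image has to be read within the 7-10s window, which is a simplification; the function names are illustrative.

```python
def required_throughput_gb_per_s(size_gb: float, budget_s: float) -> float:
    """Sustained throughput needed to move size_gb within budget_s seconds."""
    return size_gb / budget_s


def transfer_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to move size_gb at a given sustained throughput."""
    return size_gb / throughput_gb_per_s


if __name__ == "__main__":
    for budget in (7, 10):
        rate = required_throughput_gb_per_s(10, budget)
        print(f"{budget}s budget -> at least {rate:.2f} GB/s sustained")
    # Doubling the throughput from 1 GB/s to 2 GB/s halves the transfer time.
    print(transfer_seconds(10, 1.0), transfer_seconds(10, 2.0))
```

Even with the entire window spent on transfer, the pipeline has to sustain at least 1 GB/s, so any further gains have to come from faster links, faster disks, or reading less data.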

Conclusion

A crucial distinction between serverless GPU providers is whether they’re providing a curated set of pre-selected models, or whether they’re allowing users to run arbitrary workloads.

Many companies have begun offering serverless GPU inference for popular models. Behind the scenes, they’re running a pool of servers for you, with those models already loaded onto the machines.

Since our workloads are arbitrary, we’ve had to do the dirty work of optimizing the process of loading an arbitrary workload onto a GPU in the least amount of time, and we’ve learned a lot in the process.

Hopefully, this post has exposed the variables behind each of these providers and given you a better sense of how the sausage is made.
