Product

Serving vLLM for LLM Inference

Eli Mernit

December 2, 20242 min read

This is the first feature in our launch week, where we ship a brand new feature every day, five days in a row.

Background

We’re launching a new abstraction for running vLLM apps on Beam.

beam.cloud is a platform for running serverless apps on GPUs in the cloud. We containerize your code, expose it as a web API, and scale it out for you automatically. You don’t need Docker or AWS or any infrastructure setup to use it.

What is vLLM?

Early this year, we started getting a lot of requests from users to run a new inference engine called vLLM. vLLM is an inference engine for LLMs. The library has optimizations to provide higher throughput and faster inference. It also provides a convenient OpenAI-compatible abstraction, which lets you run inference on open-source models using the same OpenAI SDK syntax that many of us know and love.

Conventional Ways of Serving LLMs with vLLM

vLLM is usually run from the command line, or as a Dockerized application. The command line is great for working locally, but it’s a bit hacky to get it running as a cloud endpoint. Dockerizing vLLM can work too, but then you’re faced with setting up a Dockerfile, pushing it to a hosting service, or hosting it yourself on Kubernetes (yikes).

Serving vLLM on Serverless GPUs

We set out to create a stupidly simple way of running serverless vLLM apps on our cloud.

If you’re familiar with Beam, we containerize code and run it on servers on the cloud.

We wanted to minimize the amount of code needed to run OpenAI-compatible vLLM servers, so we built a special wrapper for vLLM into our ASGI web server abstraction.

How it Works

To run a vLLM app, simply specify the name of your vLLM model, add the GPU you want, and we’ll give you a serverless REST API with SSL, autoscaling, and authentication built-in.

Here’s the Python:

You can swap out the model name in the code above to run any of the many supported vLLM models.

And here’s the command to deploy the model as a serverless API:

In addition to running vLLM as a serverless API, you'll get a few free benefits from running this on Beam:

API Versioning. Each deployment is incremented to a new version, but you can specify specific versions in your URL too.
Task Management. Your endpoint will automatically process tasks in a queue, and you'll get APIs to query the task status, cancel tasks, and view logs from each container running your task.
Autoscaling. If your app starts getting a ton of traffic, your API can automatically scale out to run on hundreds of servers.
File Caching at The Edge™. Beam includes distributed storage volumes for caching all your model weights and large files on S3-backed blobcaches geographically close to the servers running your code.

To get started, create a Beam account and download the vLLM template here. If you have any feedback on the workflow or feature requests, we’d love to hear from you in our Slack Community.

This is Launch 1/5 this week! You can follow along with our upcoming launches on Twitter.

Eli Mernit

Published December 2, 2024

Serving vLLM for LLM Inference

Background

What is vLLM?

Conventional Ways of Serving LLMs with vLLM

Serving vLLM on Serverless GPUs

How it Works

More from the Beam blog

Best Sandbox Providers for Reinforcement Learning in 2026

Top Heroku Alternatives

Start shipping on infra
you won’t outgrow.

Serving vLLM for LLM Inference

Background

What is vLLM?

Conventional Ways of Serving LLMs with vLLM

Serving vLLM on Serverless GPUs

How it Works

More from the Beam blog

Best Sandbox Providers for Reinforcement Learning in 2026

Top Heroku Alternatives

Start shipping on infrayou won’t outgrow.

Start shipping on infra
you won’t outgrow.