Serving vLLM for LLM Inference
Eli Mernit
This is the first feature in our launch week, where we ship a brand new feature every day, five days in a row.
Background
We’re launching a new abstraction for running vLLM apps on Beam.
beam.cloud is a platform for running serverless apps on GPUs in the cloud. We containerize your code, expose it as a web API, and scale it out for you automatically. You don’t need Docker or AWS or any infrastructure setup to use it.
What is vLLM?
Early this year, we started getting a lot of requests from users to run a new inference engine called vLLM. vLLM is an inference engine for LLMs. The library has optimizations to provide higher throughput and faster inference. It also provides a convenient OpenAI-compatible abstraction, which lets you run inference on open-source models using the same OpenAI SDK syntax that many of us know and love.
Conventional Ways of Serving LLMs with vLLM
vLLM is usually run from the command line, or as a Dockerized application. The command line is great for working locally, but it’s a bit hacky to get it running as a cloud endpoint. Dockerizing vLLM can work too, but then you’re faced with setting up a Dockerfile, pushing it to a hosting service, or hosting it yourself on Kubernetes (yikes).
Serving vLLM on Serverless GPUs
We set out to create a stupidly simple way of running serverless vLLM apps on our cloud.
If you’re familiar with Beam, we containerize code and run it on servers on the cloud.
We wanted to minimize the amount of code needed to run OpenAI-compatible vLLM servers, so we built a special wrapper for vLLM into our ASGI web server abstraction.
How it Works
To run a vLLM app, simply specify the name of your vLLM model, add the GPU you want, and we’ll give you a serverless REST API with SSL, autoscaling, and authentication built-in.
Here’s the Python:
You can swap out the model name in the code above to run any of the many supported vLLM models.
And here’s the command to deploy the model as a serverless API:
In addition to running vLLM as a serverless API, you'll get a few free benefits from running this on Beam:
- API Versioning. Each deployment is incremented to a new version, but you can specify specific versions in your URL too.
- Task Management. Your endpoint will automatically process tasks in a queue, and you'll get APIs to query the task status, cancel tasks, and view logs from each container running your task.
- Autoscaling. If your app starts getting a ton of traffic, your API can automatically scale out to run on hundreds of servers.
- File Caching at The Edge™. Beam includes distributed storage volumes for caching all your model weights and large files on S3-backed blobcaches geographically close to the servers running your code.
To get started, create a Beam account and download the vLLM template here. If you have any feedback on the workflow or feature requests, we’d love to hear from you in our Slack Community.
This is Launch 1/5 this week! You can follow along with our upcoming launches on Twitter.
Keep Reading

Top Heroku Alternatives
Modern PaaS solutions similar to Heroku that may better suit your use case.
Samuel Liu
