Serverless Inference
for any model
Deploy high-performance inference endpoints with sub-second cold starts, autoscaling from zero to thousands of GPUs, and pricing that bills only for the compute you use.
Production inference, without the infrastructure
Everything you need to serve models at scale — autoscaling, fast boot times, and a developer experience built for iteration.
From Python script to live endpoint in seconds
Write a Python function
Wrap your inference logic in an @endpoint decorator and declare the GPU, image, and autoscaling config inline.
Deploy with one command
Run beam deploy to ship your endpoint to serverless GPUs. No Dockerfiles, no Kubernetes, no YAML.
Call your endpoint
Hit your endpoint over HTTP with built-in auth, autoscaling, and telemetry handled for you.
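The three steps above can be sketched in a few lines of Python. This is a minimal illustration rather than Beam's actual SDK: the endpoint decorator below is a local stand-in that only records its arguments so the sketch runs anywhere, and the config keys, endpoint name, URL, and bearer-token header are all assumptions. Check the SDK reference for the real parameter names.

```python
import json
import urllib.request
from typing import Any, Callable


# Hypothetical stand-in for the @endpoint decorator described above.
# It only records the declared config so this sketch runs without the SDK;
# the real decorator's parameters and behavior may differ.
def endpoint(**config: Any) -> Callable:
    def wrap(fn: Callable) -> Callable:
        fn.config = config  # keep the inline config next to the function
        return fn
    return wrap


@endpoint(
    name="sentiment",                                         # hypothetical endpoint name
    gpu="A100",                                               # switch hardware by changing this one argument
    image={"python_packages": ["torch"]},                     # assumed shape of the image config
    autoscaling={"min_containers": 0, "max_containers": 10},  # assumed keys
)
def predict(text: str) -> dict:
    # Placeholder inference logic; a real handler would run a model here.
    label = "positive" if "good" in text.lower() else "negative"
    return {"label": label}


def build_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST request for a deployed endpoint."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumes bearer-token auth
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder URL and token; substitute the values shown for your deployment.
request = build_request("https://example.com/sentiment", "YOUR_TOKEN", {"text": "good"})
# urllib.request.urlopen(request) would send it once the endpoint is live.
```

After running beam deploy, the request built here would be sent to the URL assigned to your endpoint. Until then, predict can be called locally as an ordinary function to check the inference logic.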
Frequently asked questions
What kinds of models can I deploy?
Any model you can run in Python. Beam works great with open-source LLMs via vLLM, custom PyTorch and TensorFlow models, diffusion models, embeddings, and your own fine-tuned weights.
Which GPUs are available?
You can run inference on a range of GPUs including H100s and A100s. Switching hardware is a single line of Python — just change the gpu argument on your endpoint.
How fast are cold starts?
Sub-second for most workloads. Our distributed storage layer, memory snapshotting, and GPU checkpoint restore keep boot times to a few seconds even when large models have to be loaded into memory.
How does pricing work?
You're billed per second, and only while your endpoint is actively running. When traffic stops, your endpoint scales to zero and billing stops with it. Every account gets $30 of free credit each month.
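To see how per-second billing and the monthly credit combine, here is a rough estimate in Python. The hourly rate is a made-up figure for illustration, not a published price, and the sketch assumes the $30 credit is simply subtracted from the month's usage.

```python
def billed_cost(active_seconds: int, rate_per_hour: float, monthly_credit: float = 30.0) -> float:
    """Estimate a month's bill: active seconds times the per-second rate, minus free credit."""
    rate_per_second = rate_per_hour / 3600
    gross = active_seconds * rate_per_second
    # Idle time costs nothing, and the bill never drops below zero.
    return round(max(gross - monthly_credit, 0.0), 2)


HYPOTHETICAL_RATE = 3.60  # dollars per GPU-hour, assumed for this example only

# 20 hours of active time in a month: 72,000 s * $0.001/s = $72, minus $30 credit.
print(billed_cost(20 * 3600, HYPOTHETICAL_RATE))  # 42.0

# 5 hours of active time: $18 of usage, fully covered by the credit.
print(billed_cost(5 * 3600, HYPOTHETICAL_RATE))   # 0.0
```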
Can I run this on my own infrastructure?
Yes. Beam is 100% open source, so you can self-host on your own hardware or run on our cloud with the same developer experience.