Serverless Inference

for any model

Deploy high-performance inference endpoints with sub-second cold starts, autoscaling from zero to thousands of GPUs, and per-second pricing that bills only for the compute you use.

Read the inference docs ⟶

Production inference, without the infrastructure

Everything you need to serve models at scale — autoscaling, fast boot times, and a developer experience built for iteration.

Sub-second cold starts

A distributed storage layer, memory snapshotting, and GPU checkpoint restore boot most containers in under a second, and in seconds even with large models loaded.

Scale to zero, burst to thousands

Endpoints scale out when traffic spikes and all the way down to zero when idle, so you never pay for idle GPUs.
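
To make the scale-to-zero behavior concrete, here is a minimal sketch of a queue-depth policy. The function name, its parameters, and the policy itself are assumptions for illustration only; they are not Beam's actual autoscaler.

```python
import math


def desired_containers(pending_tasks: int, tasks_per_container: int, max_containers: int) -> int:
    """Hypothetical policy: run just enough containers to cover the backlog."""
    if tasks_per_container <= 0:
        raise ValueError("tasks_per_container must be positive")
    if pending_tasks <= 0:
        return 0  # nothing queued, so scale all the way down to zero
    needed = math.ceil(pending_tasks / tasks_per_container)
    return min(needed, max_containers)  # never exceed the configured cap


# Idle, moderate, and burst traffic under the same (made-up) limits.
idle = desired_containers(0, tasks_per_container=4, max_containers=10)
moderate = desired_containers(9, tasks_per_container=4, max_containers=10)
burst = desired_containers(1000, tasks_per_container=4, max_containers=10)
```

With these made-up limits, an empty queue needs no containers, nine queued tasks need three, and a large burst is held at the cap of ten.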

Run any model on any GPU

Bring your own weights or open-source models and run them on H100s, A100s, and more — switch hardware by changing one line of Python.

Multiple workers per container

Load your model once with on_start, then scale vertically by running multiple workers in the same container to maximize GPU utilization.
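
The load-once pattern can be sketched in plain Python. The `Worker` class and `load_model` function below are hypothetical stand-ins, not part of the Beam SDK; they only show why paying the load cost once and reusing the result across requests is worthwhile.

```python
from typing import Callable

load_count = 0  # tracks how many times the expensive load runs


def load_model() -> Callable[[str], str]:
    """Stand-in for an expensive model load (hypothetical)."""
    global load_count
    load_count += 1
    return lambda prompt: prompt.upper()  # toy "model" that upper-cases text


class Worker:
    """Hypothetical worker: calls on_start once, then reuses the result."""

    def __init__(self, on_start: Callable[[], Callable[[str], str]]):
        self.model = on_start()  # load cost is paid here, once per worker

    def handle(self, prompt: str) -> str:
        return self.model(prompt)  # no reload on the request path


worker = Worker(on_start=load_model)
outputs = [worker.handle(p) for p in ["hello", "beam"]]
```

In this sketch two requests are served but `load_model` runs only once, which is the property the on_start hook is meant to give you.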

Streaming & async responses

Stream tokens back as they're generated, or return responses asynchronously with task callbacks for long-running inference.
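
As a rough illustration of streaming, the sketch below uses a Python generator to hand back pieces of a response as they become available. The `stream_tokens` function and its whitespace "tokenizer" are assumptions for illustration and do not reflect how Beam or any particular model emits tokens.

```python
from typing import Iterator


def stream_tokens(text: str) -> Iterator[str]:
    """Hypothetical generator: yields one whitespace-delimited token at a time."""
    for token in text.split():
        yield token  # a real model would yield each token as it is generated


# A client can act on each chunk immediately instead of waiting for the full response.
received = []
for chunk in stream_tokens("tokens arrive one by one"):
    received.append(chunk)

full_response = " ".join(received)
```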

Only pay for what you use

Billing is per-second and stops the moment your endpoint goes idle, so you only pay while your code is actually running.
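
The arithmetic behind per-second billing can be sketched as follows. The rate and the activity intervals are made-up numbers, and `billed_cost` is a hypothetical helper, not a Beam API.

```python
from decimal import Decimal


def billed_cost(active_intervals, rate_per_second: Decimal) -> Decimal:
    """Sum only the seconds the endpoint was running; idle gaps cost nothing."""
    seconds = sum(end - start for start, end in active_intervals)
    return seconds * rate_per_second


RATE = Decimal("0.001")  # hypothetical price per GPU-second, not a real rate

# Active for 90 s, idle for 510 s, then active for another 45 s.
cost = billed_cost([(0, 90), (600, 645)], RATE)
```

Only the 135 active seconds are charged here; the idle gap between the two bursts does not contribute to the total.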

From Python script to live endpoint in seconds

Write a Python function

Wrap your inference logic in an @endpoint decorator and declare the GPU, image, and autoscaling config inline.

Deploy with one command

Run beam deploy to ship your endpoint to serverless GPUs. No Dockerfiles, no Kubernetes, no YAML.

Call your endpoint

Call your endpoint over HTTP. Auth, autoscaling, and telemetry are built in and handled for you.
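
A call to a deployed endpoint might look like the sketch below, which uses only the Python standard library. The URL, the token, the JSON payload shape, and the use of a bearer token in the Authorization header are all assumptions for illustration; check your own deployment for the real values.

```python
import json
import urllib.request


def build_request(url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build an authenticated JSON POST request (assumed bearer-token auth)."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder URL and token; replace both with values from your deployment.
req = build_request("https://example.invalid/my-endpoint", "YOUR_TOKEN", {"prompt": "hello"})

# urllib.request.urlopen(req) would send the request; it is left out here
# so the sketch runs without network access.
```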

Frequently asked questions

What kinds of models can I deploy?

Any model you can run in Python. Beam works great with open-source LLMs via vLLM, custom PyTorch and TensorFlow models, diffusion models, embedding models, and your own fine-tuned weights.

Which GPUs are available?

You can run inference on a range of GPUs including H100s and A100s. Switching hardware is a single line of Python — just change the gpu argument on your endpoint.

How fast are cold starts?

Sub-second for most workloads. Our distributed storage layer, memory snapshotting, and GPU checkpoint restore keep boot times to seconds even when large models are loaded into memory.

How does pricing work?

You're billed per second, and only while your endpoint is actively running. When traffic stops, your endpoint scales to zero and billing stops with it. Every account gets $30 of free credit each month.

Can I run this on my own infrastructure?

Yes. Beam is 100% open source, so you can self-host on your own hardware or run on our cloud with the same developer experience.

Ship your app in minutes

Get started with $30 of free credit, refreshed every month.