WhisperX Tutorial: Install, Diarization, API Server, and Cloud Deployment
Hassaan Qadir
WhisperX extends OpenAI's Whisper with speaker diarization and word-level timestamp alignment. In this guide, we'll deploy WhisperX on a serverless GPU using Beam. By the end, you'll have a working REST API that transcribes audio files and returns timestamps and speaker labels.
WhisperX is a speech transcription toolkit built around fast Whisper-style ASR, forced alignment for word-level timestamps, voice activity detection, and optional speaker diarization. It is useful when you need more than a plain transcript: subtitles, speaker labels, timestamps, or a transcription API for production workloads.
This guide covers how WhisperX works, how to install it, how to run transcription and diarization, and how to deploy it as an API on a cloud GPU.
What Is WhisperX?
WhisperX combines several pieces of a transcription pipeline:
- Batched speech recognition for faster transcription.
- Voice activity detection to split audio into speech regions.
- Forced alignment to produce more accurate word-level timestamps.
- Optional speaker diarization through pyannote models.
This makes WhisperX a good fit for subtitle generation, podcast and meeting transcription, speaker-labeled transcripts, and API-based audio processing.
WhisperX vs Whisper
| Feature | Whisper | WhisperX |
|---|---|---|
| Transcription | Yes | Yes |
| Batched inference | Limited | Yes |
| Word-level timestamps | Basic/approximate | Forced alignment |
| Speaker diarization | No built-in diarization | Optional via pyannote |
| Production API | Requires wrapping | Requires wrapping, but better structured outputs |
How to Install WhisperX
The simplest install path is a pip install from PyPI:
pip install whisperx
For GPU inference, make sure your PyTorch installation matches your CUDA version. If you plan to use speaker diarization, you also need a Hugging Face token and access to the relevant pyannote diarization models.
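Before running anything on a GPU, it can help to confirm what is actually installed. The sketch below is an illustrative check, not part of WhisperX: the `check_environment` helper is a hypothetical name, and it only reports whether the `whisperx` and `torch` packages are importable and which CUDA version PyTorch was built against.

```python
import importlib.util


def check_environment():
    """Report whether WhisperX and a CUDA-enabled PyTorch are importable."""
    report = {
        "whisperx_installed": importlib.util.find_spec("whisperx") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_cuda_build": None,  # CUDA version PyTorch was built against, if any
        "cuda_available": False,   # whether PyTorch can see a usable GPU right now
    }
    if report["torch_installed"]:
        import torch  # imported lazily so the check also runs without PyTorch

        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `torch_cuda_build` is set but `cuda_available` is false, a mismatch between the PyTorch build and the installed driver or CUDA runtime is one likely cause.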
WhisperX Quickstart
A typical WhisperX pipeline loads audio, transcribes it, aligns the output for word-level timestamps, and optionally assigns speaker labels.
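As a minimal sketch of that pipeline, the function below loads audio, transcribes it in batches, and aligns the result. The model size, compute type, and batch size are assumed defaults rather than recommendations, and `transcribe_file` and `segments_to_text` are illustrative helper names. `whisperx` is imported inside the function so the file can be imported on a machine without it.

```python
def transcribe_file(audio_path, model_name="large-v2", device="cuda",
                    compute_type="float16", batch_size=16):
    """Transcribe one audio file and align it for word-level timestamps."""
    import whisperx  # requires `pip install whisperx`

    audio = whisperx.load_audio(audio_path)

    # 1. Batched transcription with the Whisper model.
    model = whisperx.load_model(model_name, device, compute_type=compute_type)
    result = model.transcribe(audio, batch_size=batch_size)

    # 2. Forced alignment using a model chosen for the detected language.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(
        result["segments"], align_model, metadata, audio, device,
        return_char_alignments=False,
    )


def segments_to_text(segments):
    """Join segment texts into one plain transcript string."""
    return " ".join(seg["text"].strip() for seg in segments)
```

A call such as `segments_to_text(transcribe_file("meeting.wav")["segments"])` would then yield the plain transcript, where `meeting.wav` is a placeholder path.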
Speaker Diarization with WhisperX
Speaker diarization labels who spoke when. In WhisperX, diarization depends on pyannote models from Hugging Face.
- Create a Hugging Face account.
- Generate a read token.
- Accept the model terms for the required pyannote diarization and segmentation models.
- Pass the token to WhisperX when running diarization.
If diarization fails with a 403 or access error, the most common cause is that the Hugging Face token is valid but the model terms have not been accepted.
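The steps above might look like the following in code. This is a sketch under a few assumptions: the token is read from an `HF_TOKEN` environment variable (a naming choice made here), the helper names are hypothetical, and the `use_auth_token` keyword reflects WhisperX releases I am aware of, so check it against your installed version.

```python
import os


def get_hf_token(environ=os.environ):
    """Read the Hugging Face token, failing early with a clear message."""
    token = environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("Set HF_TOKEN to a Hugging Face read token.")
    return token


def add_speaker_labels(audio, aligned_result, device="cuda"):
    """Run pyannote diarization and attach speaker labels to aligned output."""
    import whisperx
    from whisperx.diarize import DiarizationPipeline

    # Keyword name may differ between WhisperX releases (assumption).
    diarizer = DiarizationPipeline(use_auth_token=get_hf_token(), device=device)
    diarize_segments = diarizer(audio)
    return whisperx.assign_word_speakers(diarize_segments, aligned_result)


def speaker_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns
```

Merging consecutive segments into turns is optional, but it tends to make speaker-labeled transcripts easier to read than one line per segment.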
Running WhisperX as an API Server
For production applications, you usually do not want to run WhisperX from the command line for every request. Instead, wrap it in an API that accepts audio files, runs transcription, and returns JSON, SRT, VTT, or another structured output. A production wrapper typically has to handle:
- Uploading or downloading audio files.
- Loading and reusing transcription models.
- Optional alignment and diarization.
- Timeouts for long audio files.
- GPU memory cleanup between jobs.
- Returning word-level timestamps and speaker labels.
This is where a cloud deployment becomes useful: the model can stay warm, large audio jobs can run on GPU, and the API can scale separately from your application.
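To illustrate the structured-output part, here is a small sketch that turns WhisperX-style segments into SRT text. It assumes each segment is a dict with `start`, `end`, and `text` keys and an optional `speaker` key; the function names are illustrative.

```python
def format_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render segments as numbered SRT cues, prefixing the speaker if present."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if "speaker" in seg:
            text = f"[{seg['speaker']}] {text}"
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

The same segment list can be returned as JSON unchanged, so an API could offer both formats from a single transcription run.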
Overview
We will:
- Set up the cloud GPU environment.
- Define the compute environment and dependencies.
- Load the WhisperX model.
- Create the transcription API.
- Deploy the API.
- Invoke the API to transcribe audio files.
Step 1: Setting Up the Cloud GPU Environment
Beam simplifies the deployment of machine learning models by handling the infrastructure and scaling for you. We'll define our compute environment and specify the dependencies required for WhisperX.
Defining the Compute Environment
We start by specifying the runtime environment for WhisperX using Beam's Image class. This allows us to define the Python packages needed for our application.
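A sketch of what that definition could look like follows. The package list, Python version, volume name, and the ffmpeg install command are assumptions made for illustration, and the Beam imports sit inside functions so the file can be read and imported without the Beam SDK installed. Confirm the `Image` and `Volume` arguments against the Beam SDK reference for your version.

```python
# Path where downloaded models are cached between container starts (assumed value).
volume_path = "./cached_models"

# Packages installed into the remote container (assumed list).
python_packages = ["whisperx", "requests"]


def build_image():
    """Create the container image definition for the WhisperX endpoint."""
    from beam import Image  # requires the Beam SDK

    return Image(
        python_version="python3.10",
        python_packages=python_packages,
        # WhisperX decodes audio through ffmpeg, so install it in the image.
        commands=["apt-get update && apt-get install -y ffmpeg"],
    )


def build_volume():
    """Create the persistent volume used to cache model weights."""
    from beam import Volume  # requires the Beam SDK

    return Volume(name="cached_models", mount_path=volume_path)
```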
Explanation:
- python_packages: Lists the Python packages to be installed in the environment.
- volume_path: Specifies the path where models will be cached to avoid re-downloading.
Step 2: Loading the WhisperX Model
We use the on_start function to load the WhisperX models before serving any requests. This ensures that the models are loaded once per container and kept in memory, so individual requests do not pay the model loading cost.
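A loader along these lines could serve as the on_start function. It is a sketch: the model size, compute type, language, and cache path are assumed values, and returning a dict is simply one convenient way to hand several objects to the request handler.

```python
volume_path = "./cached_models"  # assumed cache location on the mounted volume
device = "cuda"                  # run inference on the GPU
compute_type = "float16"         # "int8" uses less memory, possibly with lower accuracy
model_name = "large-v2"          # Whisper model size (assumed choice)
language = "en"                  # fixed language for the alignment model (assumed)


def load_models():
    """Load the ASR and alignment models once, when the container starts."""
    import whisperx  # available inside the remote container

    model = whisperx.load_model(
        model_name,
        device,
        compute_type=compute_type,
        download_root=volume_path,  # cache weights on the volume
        language=language,
    )
    align_model, metadata = whisperx.load_align_model(
        language_code=language, device=device
    )
    return {
        "model": model,
        "align_model": align_model,
        "metadata": metadata,
        "device": device,
    }
```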
Explanation:
- device = "cuda": Specifies that we want to use the GPU for inference.
- model_name: Specifies which Whisper model size to load. Larger models are generally more accurate but need more GPU memory and take longer to run.
- download_root: Uses the specified volume path to cache downloaded models.
Step 3: Creating the Transcription API
We define the API endpoint that will handle audio file uploads, transcribe them using WhisperX, and return the transcribed text.
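One possible shape for the handler is sketched below. It assumes the request carries the audio location in a `url` field, that the on_start function returned a dict with `model`, `align_model`, `metadata`, and `device` keys, and that a Hugging Face token is exposed as an `HF_TOKEN` environment variable. The decorator is shown only in a comment so the sketch can run locally, and its resource values are placeholders.

```python
import os
import tempfile

# In app.py this function would be registered with Beam, for example:
#
#   @endpoint(name="whisperx-deployment", image=image, cpu=4, memory="32Gi",
#             gpu="A10G", volumes=[volume], on_start=load_models)
#   def transcribe_audio(context, **inputs): ...


def build_transcript(segments):
    """Render segments as 'SPEAKER: text' lines; unlabeled segments get UNKNOWN."""
    return "\n".join(
        f"{seg.get('speaker', 'UNKNOWN')}: {seg['text'].strip()}" for seg in segments
    )


def transcribe_audio(context, **inputs):
    """Download the audio, then transcribe, align, and optionally diarize it."""
    audio_url = inputs.get("url")
    if not audio_url:
        return {"error": "Provide the audio file location in the 'url' field."}

    # Heavy dependencies are imported lazily; they exist in the remote container.
    import requests
    import whisperx

    loaded = context.on_start_value  # value returned by the on_start function
    device = loaded["device"]

    download = requests.get(audio_url, timeout=60)
    download.raise_for_status()
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(download.content)
        tmp.flush()
        audio = whisperx.load_audio(tmp.name)

    # Transcription, then alignment for word-level timestamps.
    result = loaded["model"].transcribe(audio, batch_size=16)
    result = whisperx.align(
        result["segments"], loaded["align_model"], loaded["metadata"],
        audio, device, return_char_alignments=False,
    )

    # Optional speaker diarization, only when requested and a token is set.
    hf_token = os.environ.get("HF_TOKEN")
    if inputs.get("diarize") and hf_token:
        from whisperx.diarize import DiarizationPipeline

        diarizer = DiarizationPipeline(use_auth_token=hf_token, device=device)
        result = whisperx.assign_word_speakers(diarizer(audio), result)

    return {"transcript": build_transcript(result["segments"])}
```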
Explanation:
- @endpoint Decorator: Registers the function as an API endpoint with Beam.
  - name: Specifies the name of the deployment.
  - image: Uses the custom image we defined earlier.
  - cpu, memory, gpu: Allocates resources for the endpoint.
  - volumes: Mounts the volume for cached models.
  - on_start: Specifies the function to run when the endpoint starts.
- Function Logic:
  - Transcription: Uses the WhisperX model to transcribe the audio.
  - Alignment: Aligns words to get precise timestamps.
  - Speaker Diarization: Optionally labels different speakers in the audio.
  - Preparing Output: Constructs the transcript with speaker labels.
Step 4: Deploying the API
With the code in place, you can deploy the API to Beam's serverless infrastructure.
Deploy the Application
Run the following command to deploy your application:
beam deploy app.py:transcribe_audio
Explanation:
- beam deploy: Command to deploy the application.
- app.py:transcribe_audio: Specifies the file and function to deploy.
Beam will handle building the container image, setting up the environment, and exposing the API endpoint.
Step 5: Invoking the API
After deployment, you can invoke the API by sending a POST request whose JSON body contains the URL of your audio file.
Example Invocation Using Python
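The client below is a minimal sketch of such a request. The `url` payload field, the Bearer authorization scheme, and the generous timeout are assumptions; the endpoint URL and token come from your own Beam deployment, and the function names are illustrative.

```python
def build_request(audio_url, auth_token):
    """Build the headers and JSON payload for a transcription request."""
    headers = {
        "Authorization": f"Bearer {auth_token}",
        "Content-Type": "application/json",
    }
    payload = {"url": audio_url}
    return headers, payload


def transcribe(endpoint_url, audio_url, auth_token, timeout=600):
    """POST the audio URL to the endpoint and return the parsed JSON response."""
    import requests  # requires `pip install requests`

    headers, payload = build_request(audio_url, auth_token)
    response = requests.post(
        endpoint_url, headers=headers, json=payload, timeout=timeout
    )
    response.raise_for_status()  # surface HTTP errors instead of parsing them
    return response.json()


# Example call (placeholders, replace with your own values):
#   result = transcribe("<your endpoint URL>", "<audio file URL>", "AUTH_TOKEN")
#   print(result)
```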
Explanation:
- Payload: Constructs a JSON payload with the audio URL.
- Headers:
  - Authorization: Replace AUTH_TOKEN with your actual authentication token provided by Beam.
  - Content-Type: Specifies that we're sending JSON data.
- Making the Request: Sends a POST request to the API endpoint.
- Handling the Response: Checks for errors and prints the transcription result.
Conclusion
You've successfully deployed WhisperX as a serverless speech recognition API using Beam. Your API can now handle:
- Transcription: Converting audio to text.
- Word Alignment: Providing precise timestamps for each word.
- Speaker Diarization: Identifying and labeling different speakers in the audio.
Common WhisperX Errors
CUDA out of memory
Use a smaller model, reduce batch size, switch compute type to int8, or run on a larger GPU.
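One way to apply these options is to retry with progressively cheaper settings. The sketch below is an illustrative policy, not WhisperX behavior: it halves the batch size down to 1 and only then switches to int8, since a smaller batch does not affect accuracy. The helper names are hypothetical.

```python
def fallback_configs(model_name="large-v2", batch_size=16, compute_type="float16"):
    """Yield settings to try in order after a CUDA out-of-memory error."""
    yield {"model_name": model_name, "batch_size": batch_size,
           "compute_type": compute_type}

    # First reduce the batch size, which lowers memory use without changing output.
    size = batch_size
    while size > 1:
        size //= 2
        yield {"model_name": model_name, "batch_size": size,
               "compute_type": compute_type}

    # Then fall back to int8 at the smallest batch size.
    if compute_type != "int8":
        yield {"model_name": model_name, "batch_size": size,
               "compute_type": "int8"}


def free_gpu_memory():
    """Release cached GPU memory between attempts or between jobs."""
    import gc

    import torch

    gc.collect()
    torch.cuda.empty_cache()
```

If every configuration still fails, the remaining options from the list above are a smaller model or a larger GPU.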
Diarization returns access errors
Check that your Hugging Face token is set and that you accepted the pyannote model terms.
Alignment fails
Confirm the detected language and alignment model match the audio language. Some languages have better alignment model support than others.
API requests time out
Long audio files can take time to transcribe and align. For production APIs, use background jobs or async processing instead of a short request timeout.
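The background-job pattern can be sketched with the standard library alone. In this illustration, `JobQueue` is a hypothetical in-process helper: a client submits a job, receives an ID immediately, and polls for the result. A real deployment would more likely use a managed task queue and a persistent job store.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor


class JobQueue:
    """Run jobs in the background and let callers poll for their status."""

    def __init__(self, worker, max_workers=1):
        # One worker by default so GPU jobs run one at a time.
        self._worker = worker
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, payload):
        """Queue a job and return its ID without waiting for completion."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(self._worker, payload)
        return job_id

    def wait(self, job_id, timeout=None):
        """Block until the job finishes and return its result."""
        return self._jobs[job_id].result(timeout=timeout)

    def status(self, job_id):
        """Return the job state, plus the result or error once finished."""
        future = self._jobs[job_id]
        if not future.done():
            return {"status": "pending"}
        error = future.exception()
        if error is not None:
            return {"status": "failed", "error": str(error)}
        return {"status": "done", "result": future.result()}
```

An API built this way would expose one route that calls `submit` and another that calls `status`, so no single HTTP request has to stay open for the length of a transcription.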