
WhisperX Tutorial: Install, Diarization, API Server, and Cloud Deployment

Hassaan Qadir
November 17 2024

WhisperX extends OpenAI's Whisper with speaker diarization and word-level timestamp alignment. In this guide, we'll walk through deploying WhisperX on a serverless GPU using Beam. By the end of this tutorial, you'll have a working REST API that transcribes audio files and returns timestamps and speaker labels.

WhisperX is a speech transcription toolkit built around fast Whisper-style ASR, forced alignment for word-level timestamps, voice activity detection, and optional speaker diarization. It is useful when you need more than a plain transcript: subtitles, speaker labels, timestamps, or a transcription API for production workloads.

This guide covers how WhisperX works, how to install it, how to run transcription and diarization, and how to deploy it as an API on a cloud GPU.

What Is WhisperX?

WhisperX combines several pieces of a transcription pipeline:

  • Batched speech recognition for faster transcription.
  • Voice activity detection to split audio into speech regions.
  • Forced alignment to produce more accurate word-level timestamps.
  • Optional speaker diarization through pyannote models.

This makes WhisperX a good fit for subtitle generation, podcast and meeting transcription, speaker-labeled transcripts, and API-based audio processing.

WhisperX vs Whisper

| Feature | Whisper | WhisperX |
|---|---|---|
| Transcription | Yes | Yes |
| Batched inference | Limited | Yes |
| Word-level timestamps | Basic/approximate | Forced alignment |
| Speaker diarization | No built-in diarization | Optional via pyannote |
| Production API | Requires wrapping | Requires wrapping, but better structured outputs |

How to Install WhisperX

The simplest install path is:

pip install whisperx

For GPU inference, make sure your PyTorch installation matches your CUDA version. If you plan to use speaker diarization, you also need a Hugging Face token and access to the relevant pyannote diarization models.
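Before the first run, a quick environment check can save a failed job. The sketch below is an illustration rather than part of WhisperX: `check_environment` is a hypothetical helper, and `HF_TOKEN` is an assumed name for the environment variable that holds your Hugging Face token.

```python
import importlib.util
import os


def check_environment(token_var="HF_TOKEN"):
    """Report whether the pieces WhisperX needs appear to be present."""
    report = {
        "whisperx_installed": importlib.util.find_spec("whisperx") is not None,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
        "hf_token_set": bool(os.environ.get(token_var)),
    }
    if report["torch_installed"]:
        # Imported only when present, so the check works on machines without torch.
        import torch

        report["cuda_available"] = torch.cuda.is_available()
    return report


for name, ok in check_environment().items():
    print(f"{name}: {ok}")
```

If `cuda_available` is False on a GPU machine, the usual suspect is a PyTorch build that does not match the installed CUDA version.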

WhisperX Quickstart

A typical WhisperX pipeline loads audio, transcribes it, aligns the output for word-level timestamps, and optionally assigns speaker labels.
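A minimal sketch of that pipeline follows, based on the calls shown in the WhisperX README (`load_audio`, `load_model`, `transcribe`, `load_align_model`, `align`). The model size, batch size, and compute type are assumptions you should tune for your GPU, and `segments_to_lines` is a hypothetical helper added only for inspecting the output.

```python
def transcribe_and_align(audio_path, model_name="large-v2", device="cuda",
                         compute_type="float16", batch_size=16):
    """Transcribe an audio file, then align it for word-level timestamps."""
    import whisperx  # deferred so this module imports without the GPU stack

    audio = whisperx.load_audio(audio_path)
    model = whisperx.load_model(model_name, device, compute_type=compute_type)
    result = model.transcribe(audio, batch_size=batch_size)

    # The alignment model is chosen from the language detected during transcription.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(result["segments"], align_model, metadata, audio, device)


def segments_to_lines(segments):
    """Render segments as '[start-end] text' lines for a quick look at the output."""
    return [
        f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text'].strip()}"
        for seg in segments
    ]
```

Calling `segments_to_lines(transcribe_and_align("meeting.wav")["segments"])` would give one readable line per segment; `meeting.wav` is a placeholder file name.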

Speaker Diarization with WhisperX

Speaker diarization labels who spoke when. In WhisperX, diarization depends on pyannote models from Hugging Face.

  1. Create a Hugging Face account.
  2. Generate a read token.
  3. Accept the model terms for the required pyannote diarization and segmentation models.
  4. Pass the token to WhisperX when running diarization.

If diarization fails with a 403 or access error, the most common cause is that the Hugging Face token is valid but the model terms have not been accepted.
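The steps above can be sketched in code as follows. This is a hedged example: `HF_TOKEN` is an assumed environment variable name, `get_hf_token` and `add_speaker_labels` are hypothetical helpers, and both the location of `DiarizationPipeline` and its token keyword have varied between WhisperX releases, so check them against your installed version.

```python
import os


def get_hf_token(var_name="HF_TOKEN"):
    """Read the Hugging Face token from the environment, failing early if absent."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(
            f"{var_name} is not set; create a read token and accept the pyannote model terms"
        )
    return token


def add_speaker_labels(audio, aligned_result, device="cuda"):
    """Run diarization and attach speaker labels to the aligned transcript."""
    import whisperx

    # Some releases expose the class as whisperx.DiarizationPipeline,
    # others only under whisperx.diarize.
    pipeline_cls = getattr(whisperx, "DiarizationPipeline", None)
    if pipeline_cls is None:
        from whisperx.diarize import DiarizationPipeline as pipeline_cls

    # Assumption: `use_auth_token` is the keyword accepted by your version.
    diarizer = pipeline_cls(use_auth_token=get_hf_token(), device=device)
    diarize_segments = diarizer(audio)
    return whisperx.assign_word_speakers(diarize_segments, aligned_result)
```

Reading the token from the environment keeps it out of source control and makes the "token missing" case fail with a clear message before any model download starts.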

Running WhisperX as an API Server

For production applications, you usually do not want to run WhisperX from the command line for every request. Instead, wrap it in an API that accepts audio files, runs transcription, and returns JSON, SRT, VTT, or another structured output. The API layer typically needs to handle:

  • Uploading or downloading audio files.
  • Loading and reusing transcription models.
  • Optional alignment and diarization.
  • Timeouts for long audio files.
  • GPU memory cleanup between jobs.
  • Returning word-level timestamps and speaker labels.

This is where a cloud deployment becomes useful: the model can stay warm, large audio jobs can run on GPU, and the API can scale separately from your application.
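Of the output formats mentioned above, SRT is straightforward to produce from segment-level results. The sketch below assumes each segment is a dict with `start` and `end` in seconds, a `text` field, and an optional `speaker` label, which matches the shape WhisperX returns; the function names are hypothetical.

```python
def format_srt_time(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Build an SRT document from segments, prefixing the speaker when present."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            text = f"[{seg['speaker']}] {text}"
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    # Cues are separated by a blank line.
    return "\n".join(blocks)
```

Keeping the formatter separate from the transcription call lets the same endpoint serve JSON or SRT depending on a request parameter.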

Overview

We will:

  • Set up the cloud GPU environment.
  • Define the compute environment and dependencies.
  • Load the WhisperX model.
  • Create the transcription API.
  • Deploy the API.
  • Invoke the API to transcribe audio files.

Step 1: Setting Up the Cloud GPU Environment

Beam simplifies the deployment of machine learning models by handling the infrastructure and scaling for you. We'll define our compute environment and specify the dependencies required for WhisperX.

Defining the Compute Environment

We start by specifying the runtime environment for WhisperX using Beam's Image class. This allows us to define the Python packages needed for our application.
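A minimal sketch of such an image definition is shown here. The package list, Python version, volume name, and the ffmpeg install command are assumptions rather than a tested configuration, and the Beam imports sit inside functions only so the sketch can be read and imported without the SDK installed.

```python
# Packages installed into the container image (pin versions for real deployments).
python_packages = ["torch", "torchaudio", "whisperx"]

# Where the model cache volume is mounted inside the container.
volume_path = "./cached_models"


def build_image():
    """Create the Beam image; requires the Beam SDK."""
    from beam import Image

    return Image(
        python_version="python3.10",
        python_packages=python_packages,
        # whisperx.load_audio shells out to ffmpeg, so it must exist in the image.
        commands=["apt-get update && apt-get install -y ffmpeg"],
    )


def build_volume(name="cached_models"):
    """Create the volume that will be mounted at volume_path."""
    from beam import Volume

    return Volume(name=name, mount_path=volume_path)
```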

Explanation:

  • python_packages: Lists the Python packages to be installed in the environment.
  • volume_path: Specifies the path where models will be cached to avoid re-downloading.

Step 2: Loading the WhisperX Model

We use the on_start function to load the WhisperX models before serving any requests. This ensures that the models are loaded once and cached in memory, improving performance.
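A sketch of such a loader might look like the following. The model size and the dict returned to the handler are assumptions; `download_root` is the `whisperx.load_model` argument that controls where weights are cached.

```python
volume_path = "./cached_models"  # assumed mount path of the model cache volume
device = "cuda"                  # run inference on the GPU
model_name = "large-v2"          # assumed model size; smaller ones load faster


def load_models():
    """Load the ASR model once per container and hand it to the request handler."""
    import whisperx  # imported here so it is only required inside the container

    model = whisperx.load_model(
        model_name,
        device,
        download_root=volume_path,  # reuse weights cached on the volume
    )
    # Whatever this function returns is what the handler later reads back.
    return {"model": model, "device": device}
```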

Explanation:

  • device = "cuda": Specifies that we want to use the GPU for inference.
  • model_name: Specifies the size of the Whisper model that WhisperX loads. Larger models generally provide better accuracy at the cost of more GPU memory and slower inference.
  • download_root: Uses the specified volume path to cache downloaded models.

Step 3: Creating the Transcription API

We define the API endpoint that accepts the URL of an audio file, transcribes the audio using WhisperX, and returns the transcript.
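The sketch below shows one way such a handler could be written. Several things are assumptions: the request field is called `url`, the on_start function returned a dict with `model` and `device`, the resource values are placeholders, and diarization is left out to keep the example short. `build_transcript` and `register_endpoint` are hypothetical helper names.

```python
def build_transcript(segments):
    """Collapse segments into 'SPEAKER: text' lines."""
    lines = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        lines.append(f"{speaker}: {seg['text'].strip()}")
    return "\n".join(lines)


def transcribe_audio(context, **inputs):
    """Download the audio at inputs['url'], transcribe it, align it, return JSON."""
    url = inputs.get("url")
    if not url:
        return {"error": "Missing 'url' in request body"}

    import tempfile
    import urllib.request
    import whisperx

    state = context.on_start_value  # the value returned by the on_start function
    model, device = state["model"], state["device"]

    with tempfile.NamedTemporaryFile(suffix=".audio") as tmp:
        urllib.request.urlretrieve(url, tmp.name)
        audio = whisperx.load_audio(tmp.name)

    result = model.transcribe(audio, batch_size=16)
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
    return {
        "transcript": build_transcript(aligned["segments"]),
        "segments": aligned["segments"],
    }


def register_endpoint(image, volume, on_start):
    """Apply Beam's endpoint decorator to the handler; resources are placeholders."""
    from beam import endpoint

    return endpoint(
        name="whisperx-deployment",
        image=image,
        cpu=4,
        memory="32Gi",
        gpu="A10G",
        volumes=[volume],
        on_start=on_start,
    )(transcribe_audio)
```

In a real app.py you would more typically write `@endpoint(...)` directly above `transcribe_audio`; the decorator is wrapped in a function here only so the sketch imports without the Beam SDK.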

Explanation:

  • @endpoint Decorator: Registers the function as an API endpoint with Beam.
    • name: Specifies the name of the deployment.
    • image: Uses the custom image we defined earlier.
    • cpu, memory, gpu: Allocates resources for the endpoint.
    • volumes: Mounts the volume for cached models.
    • on_start: Specifies the function to run when the endpoint starts.
  • Function Logic:
    • Transcription: Uses the WhisperX model to transcribe the audio.
    • Alignment: Aligns words to get precise timestamps.
    • Speaker Diarization: Optionally labels different speakers in the audio.
    • Preparing Output: Constructs the transcript with speaker labels.

Step 4: Deploying the API

With the code in place, you can deploy the API to Beam's serverless infrastructure.

Deploy the Application

Run the following command to deploy your application:

beam deploy app.py:transcribe_audio

Explanation:

  • beam deploy: Command to deploy the application.
  • app.py:transcribe_audio: Specifies the file and function to deploy.

Beam will handle building the container image, setting up the environment, and exposing the API endpoint.

Step 5: Invoking the API

After deployment, you can invoke the API by sending a POST request with a URL to your audio file.

Example Invocation Using Python
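A minimal client sketch follows. It uses the standard library's `urllib` so it runs without extra packages; the `url` payload field and the `Bearer` authorization scheme are assumptions to check against your deployment, and the endpoint URL is whatever Beam prints after deploying.

```python
import json
import urllib.error
import urllib.request


def build_request(endpoint_url, auth_token, audio_url):
    """Build the POST request: JSON payload plus auth and content-type headers."""
    payload = json.dumps({"url": audio_url}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {auth_token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        endpoint_url, data=payload, headers=headers, method="POST"
    )


def transcribe(endpoint_url, auth_token, audio_url, timeout=600):
    """Send the request and return the decoded JSON body, or raise with the server's message."""
    request = build_request(endpoint_url, auth_token, audio_url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        detail = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"Request failed with HTTP {err.code}: {detail}") from err
```

Calling `transcribe(endpoint_url, "AUTH_TOKEN", audio_url)` with your real values returns the parsed JSON response as a dict.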

Explanation:

  • Payload: Constructs a JSON payload with the audio URL.
  • Headers:
    • Authorization: Replace AUTH_TOKEN with your actual authentication token provided by Beam.
    • Content-Type: Specifies that we're sending JSON data.
  • Making the Request: Sends a POST request to the API endpoint.
  • Handling the Response: Checks for errors and prints the transcription result.

Conclusion

You've successfully deployed WhisperX as a serverless speech recognition API using Beam. Your API can now handle:

  • Transcription: Converting audio to text.
  • Word Alignment: Providing precise timestamps for each word.
  • Speaker Diarization: Identifying and labeling different speakers in the audio.


Common WhisperX Errors

CUDA out of memory

Use a smaller model, reduce batch size, switch compute type to int8, or run on a larger GPU.
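One way to apply this advice automatically is a fallback ladder that tries cheaper settings after each out-of-memory failure. The helpers below are hypothetical, not part of WhisperX, and note that changing the compute type means reloading the model.

```python
def fallback_settings(batch_size=16, compute_type="float16"):
    """Yield progressively cheaper (batch_size, compute_type) pairs to try after an OOM."""
    yield batch_size, compute_type
    size = batch_size
    while size > 1:
        size //= 2  # halve the batch until only one chunk is processed at a time
        yield size, compute_type
    if compute_type != "int8":
        yield 1, "int8"  # last resort: quantized weights with the smallest batch


def release_gpu_memory():
    """Drop unreferenced Python objects, then release PyTorch's cached CUDA memory."""
    import gc

    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```

A caller would loop over `fallback_settings()`, catch the out-of-memory error from each attempt, call `release_gpu_memory()`, and move on to the next pair.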

Diarization returns access errors

Check that your Hugging Face token is set and that you accepted the pyannote model terms.

Alignment fails

Confirm the detected language and alignment model match the audio language. Some languages have better alignment model support than others.
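If word-level timestamps are optional for your use case, the API can fall back to segment-level output when no alignment model is available. This sketch assumes a missing alignment model surfaces as a `ValueError` from `load_align_model`; the `wx` parameter is a hypothetical hook that lets you pass a stand-in module for testing.

```python
def align_or_fallback(result, audio, device="cuda", wx=None):
    """Try forced alignment; return segment-level output if no align model loads."""
    if wx is None:
        import whisperx as wx

    try:
        align_model, metadata = wx.load_align_model(
            language_code=result["language"], device=device
        )
    except ValueError as err:
        # Assumption: an unsupported language raises ValueError here.
        return {"segments": result["segments"], "aligned": False, "reason": str(err)}

    aligned = wx.align(result["segments"], align_model, metadata, audio, device)
    aligned["aligned"] = True
    return aligned
```

Returning an explicit `aligned` flag lets clients tell whether the timestamps came from forced alignment or from the coarser transcription pass.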

API requests time out

Long audio files can take time to transcribe and align. For production APIs, use background jobs or async processing instead of a short request timeout.
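The submit-then-poll pattern can be sketched with the standard library alone. `JobQueue` below is a hypothetical in-memory illustration: jobs are lost if the process restarts, so a production system would back this with a durable queue.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor, wait as wait_futures


class JobQueue:
    """Run slow jobs in the background and let clients poll by job id."""

    def __init__(self, max_workers=1):
        # A single worker keeps one GPU from being oversubscribed.
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs = {}

    def submit(self, func, *args, **kwargs):
        """Start the job and return an id immediately instead of blocking."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args, **kwargs)
        return job_id

    def status(self, job_id):
        """Return the job's current state, plus its result or error once finished."""
        future = self._jobs.get(job_id)
        if future is None:
            return {"state": "unknown"}
        if not future.done():
            return {"state": "running"}
        if future.exception() is not None:
            return {"state": "failed", "error": str(future.exception())}
        return {"state": "done", "result": future.result()}

    def wait(self, job_id, timeout=None):
        """Block until the job finishes or the timeout passes, then report status."""
        wait_futures([self._jobs[job_id]], timeout=timeout)
        return self.status(job_id)
```

An API built this way returns the job id from the upload request and exposes a second route that calls `status`, so no single HTTP request has to stay open for the length of the audio.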
