Fast Text-to-Speech Inference with Parler TTS
Mia Gouffray
Parler TTS is a lightweight model that generates high-quality, natural-sounding audio from your text. With Parler, you control how the audio sounds: you can customize tone, pace, style, and more to match your needs.
Based on the research and work of Dan Lyth and Simon King, Parler TTS is a fully open-source release, providing public access to all datasets, preprocessing tools, training code, and model weights under a permissive license. This allows you to modify the existing model, enabling you to develop a custom solution for your specific needs.
There are two main models, Parler Mini and Parler Large, both trained on 45k hours of annotated audio. Mini is faster but less accurate, whereas Large takes longer but produces better results. For more information on the models, check out the Hugging Face documentation.
How to use Parler TTS
Now that we’ve explored what Parler TTS is and what it offers, let’s look at how you can set it up and start integrating it into your own projects. The steps below deploy it as an endpoint on Beam; to set up Beam, you can follow this guide.
1. Install the Parler TTS package and import necessary modules
2. Load your model
3. Create your Beam endpoint
4. Create a containerized Python environment (using Image) for Parler-TTS from Hugging Face
5. Create a function to generate your TTS output using the Parler model
6. Deploy your API to Beam from the command line
7. Send a POST request to your endpoint with your prompt and description
The endpoint then returns the generated audio file.
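As a rough sketch of steps 2, 5, and 7, the Python below loads a checkpoint, generates a WAV file, and builds the POST request a client would send. It is not the code from the Beam guide: the helper names (`load_models`, `parse_inputs`, `generate_speech`, `build_request`), the output path, the endpoint URL, and the bearer-token header are assumptions made for illustration. The Beam-specific pieces (the endpoint decorator, the `Image` definition, and the deploy command) are left out because they depend on your Beam setup. The `parler_tts`, `transformers`, and `soundfile` imports sit inside the functions so the file can be imported on a machine without the model stack installed.

```python
import json
import urllib.request

# Assumed checkpoint name on the Hugging Face Hub; swap in the Large model if needed.
MODEL_ID = "parler-tts/parler-tts-mini-v1"


def load_models(device="cuda:0"):
    """Step 2: load the model and tokenizer once, at startup."""
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    model = ParlerTTSForConditionalGeneration.from_pretrained(MODEL_ID).to(device)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    return model, tokenizer


def parse_inputs(inputs):
    """Pull the two required fields out of the request payload."""
    prompt = inputs.get("prompt")
    description = inputs.get("description")
    if not prompt or not description:
        raise ValueError("both 'prompt' and 'description' are required")
    return prompt, description


def generate_speech(model, tokenizer, inputs, out_path="output.wav"):
    """Step 5: turn a prompt and a voice description into a WAV file."""
    import soundfile as sf

    prompt, description = parse_inputs(inputs)
    # The description conditions the voice; the prompt is the text to speak.
    input_ids = tokenizer(description, return_tensors="pt").input_ids.to(model.device)
    prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
    audio = generation.cpu().numpy().squeeze()
    sf.write(out_path, audio, model.config.sampling_rate)
    return out_path


def build_request(url, token, prompt, description):
    """Step 7: build (but do not send) the POST request for the endpoint."""
    body = json.dumps({"prompt": prompt, "description": description}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the endpoint expects a bearer token.
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Sending the request is then a matter of passing the result of `build_request` to `urllib.request.urlopen`, with the URL and token taken from your own deployment.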
For more information about setting up Parler TTS with Beam, check out this guide we have created. Now that we’ve looked at how to set up Parler TTS, we can look further into how you can customize the model to meet your needs.
Customizing Parler TTS
A key feature of Parler is its ability to let users customize the speaker, tone, and pace to suit their needs. You can guide the model to generate speech that closely resembles a specific speaker by including the speaker’s name or a descriptive phrase in the input. The model has 34 predefined speaker options and allows further customization by adjusting speech style through descriptive prompts. For example, you could specify that you want:
“A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker’s voice sounding clear and very close up.”
The model will then generate an audio file that accurately reflects the description you’ve provided.
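One way to keep descriptions consistent is to assemble them from a few named attributes. The helper below is a minimal sketch, not part of Parler TTS: `build_description` and its parameters are hypothetical, and the sentence template simply follows the example description above. "Jon" is used as an example speaker name; check the model card for the actual list of predefined speakers.

```python
def build_description(
    speaker="A female speaker",
    style="slightly expressive and animated",
    pace="moderate",
    pitch="moderate",
    quality="very high",
):
    """Compose a voice description string from individual attributes."""
    delivery = f"{speaker} delivers a {style} speech with a {pace} speed and {pitch} pitch."
    recording = (
        f"The recording is of {quality} quality, with the speaker's voice "
        "sounding clear and very close up."
    )
    return f"{delivery} {recording}"


# Default attributes, close to the example description above.
default_description = build_description()

# Naming a speaker instead of describing one (the name is an assumption).
named_description = build_description(speaker="Jon", style="monotone", pace="slightly fast")
```

The resulting string is what you would pass as the description alongside your prompt.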
Optimizing Parler TTS Performance
To make text-to-speech generation faster and more efficient, check out the inference guide, which covers the key optimization techniques. Methods such as scaled dot-product attention (SDPA), torch.compile, streaming, and batching can significantly improve the model’s speed. Running the model on a CPU can take a considerable amount of time, so a GPU is recommended. Consider using Google Colab or a Kaggle notebook to run the model.
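Two of these techniques, SDPA and batching, might look roughly like the sketch below. It reflects one reading of the Parler TTS inference guide rather than a verified recipe: the `attn_implementation="sdpa"` argument, the left-padded tokenizer, and the `sequences` and `audios_length` fields on the generation output are assumptions to check against the version you install, and `chunk`, `load_model_sdpa`, and `generate_batch` are hypothetical helper names. torch.compile and streaming are not shown.

```python
def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def load_model_sdpa(model_id="parler-tts/parler-tts-mini-v1", device="cuda:0"):
    """Load the model with SDPA attention and a tokenizer set up for batching."""
    from parler_tts import ParlerTTSForConditionalGeneration
    from transformers import AutoTokenizer

    model = ParlerTTSForConditionalGeneration.from_pretrained(
        model_id, attn_implementation="sdpa"
    ).to(device)
    # Assumption: left padding, so that padded prompts line up at generation time.
    tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
    return model, tokenizer


def generate_batch(model, tokenizer, prompts, descriptions):
    """Generate one waveform per (prompt, description) pair in a single call."""
    desc = tokenizer(descriptions, return_tensors="pt", padding=True).to(model.device)
    text = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    out = model.generate(
        input_ids=desc.input_ids,
        attention_mask=desc.attention_mask,
        prompt_input_ids=text.input_ids,
        prompt_attention_mask=text.attention_mask,
        do_sample=True,
        return_dict_in_generate=True,
    )
    # Trim each waveform to its own length, since shorter items are padded.
    return [
        out.sequences[i, : out.audios_length[i]].cpu().numpy()
        for i in range(len(prompts))
    ]
```

For a long list of sentences, `chunk(sentences, 4)` would yield groups of up to four prompts to pass to `generate_batch`; the right batch size depends on your GPU memory.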
Tips for Generating Clear and Natural Speech
- Provide a clear description of the desired speaking style including pace, tone, and gender
- Review the different speakers and select one that aligns with your goals
- Use terms like “very clear audio” for high-quality results with no background noise, or “very noisy” to add background noise to your audio
- Utilize punctuation to allow the model to create natural pauses and improve the rhythm of the speech
- Experiment with different descriptions to find the one that produces the best results
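To experiment systematically, you could generate a small grid of descriptions and listen to the output of each. The sketch below is only an illustration: `description_variants` is a hypothetical helper, and the sentence template and attribute values are assumptions rather than phrasing the model requires.

```python
from itertools import product


def description_variants(speaker, paces, tones, noise_levels):
    """Build one description for every combination of the given attributes."""
    variants = []
    for pace, tone, noise in product(paces, tones, noise_levels):
        variants.append(
            f"{speaker} speaks at a {pace} pace in a {tone} tone. "
            f"The recording has {noise}."
        )
    return variants


variants = description_variants(
    "A male speaker",
    paces=["slow", "moderate"],
    tones=["calm", "animated"],
    noise_levels=["very clear audio", "very noisy audio"],
)

for text in variants:
    print(text)
```

Each string can then be sent as the description with the same prompt, so that only the voice changes between runs.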
Conclusion
Now that you’re familiar with Parler TTS—how it works, how to use it, and its key features—you’re ready to set it up and run it as a serverless API on Beam. The provided guide offers a step-by-step tutorial on utilizing Parler as your TTS tool.
If you’d like to try Parler TTS on Beam, you can sign up and get started today!




