Creating text-to-speech (TTS) solutions for lengthy content often poses challenges, such as maintaining natural speech quality and managing computational limitations.
In this blog, we’ll explore how to convert long text into high-quality speech using Microsoft’s SpeechT5 model and chunking techniques, ensuring seamless audio synthesis.
Introducing Microsoft’s SpeechT5
Microsoft’s SpeechT5 is a powerful text-to-speech model available on Hugging Face. It supports speaker embeddings for personalized speech synthesis, making it ideal for creating natural and expressive audio output. Because pre-trained checkpoints are available, you can generate speech without extensive training, saving time and computational resources. TTS also enhances AI chatbot development by making interactions more engaging and dynamic. This guide uses the SpeechT5 model with the Hugging Face Transformers library in a Google Colab environment.
Why Use Chunking for Text-to-Speech?
Chunking is the process of breaking down long text into smaller, manageable pieces. This approach ensures that:
- The model doesn’t hit token length limits.
- Speech output remains natural and intelligible.
- Resource usage is optimized, especially when processing on limited hardware.
We’ll walk you through the Python implementation using the Hugging Face Transformers library and the SpeechT5 model.
Prerequisites
Ensure you have Python installed and set up your environment with the required libraries:
```
!pip install --upgrade pip
!pip install --upgrade transformers sentencepiece datasets
!pip install torch
!pip install pydub
!apt-get install -y ffmpeg
```
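After installing, a quick import check confirms the environment is ready. This is just a minimal sanity-check sketch; the printed versions will vary with your setup:

```python
# Verify that the core libraries import cleanly
import transformers
import torch
import datasets
import pydub

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("datasets:", datasets.__version__)
```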
Step-by-Step Implementation
1. Setting Up the SpeechT5 Pipeline
```python
from transformers import pipeline
import torch
from datasets import load_dataset
import soundfile as sf
```
```python
# Initialize the text-to-speech pipeline
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")

# Load the speaker embedding dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Generate sample speech
speech = synthesiser(
    "Hello, my dog is cooler than you!",
    forward_params={"speaker_embeddings": speaker_embedding},
)
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])
```
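The xvector index controls the voice. If you want to audition other voices before committing to one, a small loop over a few indices works. This is a minimal sketch; the indices below are arbitrary picks from the validation split, not curated voices:

```python
# Optional: audition a few different speaker embeddings (arbitrary indices)
for idx in [0, 2000, 7306]:
    emb = torch.tensor(embeddings_dataset[idx]["xvector"]).unsqueeze(0)
    sample = synthesiser(
        "Testing voice variation.",
        forward_params={"speaker_embeddings": emb},
    )
    sf.write(f"voice_{idx}.wav", sample["audio"], samplerate=sample["sampling_rate"])
```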
2. Defining the Text and Chunking Function
We use a sample passage to demonstrate splitting text into smaller chunks:
```python
# Full input text
text = """Centuries ago, in a small village of Kaladi in Kerala, a prodigious boy named Adi Shankaracharya was born. From a young age, he showed remarkable wisdom and a deep desire for spiritual knowledge. At just eight years old, Adi sought his mother’s permission to become a monk. Despite her hesitation, she eventually blessed him. He set out on a journey across India, walking barefoot to spread the message of Advaita Vedanta—the philosophy of non-duality, teaching that the soul and the divine are one. One day, in Varanasi, he encountered a low-caste man blocking his path. When asked to move, the man replied, “If everything is one, what should move away from what?” Realizing the profound truth in his words, Shankaracharya bowed and wrote his famous hymn, Manisha Panchakam. Through debates with scholars, Shankaracharya revived Sanatana Dharma, unified spiritual practices, and established the four mathas or monastic centers that continue to guide seekers to this day."""

# Function to split text into smaller chunks at sentence boundaries
def split_text(text, max_length=200):
    sentences = text.split(". ")
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        sentence = sentence.strip()
        # Restore the period that split() removed (the final sentence keeps its own)
        if not sentence.endswith("."):
            sentence += "."
        # Add the sentence to the current chunk if it still fits
        if len(current_chunk) + len(sentence) + 1 <= max_length:
            current_chunk += sentence + " "
        else:
            if current_chunk:  # avoid emitting an empty chunk
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks

# Split the text into manageable chunks
text_chunks = split_text(text, max_length=200)

# Debugging: print the chunks
print(f"Total Chunks: {len(text_chunks)}")
for i, chunk in enumerate(text_chunks):
    print(f"Chunk {i+1}: {chunk}")
```
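Since the model’s real constraint is token count rather than character count, it is worth confirming that each chunk stays under the model’s input limit. The sketch below reads the limit from the model config rather than hardcoding a number (the processor and model are the same checkpoints loaded in the next step):

```python
# Optional check: count tokens per chunk against the model's positional limit
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech

check_processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
check_model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

limit = check_model.config.max_text_positions  # read from the config, not hardcoded
for i, chunk in enumerate(text_chunks):
    n_tokens = check_processor(text=chunk, return_tensors="pt")["input_ids"].shape[-1]
    print(f"Chunk {i+1}: {n_tokens} tokens (limit: {limit})")
```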
3. Generating Speech for Each Chunk
```python
from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan

# Load the processor, model, and vocoder
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Generate speech for each chunk
speech_outputs = []
for idx, chunk in enumerate(text_chunks):
    print(f"Processing chunk {idx+1}/{len(text_chunks)}: {chunk}")
    inputs = processor(text=chunk, return_tensors="pt")
    speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

    # Save the individual chunk audio (SpeechT5 outputs 16 kHz audio)
    audio = speech.numpy()
    speech_outputs.append(audio)
    sf.write(f"chunk_{idx}.wav", audio, samplerate=16000)
```
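Before stitching the files together, you can estimate the total duration from the raw waveforms as a quick sanity check. This is a small sketch that relies on SpeechT5 generating mono audio at 16 kHz:

```python
# Estimate total audio duration from the collected waveforms (16 kHz mono)
total_samples = sum(len(w) for w in speech_outputs)
print(f"{len(speech_outputs)} chunks, ~{total_samples / 16000:.1f} s of audio")
```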
4. Combining the Chunks into a Single Audio File
```python
from pydub import AudioSegment

# Combine all audio chunks into one file
combined_audio = AudioSegment.empty()
for idx in range(len(text_chunks)):
    audio_chunk = AudioSegment.from_file(f"chunk_{idx}.wav")
    combined_audio += audio_chunk

# Export the combined audio
combined_audio.export("final_speech.wav", format="wav")
print("Full speech has been saved as 'final_speech.wav'")
```
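If you would rather skip the intermediate WAV files, the same result can be assembled in memory with NumPy. This sketch also inserts a short pause between chunks, which can make sentence boundaries sound more natural; the 300 ms value is an arbitrary choice you can tune:

```python
import numpy as np

# Concatenate waveforms in memory, with 300 ms of silence between chunks
pause = np.zeros(int(0.3 * 16000), dtype=np.float32)  # 16 kHz sample rate
full_audio = np.concatenate([np.concatenate([w, pause]) for w in speech_outputs])
sf.write("final_speech_numpy.wav", full_audio, samplerate=16000)
```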
5. Listening to the Final AI-Generated Audio
```python
from IPython.display import Audio

# Load the saved audio file and play it
Audio("final_speech.wav")
```
Click the play button to hear the synthesized speech:
By leveraging Microsoft’s SpeechT5 model and chunking, you can efficiently convert extensive text into high-quality speech. This approach is ideal for audiobooks, storytelling, and creating personalized audio experiences.
Try it out and bring your text to life with the power of AI!