Transform Long Text into Speech with Microsoft’s SpeechT5 Using Chunking

Creating text-to-speech (TTS) solutions for lengthy content often poses challenges, such as maintaining natural speech quality and managing computational limitations.

In this blog, we’ll explore how to convert long text into high-quality speech using Microsoft’s SpeechT5 model and chunking techniques, ensuring seamless audio synthesis.

Introducing Microsoft’s SpeechT5

Microsoft’s SpeechT5 is a pre-trained text-to-speech model available on Hugging Face. It conditions its output on speaker embeddings, so the same text can be rendered in different voices, which makes it well suited to natural and expressive audio. Because the checkpoints are pre-trained, you can generate speech without training a model yourself, saving time and computational resources. TTS can also make AI chatbot interactions more engaging and dynamic. This guide uses the SpeechT5 model with the Hugging Face Transformers library in a Google Colab environment.

Why Use Chunking for Text-to-Speech?

Chunking is the process of breaking down long text into smaller, manageable pieces. This approach ensures that:

The model doesn’t hit token length limits.
Speech output remains natural and intelligible.
Resource usage is optimized, especially when processing on limited hardware.
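To make the first point concrete, here is a minimal sketch of a length check. It assumes a 600-token input limit (believed to be the default maximum text positions in the SpeechT5 configuration; verify it against your checkpoint) and takes the tokenizer as a plain callable so it can be tried without downloading a model. The names needs_chunking and MAX_TOKENS are illustrative and not part of any library.

```python
# Hypothetical helper: decide whether a text is too long to synthesize in one pass.
MAX_TOKENS = 600  # assumed limit; compare with model.config.max_text_positions

def needs_chunking(text, tokenize, max_tokens=MAX_TOKENS):
    """Return True if the tokenized text is longer than max_tokens.

    `tokenize` is any callable that maps a string to a sequence of tokens.
    """
    return len(tokenize(text)) > max_tokens

# Stand-in tokenizer for illustration only: one token per character.
char_tokenize = list

short_text = "Hello, my dog is cooler than you!"
long_text = "This sentence is repeated many times. " * 50

print(needs_chunking(short_text, char_tokenize))
print(needs_chunking(long_text, char_tokenize))
```

With the real model, the callable could wrap the SpeechT5 processor, for example lambda t: processor(text=t)["input_ids"], so the check counts actual tokens rather than characters.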

We’ll walk you through the Python implementation using the Hugging Face Transformers library and the SpeechT5 model.

Prerequisites

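The commands below use notebook syntax: the leading ! runs a shell command from a Colab cell. Install the required libraries, including soundfile (imported in step 1), and ffmpeg, which pydub relies on for audio conversion:

!pip install --upgrade pip
!pip install --upgrade transformers sentencepiece datasets
!pip install torch
!pip install pydub soundfile
!apt-get install -y ffmpeg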
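Before moving on, it can help to confirm that everything is actually available. The following is a small sketch using only the standard library; the helper name missing_dependencies is made up for this post, and the module list simply mirrors the packages used in this guide.

```python
import importlib.util
import shutil

def missing_dependencies(modules, executables=()):
    """Return the names of modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not installed
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on the PATH
    missing += [e for e in executables if shutil.which(e) is None]
    return missing

required_modules = ["transformers", "sentencepiece", "datasets", "torch", "pydub", "soundfile"]
missing = missing_dependencies(required_modules, executables=["ffmpeg"])
print("Missing:", missing if missing else "nothing")
```

If the list is not empty, rerun the matching install command before continuing.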

Step-by-Step Implementation

1. Setting Up the SpeechT5 Pipeline

from transformers import pipeline
import torch
from datasets import load_dataset
import soundfile as sf

# Initialize the text-to-speech pipeline
synthesiser = pipeline("text-to-speech", "microsoft/speecht5_tts")

# Load speaker embedding dataset
embeddings_dataset = load_dataset("Matthijs/cmu-arctic-xvectors", split="validation")
speaker_embedding = torch.tensor(embeddings_dataset[7306]["xvector"]).unsqueeze(0)

# Generate sample speech
speech = synthesiser("Hello, my dog is cooler than you!", forward_params={"speaker_embeddings": speaker_embedding})
sf.write("speech.wav", speech["audio"], samplerate=speech["sampling_rate"])

2. Defining the Text and Chunking Function

We use a sample passage to demonstrate splitting text into smaller chunks:

# Full input text
text = """Centuries ago, in a small village of Kaladi in Kerala, a prodigious boy named Adi Shankaracharya was born. From a young age, he showed remarkable wisdom and a deep desire for spiritual knowledge.

At just eight years old, Adi sought his mother’s permission to become a monk. Despite her hesitation, she eventually blessed him. He set out on a journey across India, walking barefoot to spread the message of Advaita Vedanta—the philosophy of non-duality, teaching that the soul and the divine are one.

One day, in Varanasi, he encountered a low-caste man blocking his path. When asked to move, the man replied, “If everything is one, what should move away from what?” Realizing the profound truth in his words, Shankaracharya bowed and wrote his famous hymn, Manisha Panchakam.

Through debates with scholars, Shankaracharya revived Sanatana Dharma, unified spiritual practices, and established the four mathas or monastic centers that continue to guide seekers to this day."""

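# Function to split text into chunks of roughly max_length characters,
# breaking only at sentence boundaries (". ")
def split_text(text, max_length=200):
  # Collapse newlines and repeated spaces so paragraph breaks do not end up inside a sentence
  sentences = " ".join(text.split()).split(". ")
  chunks = []
  current_chunk = ""
  for sentence in sentences:
    # Restore the period removed by split(), unless the sentence already ends with punctuation
    if not sentence.endswith((".", "!", "?")):
      sentence += "."
    if len(current_chunk) + len(sentence) + 1 <= max_length:
      current_chunk += sentence + " "
    else:
      # Current chunk is full: store it (if non-empty) and start a new one
      if current_chunk:
        chunks.append(current_chunk.strip())
      current_chunk = sentence + " "
  # Append the final chunk once, after the loop
  if current_chunk:
    chunks.append(current_chunk.strip())
  # Note: a single sentence longer than max_length still becomes one oversized chunk
  return chunks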

# Split the text into manageable chunks
text_chunks = split_text(text, max_length=200)

# Debugging: Print the chunks
print(f"Total Chunks: {len(text_chunks)}")
for i, chunk in enumerate(text_chunks):
  print(f"Chunk {i+1}: {chunk}")
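Splitting on ". " alone misses sentences that end in "?" or "!" and cannot shorten a sentence that is longer than the limit by itself. As a hedged alternative, the sketch below splits with a regular expression and falls back to word-level wrapping for oversized sentences. The function name split_text_robust is hypothetical, and the regex is a simple heuristic that will also split after abbreviations such as "Dr.".

```python
import re
import textwrap

def split_text_robust(text, max_length=200):
    """Greedily pack sentences into chunks of at most max_length characters."""
    # Normalize whitespace, then split after ., ! or ? (optionally followed by a closing quote)
    normalized = " ".join(text.split())
    sentences = re.split(r'(?<=[.!?])\s+|(?<=[.!?]["”’])\s+', normalized)

    # Break any sentence that is too long by itself at word boundaries
    pieces = []
    for sentence in sentences:
        if len(sentence) > max_length:
            pieces.extend(textwrap.wrap(sentence, width=max_length))
        elif sentence:
            pieces.append(sentence)

    # Pack pieces into chunks without exceeding max_length
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_length:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

sample = "Is this one? Yes! " + "word " * 60 + "end. Short closing sentence."
for chunk in split_text_robust(sample, max_length=80):
    print(len(chunk), chunk)
```

Breaking inside a sentence can produce an audible seam, so the word-level fallback is best treated as a last resort for unusually long sentences.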

3. Generating Speech for Each Chunk

from transformers import SpeechT5Processor, SpeechT5ForTextToSpeech, SpeechT5HifiGan
from pydub import AudioSegment

# Load models and vocoder
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Generate speech for each chunk
speech_outputs = []
for idx, chunk in enumerate(text_chunks):
  print(f"Processing chunk {idx+1}/{len(text_chunks)}: {chunk}")

  inputs = processor(text=chunk.strip(), return_tensors="pt")
  # speaker_embedding is the x-vector loaded in step 1
  speech = model.generate_speech(inputs["input_ids"], speaker_embedding, vocoder=vocoder)

  # Keep the waveform in memory and save it as its own file (SpeechT5 outputs 16 kHz audio)
  speech_outputs.append(speech.numpy())
  sf.write(f"chunk_{idx}.wav", speech.numpy(), samplerate=16000)
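A quick sanity check after synthesis is to look at how long each chunk's audio is. The sketch below computes durations from sample counts, assuming the 16 kHz rate used when writing the files; chunk_durations is an illustrative helper, and the stand-in lists take the place of the waveforms collected in speech_outputs.

```python
def chunk_durations(waveforms, sampling_rate=16000):
    """Return the duration in seconds of each waveform (any sequence of samples)."""
    return [len(w) / sampling_rate for w in waveforms]

# Stand-in data: in the walkthrough this would be speech_outputs
fake_outputs = [[0.0] * 16000, [0.0] * 8000]
durations = chunk_durations(fake_outputs)

print(durations)       # [1.0, 0.5]
print(sum(durations))  # 1.5
```

A chunk whose audio is much shorter than its text suggests may be worth listening to individually before combining.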

4. Combining the Chunks into a Single Audio File

from pydub import AudioSegment

# Combine all audio chunks into one file
combined_audio = AudioSegment.empty()
for idx in range(len(text_chunks)):
  audio_chunk = AudioSegment.from_file(f"chunk_{idx}.wav")
  combined_audio += audio_chunk

# Export the combined audio
combined_audio.export("final_speech.wav", format="wav")
print("Full speech has been saved as 'final_speech.wav'")
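If you would rather not depend on pydub for this step, the chunks can also be joined with Python's built-in wave module, which makes it easy to add a short pause between chunks. This is a sketch under assumptions: it expects 16-bit PCM WAV files that all share the same format (believed to be what soundfile writes for .wav by default), and concat_wavs is a hypothetical helper. The demo generates two short silent clips so the block runs without any model output.

```python
import os
import tempfile
import wave

def concat_wavs(paths, out_path, gap_seconds=0.0):
    """Concatenate PCM WAV files of identical format, with optional silence between them."""
    params = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as wav:
            fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format: {fmt} != {params}")
            frames.append(wav.readframes(wav.getnframes()))
    if params is None:
        raise ValueError("no input files given")

    channels, sample_width, frame_rate = params
    # All-zero bytes are silence for signed PCM (16-bit and wider)
    silence = b"\x00" * (int(gap_seconds * frame_rate) * channels * sample_width)

    with wave.open(out_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(sample_width)
        out.setframerate(frame_rate)
        out.writeframes(silence.join(frames))

# Demo: two generated 0.1-second silent clips (16 kHz, mono, 16-bit)
demo_dir = tempfile.mkdtemp()
demo_paths = []
for i in range(2):
    path = os.path.join(demo_dir, f"clip_{i}.wav")
    with wave.open(path, "wb") as clip:
        clip.setnchannels(1)
        clip.setsampwidth(2)
        clip.setframerate(16000)
        clip.writeframes(b"\x00\x00" * 1600)
    demo_paths.append(path)

combined_path = os.path.join(demo_dir, "combined.wav")
concat_wavs(demo_paths, combined_path, gap_seconds=0.25)
with wave.open(combined_path, "rb") as combined:
    total_frames = combined.getnframes()
print(total_frames / 16000, "seconds")
```

For the files written in step 3, the call might look like concat_wavs([f"chunk_{i}.wav" for i in range(len(text_chunks))], "final_speech_gap.wav", gap_seconds=0.3), where the output name and gap length are arbitrary choices.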

5. Listening to the Final AI-Generated Audio

from IPython.display import Audio
# Load the saved audio file and play it
Audio("final_speech.wav")

Run the cell in a notebook and click the play button on the audio player that appears to hear the synthesized speech.

By leveraging Microsoft’s SpeechT5 model and chunking, you can efficiently convert extensive text into high-quality speech. This approach is ideal for audiobooks, storytelling, and creating personalized audio experiences.

Try it out and bring your text to life with the power of AI!