Text To Speech

The Text To Speech pipeline generates speech from text.

Example

The following shows a simple example using this pipeline.

import numpy as np

from txtai.pipeline import TextToSpeech

# Create and run pipeline
tts = TextToSpeech()
tts("Say something here")

# Stream audio - incrementally generates snippets of audio
# (yield from is only valid inside a generator function)
def stream():
    yield from tts(
        "Say something here. And say something else.".split(),
        stream=True
    )

# Generate audio using a speaker id
tts = TextToSpeech("neuml/vctk-vits-onnx")
tts("Say something here", speaker=15)

# Generate audio using speaker embeddings
tts = TextToSpeech("neuml/txtai-speecht5-onnx")
tts("Say something here", speaker=np.array(...))
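
The Methods section describes the result as audio paired with a sample rate. As a minimal sketch of what can be done with such a result, the snippet below writes a waveform to a WAV file using only the standard library. It assumes the audio is a sequence of float samples in the range [-1, 1]. A synthetic sine wave stands in for real pipeline output, and `save_wav` is a hypothetical helper, not part of txtai.

```python
import math
import os
import struct
import tempfile
import wave

def save_wav(path, samples, rate):
    """Write float samples in [-1, 1] as 16-bit mono PCM (hypothetical helper)."""
    # Clip each sample and scale it to the signed 16-bit range
    pcm = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    with wave.open(path, "wb") as f:
        f.setnchannels(1)       # mono
        f.setsampwidth(2)       # 2 bytes = 16 bits per sample
        f.setframerate(rate)
        f.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))

# Stand-in for pipeline output: one second of a 440 Hz tone at 22050 Hz
rate = 22050
audio = [0.5 * math.sin(2 * math.pi * 440 * i / rate) for i in range(rate)]

wav_path = os.path.join(tempfile.mkdtemp(), "speech.wav")
save_wav(wav_path, audio, rate)
```

If the pipeline returns a NumPy array, it can be passed to this helper directly because the helper only iterates over the samples.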

See the links below for a more detailed example.

Notebook	Description
Text to speech generation	Generate speech from text
Speech to Speech RAG ▶️	Full cycle speech to speech workflow with RAG
Generative Audio	Storytelling with generative audio workflows

This pipeline is backed by ONNX models from the Hugging Face Hub. The following models are currently available.

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
texttospeech:
# Run pipeline with workflow
workflow:
  tts:
    tasks:
      - action: texttospeech

Run with Workflows

from txtai import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("tts", ["Say something here"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"tts", "elements":["Say something here"]}'
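
The same request can be built from Python. The sketch below uses only the standard library and assumes the API server from the command above is listening on localhost:8000. `workflow_request` is a hypothetical helper name. Nothing is sent until `urlopen` is called, which is left commented out here.

```python
import json
from urllib.request import Request, urlopen

def workflow_request(name, elements, base="http://localhost:8000"):
    """Build a POST /workflow request (hypothetical helper mirroring the curl call)."""
    body = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return Request(
        f"{base}/workflow",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = workflow_request("tts", ["Say something here"])

# Sending requires the API server to be running:
# with urlopen(request) as response:
#     payload = response.read()
```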

Methods

Python documentation for the pipeline.

`__init__(path=None, maxtokens=512, rate=22050)`

Creates a new TextToSpeech pipeline.

Parameters:

Name	Description	Default
`path`	optional model path	`None`
`maxtokens`	maximum number of tokens model can process, defaults to 512	`512`
`rate`	target sample rate, defaults to 22050	`22050`

Source code in txtai/pipeline/audio/texttospeech.py

def __init__(self, path=None, maxtokens=512, rate=22050):
    """
    Creates a new TextToSpeech pipeline.
    Args:
        path: optional model path
        maxtokens: maximum number of tokens model can process, defaults to 512
        rate: target sample rate, defaults to 22050
    """
    if not TTS:
        raise ImportError('TextToSpeech pipeline is not available - install "pipeline" extra to enable')
    # Default path
    path = path if path else "neuml/ljspeech-jets-onnx"
    # Target sample rate
    self.rate = rate
    # Load target tts pipeline
    self.pipeline = ESPnet(path, maxtokens, self.providers()) if self.hasfile(path, "model.onnx") else SpeechT5(path, maxtokens, self.providers())

`__call__(text, stream=False, speaker=1)`

Generates speech from text. Text longer than maxtokens will be batched and returned as a single waveform per text input.

This method supports text as a string or a list. If the input is a string, the return type is audio. If text is a list, the return type is a list.

Parameters:

Name	Description	Default
`text`	text\|list	required
`stream`	stream response if True, defaults to False	`False`
`speaker`	speaker id, defaults to 1	`1`

Returns:

Type	Description
	list of (audio, sample rate)
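
The batching behavior described above can be pictured as follows. This is a rough sketch, not txtai's implementation: it treats whitespace-separated words as tokens, and `synthesize` is a hypothetical stand-in for the model call. Long input is split into chunks of at most `maxtokens` tokens, each chunk is synthesized, and the pieces are joined into one waveform.

```python
def chunks(tokens, maxtokens):
    """Yield consecutive slices of at most maxtokens tokens."""
    for start in range(0, len(tokens), maxtokens):
        yield tokens[start:start + maxtokens]

def synthesize(tokens):
    """Hypothetical model call: returns one fake sample per token."""
    return [float(len(token)) for token in tokens]

def speak(text, maxtokens=512):
    """Synthesize text chunk by chunk and join the results into one waveform."""
    audio = []
    for chunk in chunks(text.split(), maxtokens):
        audio.extend(synthesize(chunk))
    return audio

# Five words with maxtokens=2 are processed as chunks of 2, 2 and 1 tokens
waveform = speak("one two three four five", maxtokens=2)
```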

Source code in txtai/pipeline/audio/texttospeech.py

def __call__(self, text, stream=False, speaker=1):
    """
    Generates speech from text. Text longer than maxtokens will be batched and returned
    as a single waveform per text input.
    This method supports text as a string or a list. If the input is a string,
    the return type is audio. If text is a list, the return type is a list.
    Args:
        text: text|list
        stream: stream response if True, defaults to False
        speaker: speaker id, defaults to 1
    Returns:
        list of (audio, sample rate)
    """
    # Convert results to a list if necessary
    texts = [text] if isinstance(text, str) else text
    # Streaming response
    if stream:
        return self.stream(texts, speaker)
    # Transform text to speech
    results = [self.execute(x, speaker) for x in texts]
    # Return results
    return results[0] if isinstance(text, str) else results
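
When `stream` is true, the method returns the result of `self.stream` rather than a list, and the example at the top of this page consumes it with `yield from`. The sketch below shows that consumption pattern with a plain loop. `fake_stream` is a hypothetical stand-in generator, and the assumption that each streamed item is an (audio, sample rate) pair is taken from the Returns description, not verified against the streaming code.

```python
def fake_stream(texts):
    """Hypothetical stand-in for tts(texts, stream=True)."""
    for text in texts:
        # Fake audio: one silent sample per character, at the default rate
        yield [0.0] * len(text), 22050

received = []
for audio, rate in fake_stream("Say something here.".split()):
    # Each snippet can be handled as soon as it is produced,
    # without waiting for the remaining text
    received.append((len(audio), rate))
```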
