Text To Speech

The Text To Speech pipeline generates speech from text.

Example

The following shows a simple example using this pipeline.

    import numpy as np

    from txtai.pipeline import TextToSpeech

    # Create and run pipeline
    tts = TextToSpeech()
    tts("Say something here")

    # Stream audio - incrementally generates snippets of audio
    # (yield from is only valid inside a generator function)
    def stream():
        yield from tts(
            "Say something here. And say something else.".split(),
            stream=True
        )

    # Generate audio using a speaker id
    tts = TextToSpeech("neuml/vctk-vits-onnx")
    tts("Say something here", speaker=15)

    # Generate audio using speaker embeddings
    tts = TextToSpeech("neuml/txtai-speecht5-onnx")
    tts("Say something here", speaker=np.array(...))
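
The streaming call above returns results incrementally, so a caller can handle each snippet as soon as it is produced rather than waiting for the whole input. The sketch below illustrates that consumption pattern only. `fake_tts` is a hypothetical stand-in for the real pipeline, the `(audio, sample rate)` result shape follows the method documentation on this page, and the one-result-per-input-element behavior and the silent audio are assumptions made to keep the example self-contained.

```python
import types


def fake_tts(texts, rate=22050):
    """Hypothetical stand-in for tts(texts, stream=True): yields (audio, rate) lazily."""
    for text in texts:
        # Assumption: 0.1 seconds of silence per character, purely for illustration
        samples = [0.0] * (len(text) * rate // 10)
        yield samples, rate


def stream(texts):
    """Generator wrapper: yield from is only valid inside a function."""
    yield from fake_tts(texts)


durations = []
for audio, rate in stream("Say something here. And say something else.".split()):
    # Each snippet is available here before the next one is generated
    durations.append(len(audio) / rate)

print(len(durations))
```

Because `stream` is a generator, nothing is synthesized until the loop asks for the next item, which is what makes incremental playback possible.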

See the links below for a more detailed example.

Notebook                     Description
Text to speech generation    Generate speech from text
Speech to Speech RAG ▶️       Full cycle speech to speech workflow with RAG
Generative Audio             Storytelling with generative audio workflows

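This pipeline is backed by ONNX models from the Hugging Face Hub. The models referenced on this page are neuml/ljspeech-jets-onnx (the default), neuml/vctk-vits-onnx (used with a speaker id) and neuml/txtai-speecht5-onnx (used with speaker embeddings).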

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

    # Create pipeline using lower case class name
    texttospeech:

    # Run pipeline with workflow
    workflow:
      tts:
        tasks:
          - action: texttospeech

Run with Workflows

    from txtai import Application

    # Create and run pipeline with workflow
    app = Application("config.yml")
    list(app.workflow("tts", ["Say something here"]))

Run with API

    CONFIG=config.yml uvicorn "txtai.api:app" &

    curl \
      -X POST "http://localhost:8000/workflow" \
      -H "Content-Type: application/json" \
      -d '{"name":"tts", "elements":["Say something here"]}'
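
The same request can be issued from Python. The sketch below mirrors the curl command above using only the standard library; `build_request` and `run_workflow` are hypothetical helper names, and the response is returned as raw bytes because the response format is not described on this page.

```python
import json
import urllib.request


def build_request(name, elements, url="http://localhost:8000/workflow"):
    """Builds the same POST request as the curl command, without sending it."""
    body = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(name, elements):
    """Sends the request; the API server must already be running."""
    with urllib.request.urlopen(build_request(name, elements)) as response:
        return response.read()


# Building the request does not touch the network
request = build_request("tts", ["Say something here"])
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the payload easy to inspect before the server is involved.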

Methods

Python documentation for the pipeline.

__init__(path=None, maxtokens=512, rate=22050)

Creates a new TextToSpeech pipeline.

Parameters:

Name        Description                                                      Default
path        optional model path                                              None
maxtokens   maximum number of tokens model can process, defaults to 512      512
rate        target sample rate, defaults to 22050                            22050

Source code in txtai/pipeline/audio/texttospeech.py

    def __init__(self, path=None, maxtokens=512, rate=22050):
        """
        Creates a new TextToSpeech pipeline.

        Args:
            path: optional model path
            maxtokens: maximum number of tokens model can process, defaults to 512
            rate: target sample rate, defaults to 22050
        """

        if not TTS:
            raise ImportError('TextToSpeech pipeline is not available - install "pipeline" extra to enable')

        # Default path
        path = path if path else "neuml/ljspeech-jets-onnx"

        # Target sample rate
        self.rate = rate

        # Load target tts pipeline
        self.pipeline = ESPnet(path, maxtokens, self.providers()) if self.hasfile(path, "model.onnx") else SpeechT5(path, maxtokens, self.providers())

__call__(text, stream=False, speaker=1)

Generates speech from text. Text longer than maxtokens will be batched and returned as a single waveform per text input.

This method supports text as a string or a list. If the input is a string, the return type is audio. If text is a list, the return type is a list.
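
One way to picture the batching described above is to split the tokenized input into windows of at most `maxtokens`, synthesize each window, then join the pieces into one waveform. This is a minimal sketch of that idea, not the pipeline's actual implementation. `batch`, `synthesize` and `execute` are hypothetical names, and whitespace tokenization and silent audio are assumptions made for illustration.

```python
def batch(tokens, maxtokens):
    """Splits tokens into consecutive windows of at most maxtokens."""
    return [tokens[i : i + maxtokens] for i in range(0, len(tokens), maxtokens)]


def synthesize(tokens):
    """Hypothetical model call: returns 100 silent samples per token."""
    return [0.0] * (100 * len(tokens))


def execute(text, maxtokens=512, rate=22050):
    """Generates a single waveform per text input, however many windows it needs."""
    tokens = text.split()  # assumption: real tokenizers are model specific
    audio = []
    for window in batch(tokens, maxtokens):
        # Append each window's audio so the caller sees one continuous waveform
        audio.extend(synthesize(window))
    return audio, rate


windows = batch("one two three four five".split(), 2)
audio, rate = execute("one two three four five", maxtokens=2)
print(len(windows), len(audio), rate)
```

The caller never sees the windows: five tokens with `maxtokens=2` are processed in three passes but come back as one `(audio, rate)` pair.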

Parameters:

Name      Description                                   Default
text      text|list                                     required
stream    stream response if True, defaults to False    False
speaker   speaker id, defaults to 1                     1

Returns:

list of (audio, sample rate)
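
Each result pairs the raw audio with its sample rate, which is the information needed to write a playable file. The sketch below writes such a pair to a 16-bit mono WAV file using only the standard library. It assumes the audio is a sequence of floats in the range [-1, 1]; a generated sine tone stands in for real pipeline output, and `write_wav` is a hypothetical helper, not part of txtai.

```python
import array
import math
import os
import sys
import tempfile
import wave


def write_wav(path, audio, rate):
    """Writes float samples in [-1, 1] as 16-bit mono PCM."""
    # Clip each sample, then scale to the signed 16-bit range
    pcm = array.array("h", (int(max(-1.0, min(1.0, x)) * 32767) for x in audio))

    # WAV data is little-endian; swap bytes on big-endian machines
    if sys.byteorder == "big":
        pcm.byteswap()

    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(pcm.tobytes())


# Stand-in for pipeline output: one second of a 440 Hz tone at the default rate
rate = 22050
audio = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]

path = os.path.join(tempfile.mkdtemp(), "speech.wav")
write_wav(path, audio, rate)

# Read the header back to confirm what was written
with wave.open(path, "rb") as f:
    framerate, frames, width, channels = f.getframerate(), f.getnframes(), f.getsampwidth(), f.getnchannels()

print(framerate, frames)
```

Passing the returned sample rate through unchanged matters: writing the file with a different rate would shift the pitch and speed of playback.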

Source code in txtai/pipeline/audio/texttospeech.py

    def __call__(self, text, stream=False, speaker=1):
        """
        Generates speech from text. Text longer than maxtokens will be batched and returned
        as a single waveform per text input.

        This method supports text as a string or a list. If the input is a string,
        the return type is audio. If text is a list, the return type is a list.

        Args:
            text: text|list
            stream: stream response if True, defaults to False
            speaker: speaker id, defaults to 1

        Returns:
            list of (audio, sample rate)
        """

        # Convert results to a list if necessary
        texts = [text] if isinstance(text, str) else text

        # Streaming response
        if stream:
            return self.stream(texts, speaker)

        # Transform text to speech
        results = [self.execute(x, speaker) for x in texts]

        # Return results
        return results[0] if isinstance(text, str) else results