Translation
The Translation pipeline translates text between languages. It supports more than 100 languages and has built-in automatic source language detection. The pipeline detects the language of each input text row, loads a model for the source-target combination and translates the text to the target language.
Example
The following shows a simple example using this pipeline.
from txtai.pipeline import Translation
# Create and run pipeline
translate = Translation()
translate("This is a test translation into Spanish", "es")
See the link below for a more detailed example.
Notebook | Description |
---|---|
Translate text between languages | Streamline machine translation and language detection |
Configuration-driven example
Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.
config.yml
# Create pipeline using lower case class name
translation:

# Run pipeline with workflow
workflow:
  translate:
    tasks:
      - action: translation
        args: ["es"]
Run with Workflows
from txtai.app import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("translate", ["This is a test translation into Spanish"]))
Run with API
CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
-X POST "http://localhost:8000/workflow" \
-H "Content-Type: application/json" \
-d '{"name":"translate", "elements":["This is a test translation into Spanish"]}'
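The same request can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the API server started above is listening on `http://localhost:8000`; the helper names `build_workflow_request` and `run_workflow` are hypothetical and not part of txtai.

```python
import json
import urllib.request


def build_workflow_request(name, elements, url="http://localhost:8000/workflow"):
    """
    Builds a POST request with the same JSON body as the curl command.
    """
    payload = json.dumps({"name": name, "elements": elements}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(name, elements, url="http://localhost:8000/workflow"):
    """
    Sends the request and returns the decoded JSON response.
    Requires the API server to be running at the given URL.
    """
    request = build_workflow_request(name, elements, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `run_workflow("translate", ["This is a test translation into Spanish"])` would send the request and return the parsed response.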
Methods
Python documentation for the pipeline.
Constructs a new language translation pipeline.
Parameters:
Name | Description | Default |
---|---|---|
path | optional path to model, accepts Hugging Face model hub id or local path, uses default model for task if not provided | None |
quantize | if model should be quantized, defaults to False | False |
gpu | True/False if GPU should be enabled, also supports a GPU device id | True |
batch | batch size used to incrementally process content | 64 |
langdetect | set a custom language detection function, method must take a list of strings and return language codes for each, uses default language detector if not provided | None |
findmodels | True/False if the Hugging Face Hub will be searched for source-target translation models | True |
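The `langdetect` parameter takes a callable that receives a list of strings and returns one language code per string. The following is a minimal sketch of such a callable, assuming a toy heuristic based on Unicode character names. The function `script_langdetect` and its script-to-language mapping are hypothetical, are not part of txtai, and are far too coarse for real use.

```python
import unicodedata

# Hypothetical mapping from Unicode script keywords to language codes.
# A script does not determine a language; this is for illustration only.
SCRIPTS = {"CYRILLIC": "ru", "HIRAGANA": "ja", "KATAKANA": "ja", "HANGUL": "ko", "ARABIC": "ar"}


def script_langdetect(texts, default="en"):
    """
    Toy detector: returns one language code per input string, based on the
    first alphabetic character whose Unicode name matches a known script.
    Falls back to default when no character matches.
    """
    codes = []
    for text in texts:
        code = default
        for char in text:
            if not char.isalpha():
                continue
            name = unicodedata.name(char, "")
            match = next((lang for script, lang in SCRIPTS.items() if script in name), None)
            if match:
                code = match
                break
        codes.append(code)
    return codes
```

Under these assumptions, the function could be supplied as `Translation(langdetect=script_langdetect)`.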
Source code in txtai/pipeline/text/translation.py
def __init__(self, path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):
    """
    Constructs a new language translation pipeline.

    Args:
        path: optional path to model, accepts Hugging Face model hub id or local path,
              uses default model for task if not provided
        quantize: if model should be quantized, defaults to False
        gpu: True/False if GPU should be enabled, also supports a GPU device id
        batch: batch size used to incrementally process content
        langdetect: set a custom language detection function, method must take a list of strings and return
                    language codes for each, uses default language detector if not provided
        findmodels: True/False if the Hugging Face Hub will be searched for source-target translation models
    """

    # Call parent constructor
    super().__init__(path if path else "facebook/m2m100_418M", quantize, gpu, batch)

    # Language detection
    self.detector = None
    self.langdetect = langdetect
    self.findmodels = findmodels

    # Language models
    self.models = {}
    self.ids = self.modelids()
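The constructor keeps a cache of models (`self.models`), a collection of model ids (`self.ids`) and the `findmodels` flag. The following is a rough sketch of how a source-target lookup with a fallback to the default multilingual model could work. It is not the library's implementation: `resolve_model`, the `available` set and the `example-org/translate-{source}-{target}` naming template are all hypothetical, and model ids stand in for loaded models.

```python
DEFAULT_MODEL = "facebook/m2m100_418M"  # default path used by the constructor


def resolve_model(source, target, available, cache, findmodels=True,
                  template="example-org/translate-{source}-{target}"):
    """
    Returns a model id for a source-target pair.

    Prefers a pair-specific model when findmodels is enabled and the candidate
    id is in the available set, otherwise falls back to the default model.
    Results are cached per (source, target) pair.
    """
    key = (source, target)
    if key not in cache:
        # Hypothetical naming scheme for pair-specific models
        candidate = template.format(source=source, target=target)
        cache[key] = candidate if findmodels and candidate in available else DEFAULT_MODEL
    return cache[key]
```

Caching by language pair means each pair is resolved once, no matter how many rows share it.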
Translates text from source language into target language.
This method accepts texts as a string or a list. If the input is a string, the return type is a string. If the input is a list, the return type is a list.
Parameters:
Name | Description | Default |
---|---|---|
texts | text or list | required |
target | target language code, defaults to "en" | "en" |
source | source language code, detects language if not provided | None |
showmodels | if True, each result is returned as a (translated text, source language, model) tuple | False |
Returns:
Description |
---|
list of translated text |
Source code in txtai/pipeline/text/translation.py
def __call__(self, texts, target="en", source=None, showmodels=False):
    """
    Translates text from source language into target language.

    This method supports texts as a string or a list. If the input is a string,
    the return type is string. If text is a list, the return type is a list.

    Args:
        texts: text|list
        target: target language code, defaults to "en"
        source: source language code, detects language if not provided

    Returns:
        list of translated text
    """

    values = [texts] if not isinstance(texts, list) else texts

    # Detect source languages
    languages = self.detect(values) if not source else [source] * len(values)
    unique = set(languages)

    # Build a dict from language to list of (index, text)
    langdict = {}
    for x, lang in enumerate(languages):
        if lang not in langdict:
            langdict[lang] = []
        langdict[lang].append((x, values[x]))

    results = {}
    for language in unique:
        # Get all indices and text values for a language
        inputs = langdict[language]

        # Translate text in batches
        outputs = []
        for chunk in self.batch([text for _, text in inputs], self.batchsize):
            outputs.extend(self.translate(chunk, language, target, showmodels))

        # Store output value
        for y, (x, _) in enumerate(inputs):
            if showmodels:
                model, op = outputs[y]
                results[x] = (op.strip(), language, model)
            else:
                results[x] = outputs[y].strip()

    # Return results in same order as input
    results = [results[x] for x in sorted(results)]
    return results[0] if isinstance(texts, str) else results
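The core of this method is a group-by-language step followed by restoring the original input order. The following is a simplified, standalone sketch of that pattern, with batching and `showmodels` left out. `translate_grouped` and `fake_translate` are hypothetical names, and `fake_translate` is a stand-in for a real model call.

```python
from collections import defaultdict


def translate_grouped(texts, languages, translate):
    """
    Groups texts by language, translates each group with a single call
    and returns the outputs in the original input order.
    """
    values = [texts] if isinstance(texts, str) else texts
    languages = [languages] if isinstance(languages, str) else languages

    # Map each language to the (index, text) pairs that use it
    groups = defaultdict(list)
    for index, (language, text) in enumerate(zip(languages, values)):
        groups[language].append((index, text))

    # Translate one language group at a time, keyed by original index
    results = {}
    for language, inputs in groups.items():
        outputs = translate([text for _, text in inputs], language)
        for (index, _), output in zip(inputs, outputs):
            results[index] = output.strip()

    # Restore input order; a string input yields a string output
    ordered = [results[index] for index in sorted(results)]
    return ordered[0] if isinstance(texts, str) else ordered


def fake_translate(chunk, language):
    # Hypothetical stand-in for a model call: tags each text with its source language
    return [f"[{language}] {text} " for text in chunk]
```

Grouping lets each language's model run over all of its rows together, while the index stored with each text keeps the output aligned with the input.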