Translation

The Translation pipeline translates text between languages. It supports more than 100 languages, with automatic source language detection built in. For each input text row, the pipeline detects the language, loads a model for the source-target combination and translates the text to the target language.

Example

The following shows a simple example using this pipeline.

from txtai.pipeline import Translation

# Create and run pipeline
translate = Translation()
translate("This is a test translation into Spanish", "es")

See the link below for a more detailed example.

Notebook	Description
Translate text between languages	Streamline machine translation and language detection

Configuration-driven example

Pipelines can be run from Python or from configuration. In configuration, a pipeline is instantiated using the lowercase name of its class. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Create pipeline using lower case class name
translation:

# Run pipeline with workflow
workflow:
  translate:
    tasks:
      - action: translation
        args: ["es"]

Run with Workflows

from txtai.app import Application

# Create and run pipeline with workflow
app = Application("config.yml")
list(app.workflow("translate", ["This is a test translation into Spanish"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name":"translate", "elements":["This is a test translation into Spanish"]}'
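
The same request can also be issued from Python. The following is a minimal sketch using only the standard library, assuming the API server started above is listening on `localhost:8000`. The helper names `build_request` and `run_workflow` are hypothetical and are not part of txtai.

```python
import json
from urllib import request


def build_request(texts, name="translate", url="http://localhost:8000/workflow"):
    """Build a POST request that mirrors the curl command above."""
    payload = json.dumps({"name": name, "elements": texts}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_workflow(texts):
    """Send the request and decode the JSON response (requires a running server)."""
    with request.urlopen(build_request(texts)) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the server
req = build_request(["This is a test translation into Spanish"])
```

Calling `run_workflow([...])` only works while the API process is running; `build_request` can be inspected without a server.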

Methods

Python documentation for the pipeline.

__init__(path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True)

Constructs a new language translation pipeline.

Parameters:

Name	Description	Default
`path`	optional path to model, accepts Hugging Face model hub id or local path, uses default model for task if not provided	`None`
`quantize`	if model should be quantized, defaults to False	`False`
`gpu`	True/False if GPU should be enabled, also supports a GPU device id	`True`
`batch`	batch size used to incrementally process content	`64`
`langdetect`	set a custom language detection function, method must take a list of strings and return language codes for each, uses default language detector if not provided	`None`
`findmodels`	True/False if the Hugging Face Hub will be searched for source-target translation models	`True`
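
As an illustration of the `langdetect` parameter, the sketch below shows a function with the required shape: it takes a list of strings and returns one language code per string. The script-based heuristic and the function name `detect` are assumptions made for this example; it is far too crude for real use and is not how the default detector works.

```python
import unicodedata


def detect(texts):
    """Hypothetical detector: guess a language code from the script of each text."""
    codes = []
    for text in texts:
        # Unicode character names start with the script, e.g. "CYRILLIC SMALL LETTER PE"
        names = [unicodedata.name(char, "") for char in text if char.isalpha()]
        if any(name.startswith("CYRILLIC") for name in names):
            codes.append("ru")
        elif any(name.startswith("CJK") for name in names):
            codes.append("zh")
        else:
            # Fallback assumption: everything else is treated as English
            codes.append("en")
    return codes


codes = detect(["Hello world", "Привет мир", "你好世界"])

# Usage with the pipeline (requires txtai and a model download):
# translate = Translation(langdetect=detect)
```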

Source code in txtai/pipeline/text/translation.py

def __init__(self, path=None, quantize=False, gpu=True, batch=64, langdetect=None, findmodels=True):
    """
    Constructs a new language translation pipeline.
    Args:
        path: optional path to model, accepts Hugging Face model hub id or local path,
              uses default model for task if not provided
        quantize: if model should be quantized, defaults to False
        gpu: True/False if GPU should be enabled, also supports a GPU device id
        batch: batch size used to incrementally process content
        langdetect: set a custom language detection function, method must take a list of strings and return
                    language codes for each, uses default language detector if not provided
        findmodels: True/False if the Hugging Face Hub will be searched for source-target translation models
    """
    # Call parent constructor
    super().__init__(path if path else "facebook/m2m100_418M", quantize, gpu, batch)
    # Language detection
    self.detector = None
    self.langdetect = langdetect
    self.findmodels = findmodels
    # Language models
    self.models = {}
    self.ids = self.modelids()
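
The constructor above falls back to `facebook/m2m100_418M` when no `path` is given and stores the `findmodels` flag. This page does not show how a model is then chosen for a source-target pair, so the following is only a sketch of one plausible lookup. The `select_model` function, the `available` set and the use of the `Helsinki-NLP/opus-mt-{source}-{target}` naming pattern are assumptions for illustration, not the library's confirmed logic.

```python
DEFAULT_MODEL = "facebook/m2m100_418M"


def select_model(source, target, available, findmodels=True):
    """Hypothetical lookup: prefer a pair-specific model, else use the default."""
    # Assumed naming pattern for pair-specific models on the Hugging Face Hub
    candidate = f"Helsinki-NLP/opus-mt-{source}-{target}"
    if findmodels and candidate in available:
        return candidate

    # No pair-specific model found (or search disabled): use the multilingual default
    return DEFAULT_MODEL


# Hypothetical set of model ids found on the Hub
available = {"Helsinki-NLP/opus-mt-en-es", "Helsinki-NLP/opus-mt-es-en"}

pair_model = select_model("en", "es", available)
fallback_model = select_model("en", "xx", available)
no_search_model = select_model("en", "es", available, findmodels=False)
```

Under these assumptions, setting `findmodels=False` always routes translation through the default model, which avoids a Hub search at the cost of not using pair-specific models.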

__call__(texts, target="en", source=None, showmodels=False)

Translates text from source language into target language.

This method accepts texts as a string or a list. If the input is a string, the return type is a string. If the input is a list, the return type is a list.

Parameters:

Name	Description	Default
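`texts`	text string or list of text strings to translate	required
`target`	target language code, defaults to "en"	`"en"`
`source`	source language code, detects language if not provided	`None`
`showmodels`	if True, each result is a tuple of (translated text, source language, model) instead of the translated text alone	`False`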

Returns:

Type	Description
str or list	translated text as a string when the input is a string, otherwise a list of translated text

Source code in txtai/pipeline/text/translation.py

def __call__(self, texts, target="en", source=None, showmodels=False):
    """
    Translates text from source language into target language.
    This method supports texts as a string or a list. If the input is a string,
    the return type is string. If text is a list, the return type is a list.
    Args:
        texts: text|list
        target: target language code, defaults to "en"
        source: source language code, detects language if not provided
    Returns:
        list of translated text
    """
    values = [texts] if not isinstance(texts, list) else texts
    # Detect source languages
    languages = self.detect(values) if not source else [source] * len(values)
    unique = set(languages)
    # Build a dict from language to list of (index, text)
    langdict = {}
    for x, lang in enumerate(languages):
        if lang not in langdict:
            langdict[lang] = []
        langdict[lang].append((x, values[x]))
    results = {}
    for language in unique:
        # Get all indices and text values for a language
        inputs = langdict[language]
        # Translate text in batches
        outputs = []
        for chunk in self.batch([text for _, text in inputs], self.batchsize):
            outputs.extend(self.translate(chunk, language, target, showmodels))
        # Store output value
        for y, (x, _) in enumerate(inputs):
            if showmodels:
                model, op = outputs[y]
                results[x] = (op.strip(), language, model)
            else:
                results[x] = outputs[y].strip()
    # Return results in same order as input
    results = [results[x] for x in sorted(results)]
    return results[0] if isinstance(texts, str) else results
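
The control flow of `__call__` can be tried without loading any model. The sketch below mirrors the steps in the source above (wrap a single string in a list, group rows by language, translate each group in batches, restore input order, unwrap a single string). The names `translate_all`, `fake_detect` and `fake_translate` are hypothetical stand-ins, and the `showmodels` branch is omitted.

```python
def batch(items, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_all(texts, detect, translate, target="en", batchsize=2):
    # Accept a single string or a list of strings
    values = [texts] if not isinstance(texts, list) else texts
    languages = detect(values)

    # Group (index, text) pairs by detected language
    groups = {}
    for index, language in enumerate(languages):
        groups.setdefault(language, []).append((index, values[index]))

    results = {}
    for language, inputs in groups.items():
        # Translate each language group in batches
        outputs = []
        for chunk in batch([text for _, text in inputs], batchsize):
            outputs.extend(translate(chunk, language, target))

        # Map outputs back to their original positions
        for (index, _), output in zip(inputs, outputs):
            results[index] = output.strip()

    # Return results in input order, unwrapping single string input
    ordered = [results[index] for index in sorted(results)]
    return ordered[0] if isinstance(texts, str) else ordered


# Hypothetical stand-ins for language detection and model inference
def fake_detect(texts):
    return ["es" if "hola" in text.lower() else "fr" for text in texts]


def fake_translate(chunk, source, target):
    return [f"[{source}->{target}] {text}" for text in chunk]


output = translate_all(["Hola mundo", "Bonjour", "Hola amigo"], fake_detect, fake_translate)
single = translate_all("Bonjour", fake_detect, fake_translate)
```

Because results are keyed by the original row index, rows in different languages can be translated group by group and still come back in the order they were passed in.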