Extractor

The Extractor pipeline joins a prompt, context data store and generative model together to extract knowledge.

The data store can be an embeddings database or a similarity instance with associated input text. The generative model can be a prompt-driven large language model (LLM), an extractive question-answering model or a custom pipeline. Combining retrieval with generation in this way is known as prompt-driven search or retrieval augmented generation (RAG).
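As a rough, model-free sketch of the idea, the snippet below retrieves the best matching text for a question and places it after the prompt. The word-overlap scorer stands in for an embeddings search, and the `overlap` and `build_prompt` names, as well as the way the question and context are joined, are assumptions made for illustration rather than txtai's implementation.

```python
def overlap(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    words = set(query.lower().split())
    return len(words & set(text.lower().split())) / max(len(words), 1)

def build_prompt(question, data, topn=3):
    # Retrieve: rank stored text by the toy score and keep the best matches
    ranked = sorted(data, key=lambda text: overlap(question, text), reverse=True)
    context = " ".join(ranked[:topn])

    # Augment: place the retrieved context after the question (assumed layout)
    return f"Question:\n{question}\nContext:\n{context}"

data = [
    "Maine man wins $1M from $25 lottery ticket",
    "US tops 5 million confirmed virus cases",
]

# The resulting string is what a generative model would receive
prompt = build_prompt("Who wins the lottery", data, topn=1)
print(prompt)
```

In the real pipeline, the retrieval step is handled by the embeddings or similarity instance and the generation step by the configured model.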

Example

The following shows a simple example using this pipeline.

from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor
# LLM prompt
def prompt(question):
  return f"""
  Answer the following question using the provided context.
  Question:
  {question}
  Context:
  """
# Input data
data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]
# Build embeddings index
embeddings = Embeddings({"content": True})
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])
# Create and run pipeline
extractor = Extractor(embeddings, "google/flan-t5-base")
extractor([{"query": "What was won?", "question": prompt("What was won?")}])

See the links below for more detailed examples.

Notebook	Description
Prompt-driven search with LLMs	Embeddings-guided and Prompt-driven search with Large Language Models (LLMs)
Prompt templates and task chains	Build model prompts and connect tasks together with workflows
Extractive QA with txtai	Introduction to extractive question-answering with txtai
Extractive QA with Elasticsearch	Run extractive question-answering queries with Elasticsearch
Extractive QA to build structured data	Build structured datasets using extractive question-answering

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

# Allow documents to be indexed
writable: True
# Content is required for extractor pipeline
embeddings:
  content: True
extractor:
  path: google/flan-t5-base
workflow:
  search:
    tasks:
      - task: extractor
        template: |
          Answer the following question using the provided context.
          Question:
          {text}
          Context:
        action: extractor
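The `{text}` placeholder in the template is filled with each workflow input. A minimal sketch of that substitution follows, assuming it behaves like `str.format` and that the task produces the same `query`/`question` dictionary built by hand in the Python example. The `prepare` helper is hypothetical and only illustrates the shape of the data.

```python
template = """Answer the following question using the provided context.
Question:
{text}
Context:
"""

def prepare(element):
    # Hypothetical helper: the raw input is the search query and
    # the filled template is the question passed to the model
    return {"query": element, "question": template.format(text=element)}

row = prepare("What was won?")
print(row["question"])
```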

Run with Workflows

Built-in tasks make using the extractor pipeline easier.

from txtai.app import Application
# Create and run pipeline with workflow
app = Application("config.yml")
app.add([
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
])
app.index()
list(app.workflow("search", ["What was won?"]))

Run with API

CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
  -X POST "http://localhost:8000/workflow" \
  -H "Content-Type: application/json" \
  -d '{"name": "search", "elements": ["What was won?"]}'

Methods

Python documentation for the pipeline.

__init__

Builds a new extractor.

Parameters:

Name	Description	Default
`similarity`	similarity instance (embeddings or similarity pipeline)	required
`path`	path to model, supports Questions, Generator, Sequences or custom pipeline	required
`quantize`	True if model should be quantized before inference, False otherwise.	`False`
`gpu`	if gpu inference should be used (only works if GPUs are available)	`True`
`model`	optional existing pipeline model to wrap	`None`
`tokenizer`	Tokenizer class	`None`
`minscore`	minimum score to include context match, defaults to None	`None`
`mintokens`	minimum number of tokens to include context match, defaults to None	`None`
`context`	topn context matches to include, defaults to 3	`None`
`task`	model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect	`None`
`output`	output format, 'default' returns (name, answer), 'flatten' returns answers and 'reference' returns (name, answer, reference)	`'default'`
`kwargs`	additional keyword arguments to pass to pipeline model	`{}`

Source code in txtai/pipeline/text/extractor.py

def __init__(
    self,
    similarity,
    path,
    quantize=False,
    gpu=True,
    model=None,
    tokenizer=None,
    minscore=None,
    mintokens=None,
    context=None,
    task=None,
    output="default",
    **kwargs,
):
    """
    Builds a new extractor.
    Args:
        similarity: similarity instance (embeddings or similarity pipeline)
        path: path to model, supports Questions, Generator, Sequences or custom pipeline
        quantize: True if model should be quantized before inference, False otherwise.
        gpu: if gpu inference should be used (only works if GPUs are available)
        model: optional existing pipeline model to wrap
        tokenizer: Tokenizer class
        minscore: minimum score to include context match, defaults to None
        mintokens: minimum number of tokens to include context match, defaults to None
        context: topn context matches to include, defaults to 3
        task: model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect
        output: output format, 'default' returns (name, answer), 'flatten' returns answers and 'reference' returns (name, answer, reference)
        kwargs: additional keyword arguments to pass to pipeline model
    """
    # Similarity instance
    self.similarity = similarity
    # Question-Answer model. Can be prompt-driven LLM or extractive qa
    self.model = self.load(path, quantize, gpu, model, task, **kwargs)
    # Tokenizer class use default method if not set
    self.tokenizer = tokenizer if tokenizer else Tokenizer() if hasattr(self.similarity, "scoring") and self.similarity.scoring else None
    # Minimum score to include context match
    self.minscore = minscore if minscore is not None else 0.0
    # Minimum number of tokens to include context match
    self.mintokens = mintokens if mintokens is not None else 0.0
    # Top n context matches to include for context
    self.context = context if context else 3
    # Output format
    self.output = output
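The defaults are resolved in two different ways: `minscore` and `mintokens` are checked against `None`, while `context` uses a truthiness check. The sketch below copies those three expressions into a standalone `resolve` function (a name made up for this illustration) to show the practical effect: passing `context=0` falls back to 3, whereas `minscore=0` is kept.

```python
def resolve(minscore=None, mintokens=None, context=None):
    # Same expressions as the constructor above
    return (
        minscore if minscore is not None else 0.0,
        mintokens if mintokens is not None else 0.0,
        context if context else 3,
    )

print(resolve())                         # (0.0, 0.0, 3)
print(resolve(minscore=0.5, context=5))  # (0.5, 0.0, 5)
print(resolve(minscore=0, context=0))    # (0, 0.0, 3)
```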

__call__

Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context. A model is then run against the context for each input question, with the answer returned.

Parameters:

Name	Type	Description	Default
`queue`		input question queue (name, query, question, snippet), can be list of tuples or dicts	required
`texts`		optional list of text for context, otherwise runs embeddings search	`None`

Returns:

Type	Description
	list of answers matching input format (tuple or dict) containing fields as specified by output format
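Dictionary inputs are converted to `(name, query, question, snippet)` tuples before processing, with missing keys set to `None`. The sketch below isolates that conversion from `__call__` in a standalone `normalize` function, a name chosen here for illustration only.

```python
def normalize(queue):
    # Convert dict rows to tuples; tuple rows pass through unchanged
    if queue and isinstance(queue[0], dict):
        queue = [tuple(row.get(x) for x in ["name", "query", "question", "snippet"]) for row in queue]
    return queue

rows = normalize([{"query": "What was won?", "question": "What was won?"}])
print(rows)  # [(None, 'What was won?', 'What was won?', None)]
```

This is why the example earlier on this page can pass only `query` and `question`: the `name` and `snippet` fields are optional for dictionary inputs.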

Source code in txtai/pipeline/text/extractor.py

def __call__(self, queue, texts=None):
    """
    Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context.
    A model is then run against the context for each input question, with the answer returned.
    Args:
        queue: input question queue (name, query, question, snippet), can be list of tuples or dicts
        texts: optional list of text for context, otherwise runs embeddings search
    Returns:
        list of answers matching input format (tuple or dict) containing fields as specified by output format
    """
    # Save original queue format
    inputs = queue
    # Convert dictionary inputs to tuples
    if queue and isinstance(queue[0], dict):
        # Convert dict to tuple
        queue = [tuple(row.get(x) for x in ["name", "query", "question", "snippet"]) for row in queue]
    # Rank texts by similarity for each query
    results = self.query([query for _, query, _, _ in queue], texts)
    # Build question-context pairs
    names, queries, questions, contexts, topns, snippets = [], [], [], [], [], []
    for x, (name, query, question, snippet) in enumerate(queue):
        # Get top n best matching segments
        topn = sorted(results[x], key=lambda y: y[2], reverse=True)[: self.context]
        # Generate context using ordering from texts, if available, otherwise order by score
        context = " ".join(text for _, text, _ in (sorted(topn, key=lambda y: y[0]) if texts else topn))
        names.append(name)
        queries.append(query)
        questions.append(question)
        contexts.append(context)
        topns.append(topn)
        snippets.append(snippet)
    # Run pipeline and return answers
    answers = self.answers(names, questions, contexts, [[text for _, text, _ in topn] for topn in topns], snippets)
    # Apply output formatting to answers and return
    return self.apply(inputs, queries, answers, topns)
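The context-building step can be tried in isolation. The sketch below reuses the two sorting expressions from the loop above on hand-written `(id, text, score)` tuples. The `build_context` function and its `ordered` flag (standing in for the `texts` check) are illustrative names, not part of the library.

```python
def build_context(results, context=3, ordered=False):
    # Keep the top n matches by score
    topn = sorted(results, key=lambda y: y[2], reverse=True)[:context]

    # Restore the original text order when requested, otherwise keep score order
    return " ".join(text for _, text, _ in (sorted(topn, key=lambda y: y[0]) if ordered else topn))

results = [(0, "first", 0.2), (1, "second", 0.9), (2, "third", 0.5), (3, "fourth", 0.1)]

print(build_context(results))                # second third first
print(build_context(results, ordered=True))  # first second third
print(build_context(results, context=2))     # second third
```

When `texts` is passed to the pipeline, the selected segments are joined in their original order, which keeps passages from a single document readable. With an embeddings search, they are joined from best to worst match.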