Extractor
The Extractor pipeline is a combination of a similarity instance (embeddings or similarity pipeline) to build a question context and a model that answers questions. The model can be a prompt-driven large language model (LLM), an extractive question-answering model or a custom pipeline.
Example
The following shows a simple example using this pipeline.
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Embeddings model ranks candidates before passing to QA pipeline
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Create and run pipeline
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")
extractor([("What was won", "What was won", "What was won", False)],
          ["Maine man wins $1M from $25 lottery ticket"])
See the links below for more detailed examples.
Notebook | Description
---|---
Extractive QA with txtai | Introduction to extractive question-answering with txtai
Extractive QA with Elasticsearch | Run extractive question-answering queries with Elasticsearch
Extractive QA to build structured data | Build structured datasets using extractive question-answering
Prompt-driven search with LLMs | Embeddings-guided and Prompt-driven search with Large Language Models (LLMs)
Configuration-driven example
Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.
config.yml
# Create pipeline using lower case class name
extractor:
    path: distilbert-base-cased-distilled-squad
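Because the extractor needs a similarity instance to build the question context, a configuration typically also declares an embeddings section. A minimal sketch, assuming the same models as the Python example above (keys may vary by txtai version):

```yaml
# Embeddings model used as the similarity instance (assumed, mirrors the Python example)
embeddings:
    path: sentence-transformers/nli-mpnet-base-v2

# Create pipeline using lower case class name
extractor:
    path: distilbert-base-cased-distilled-squad
```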
Run with Workflows
from txtai.app import Application
# Create and run pipeline with workflow
app = Application("config.yml")
list(app.extract([{"name": "What was won", "query": "What was won",
"question", "What was won", "snippet": False}],
["Maine man wins $1M from $25 lottery ticket"]))
Run with API
CONFIG=config.yml uvicorn "txtai.api:app" &
curl \
-X POST "http://localhost:8000/extract" \
-H "Content-Type: application/json" \
-d '{"queue": [{"name":"What was won", "query": "What was won", "question": "What was won", "snippet": false}], "texts": ["Maine man wins $1M from $25 lottery ticket"]}'
Methods
Python documentation for the pipeline.
__init__(self, similarity, path, quantize=False, gpu=True, model=None, tokenizer=None, minscore=None, mintokens=None, context=None, task=None, output='default')
special
Builds a new extractor.
Parameters:
Name | Description | Default
---|---|---
similarity | similarity instance (embeddings or similarity pipeline) | required
path | path to model, supports Questions, Generator, Sequences or custom pipeline | required
quantize | True if model should be quantized before inference, False otherwise | False
gpu | if gpu inference should be used (only works if GPUs are available) | True
model | optional existing pipeline model to wrap | None
tokenizer | Tokenizer class | None
minscore | minimum score to include context match, defaults to None | None
mintokens | minimum number of tokens to include context match, defaults to None | None
context | topn context matches to include, defaults to 3 | None
task | model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect | None
output | output format, 'default' returns (name, answer), 'flatten' returns answers and 'reference' returns (name, answer, reference) | 'default'
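As a construction sketch using the parameters above (model paths reuse the earlier example and are illustrative), the following flattens answers and widens the context window to the top 5 matches:

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

# Similarity instance used to rank candidate text segments
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})

# Return plain answers ("flatten") and build context from the top 5 matches
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad",
                      context=5, output="flatten")
```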
Source code in txtai/pipeline/text/extractor.py
def __init__(
    self,
    similarity,
    path,
    quantize=False,
    gpu=True,
    model=None,
    tokenizer=None,
    minscore=None,
    mintokens=None,
    context=None,
    task=None,
    output="default",
):
    """
    Builds a new extractor.

    Args:
        similarity: similarity instance (embeddings or similarity pipeline)
        path: path to model, supports Questions, Generator, Sequences or custom pipeline
        quantize: True if model should be quantized before inference, False otherwise.
        gpu: if gpu inference should be used (only works if GPUs are available)
        model: optional existing pipeline model to wrap
        tokenizer: Tokenizer class
        minscore: minimum score to include context match, defaults to None
        mintokens: minimum number of tokens to include context match, defaults to None
        context: topn context matches to include, defaults to 3
        task: model task (language-generation, sequence-sequence or question-answering), defaults to auto-detect
        output: output format, 'default' returns (name, answer), 'flatten' returns answers and 'reference' returns (name, answer, reference)
    """

    # Similarity instance
    self.similarity = similarity

    # Question-Answer model. Can be prompt-driven LLM or extractive qa
    self.model = self.load(path, quantize, gpu, model, task)

    # Tokenizer class use default method if not set
    self.tokenizer = tokenizer if tokenizer else Tokenizer() if hasattr(self.similarity, "scoring") and self.similarity.scoring else None

    # Minimum score to include context match
    self.minscore = minscore if minscore is not None else 0.0

    # Minimum number of tokens to include context match
    self.mintokens = mintokens if mintokens is not None else 0.0

    # Top n context matches to include for context
    self.context = context if context else 3

    # Output format
    self.output = output
__call__(self, queue, texts=None)
special
Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context. A model is then run against the context for each input question, with the answer returned.
Parameters:
Name | Description | Default
---|---|---
queue | input question queue (name, query, question, snippet), can be list of tuples or dicts | required
texts | optional list of text for context, otherwise runs embeddings search | None
Returns:
list of answers matching input format (tuple or dict) containing fields as specified by output format
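A usage sketch of the call signature above, passing a dict queue and explicit texts (models reuse the earlier example); answers come back in the same shape as the input, per the return description:

```python
from txtai.embeddings import Embeddings
from txtai.pipeline import Extractor

embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
extractor = Extractor(embeddings, "distilbert-base-cased-distilled-squad")

# Dict input returns dict-shaped answers; tuple input returns (name, answer) tuples
answers = extractor([{"name": "What was won", "query": "What was won",
                      "question": "What was won", "snippet": False}],
                    ["Maine man wins $1M from $25 lottery ticket"])
```

Passing texts ranks the provided list directly; omitting it runs an embeddings search to build the context.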
Source code in txtai/pipeline/text/extractor.py
def __call__(self, queue, texts=None):
    """
    Finds answers to input questions. This method runs queries to find the top n best matches and uses that as the context.
    A model is then run against the context for each input question, with the answer returned.

    Args:
        queue: input question queue (name, query, question, snippet), can be list of tuples or dicts
        texts: optional list of text for context, otherwise runs embeddings search

    Returns:
        list of answers matching input format (tuple or dict) containing fields as specified by output format
    """

    # Save original queue format
    inputs = queue

    # Convert dictionary inputs to tuples
    if queue and isinstance(queue[0], dict):
        # Convert dict to tuple
        queue = [tuple(row.get(x) for x in ["name", "query", "question", "snippet"]) for row in queue]

    # Rank texts by similarity for each query
    results = self.query([query for _, query, _, _ in queue], texts)

    # Build question-context pairs
    names, queries, questions, contexts, topns, snippets = [], [], [], [], [], []
    for x, (name, query, question, snippet) in enumerate(queue):
        # Get top n best matching segments
        topn = sorted(results[x], key=lambda y: y[2], reverse=True)[: self.context]

        # Generate context using ordering from texts, if available, otherwise order by score
        context = " ".join(text for _, text, _ in (sorted(topn, key=lambda y: y[0]) if texts else topn))

        names.append(name)
        queries.append(query)
        questions.append(question)
        contexts.append(context)
        topns.append(topn)
        snippets.append(snippet)

    # Run pipeline and return answers
    answers = self.answers(names, questions, contexts, [[text for _, text, _ in topn] for topn in topns], snippets)

    # Apply output formatting to answers and return
    return self.apply(inputs, queries, answers, topns)