Similarity

The Similarity pipeline computes similarity between queries and a list of text using a text classifier.

This pipeline supports both standard text classification models and zero-shot classification models. The pipeline uses the queries as labels for the input text. The results are then transposed so that scores are grouped per query/label rather than per input text.
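The transposition step can be sketched in plain Python. This is only an illustration with made-up scores: the names `per_text`, `rows`, `per_query` and `ranked` are hypothetical, and `zip` stands in for the array transpose used in the pipeline source.

```python
# Hypothetical classifier output: one row per input text, each row holding
# (label_id, score) pairs where label_id indexes into queries.
# The scores are invented for illustration.
queries = ["feel good story", "wildlife"]
texts = [
    "Maine man wins $1M from $25 lottery ticket",
    "Don't sacrifice slower friends in a bear attack",
]

per_text = [
    [(0, 0.9), (1, 0.1)],  # scores for texts[0]
    [(1, 0.8), (0, 0.2)],  # scores for texts[1], labels in a different order
]

# Order each row by label id so that column j always refers to queries[j]
rows = [[score for _, score in sorted(row)] for row in per_text]

# Transpose: one row per query, one column per text
per_query = [list(column) for column in zip(*rows)]

# Rank the texts for each query, highest score first
ranked = [sorted(enumerate(row), key=lambda x: x[1], reverse=True) for row in per_query]
```

After these steps, `ranked[j]` holds `(text index, score)` pairs for `queries[j]`.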

Cross-encoder models are supported via the crossencode=True constructor parameter. These models are loaded with a CrossEncoder pipeline that can also be instantiated directly. The CrossEncoder pipeline has the same methods and functionality as described below.

Example

The following shows a simple example using this pipeline.

  from txtai.pipeline import Similarity

  # Create and run pipeline
  similarity = Similarity()
  similarity("feel good story", [
      "Maine man wins $1M from $25 lottery ticket",
      "Don't sacrifice slower friends in a bear attack"
  ])

See the link below for a more detailed example.

Notebook                              Description
Add semantic search to Elasticsearch  Add semantic search to existing search systems

Configuration-driven example

Pipelines are run with Python or configuration. Pipelines can be instantiated in configuration using the lower case name of the pipeline. Configuration-driven pipelines are run with workflows or the API.

config.yml

  # Create pipeline using lower case class name
  similarity:

Run with Workflows

  from txtai.app import Application

  # Create and run pipeline with workflow
  app = Application("config.yml")
  app.similarity("feel good story", [
      "Maine man wins $1M from $25 lottery ticket",
      "Don't sacrifice slower friends in a bear attack"
  ])

Run with API

  CONFIG=config.yml uvicorn "txtai.api:app" &

  curl \
    -X POST "http://localhost:8000/similarity" \
    -H "Content-Type: application/json" \
    -d '{"query": "feel good story", "texts": ["Maine man wins $1M from $25 lottery ticket", "Dont sacrifice slower friends in a bear attack"]}'
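The same request can be built from Python with the standard library. This is a minimal sketch, assuming the API is served at the URL used in the curl command; `build_similarity_request` is a hypothetical helper, not part of txtai.

```python
import json
import urllib.request


def build_similarity_request(query, texts, base_url="http://localhost:8000"):
    """Build (but do not send) the POST request issued by the curl command."""
    body = json.dumps({"query": query, "texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/similarity",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_similarity_request(
    "feel good story",
    [
        "Maine man wins $1M from $25 lottery ticket",
        "Don't sacrifice slower friends in a bear attack",
    ],
)

# Sending the request requires the API server to be running:
# with urllib.request.urlopen(request) as response:
#     results = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.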

Methods

Python documentation for the pipeline.

__init__(self, path=None, quantize=False, gpu=True, model=None, dynamic=True, crossencode=False, **kwargs) special

Source code in txtai/pipeline/text/similarity.py

  def __init__(self, path=None, quantize=False, gpu=True, model=None, dynamic=True, crossencode=False, **kwargs):
      # Use zero-shot classification if dynamic is True and crossencode is False, otherwise use standard text classification
      super().__init__(path, quantize, gpu, model, False if crossencode else dynamic, **kwargs)

      # Load as a cross-encoder if crossencode set to True
      self.crossencoder = CrossEncoder(model=self.pipeline) if crossencode else None
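The way the `dynamic` and `crossencode` flags interact in the constructor can be summarized with a small standalone function. This is a sketch of the decision logic only; `select_mode` and the mode names it returns are hypothetical and do not exist in txtai.

```python
def select_mode(dynamic=True, crossencode=False):
    """Mirror the constructor's flag handling and name the resulting mode."""
    # crossencode takes precedence: dynamic is forced to False when it is set
    effective_dynamic = False if crossencode else dynamic

    if crossencode:
        return "cross-encoder"
    return "zero-shot" if effective_dynamic else "text-classification"


default_mode = select_mode()
classifier_mode = select_mode(dynamic=False)
crossencoder_mode = select_mode(dynamic=True, crossencode=True)
```

With the default arguments the result is the zero-shot mode, and `dynamic` has no effect once `crossencode=True` is passed.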

__call__(self, query, texts, multilabel=True) special

Computes the similarity between query and list of text. Returns a list of (id, score) sorted by highest score, where id is the index in texts.

This method supports query as a string or a list. If the query is a string, the return type is a 1D list of (id, score). If the query is a list, a 2D list of (id, score) is returned with one row per query string.
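The two return shapes can be illustrated without loading a model. The sketch below uses a toy word-overlap scorer as a stand-in for the classifier, so the scores are not what the pipeline would produce; `toy_similarity` is a hypothetical function that only imitates the return structure.

```python
def toy_similarity(query, texts):
    """Stand-in for the pipeline call: scores by word overlap, not a model."""

    def score(q, text):
        q_words = set(q.lower().split())
        t_words = set(text.lower().split())
        return len(q_words & t_words) / len(q_words) if q_words else 0.0

    queries = [query] if isinstance(query, str) else query

    # One ranked list of (id, score) per query, highest score first
    results = [
        sorted(((i, score(q, t)) for i, t in enumerate(texts)), key=lambda x: x[1], reverse=True)
        for q in queries
    ]

    # A string query returns a single row, a list returns one row per query
    return results[0] if isinstance(query, str) else results


texts = ["bear attack survival tips", "lottery ticket wins big"]

single = toy_similarity("lottery ticket", texts)                  # 1D list
batch = toy_similarity(["lottery ticket", "bear attack"], texts)  # 2D list

# Each id is an index into texts, so the best match can be looked up directly
best_id, best_score = single[0]
best_text = texts[best_id]
```

Because each id is a position in `texts`, callers need to keep the original list around to turn results back into text.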

Parameters:

Name        Type  Description                                     Default
query             query text|list                                 required
texts             list of text                                    required
multilabel        labels are independent if True, scores are      True
                  normalized to sum to 1 per text item if False,
                  raw scores returned if None
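The three multilabel settings can be sketched as three ways of post-processing raw per-label scores for one text item. This is an illustration under an assumption: sigmoid and softmax are used here as common choices for independent and normalized scores, and the exact functions applied depend on the underlying model. `normalize` is a hypothetical helper.

```python
import math


def normalize(logits, multilabel=True):
    """Post-process raw per-label scores for a single text item."""
    if multilabel is None:
        # Raw scores, returned unchanged
        return list(logits)

    if multilabel:
        # Independent labels: each score is squashed on its own (sigmoid assumed)
        return [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Competing labels: scores sum to 1 across labels (softmax assumed)
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


independent = normalize([0.0, 0.0], multilabel=True)
competing = normalize([1.0, 2.0, 3.0], multilabel=False)
raw = normalize([1.5, -2.0], multilabel=None)
```

With independent labels, several queries can all score highly against the same text, while normalized scores force the queries to compete for each text.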

Returns:

Type  Description
      list of (id, score)

Source code in txtai/pipeline/text/similarity.py

  def __call__(self, query, texts, multilabel=True):
      """
      Computes the similarity between query and list of text. Returns a list of
      (id, score) sorted by highest score, where id is the index in texts.

      This method supports query as a string or a list. If the input is a string,
      the return type is a 1D list of (id, score). If text is a list, a 2D list
      of (id, score) is returned with a row per string.

      Args:
          query: query text|list
          texts: list of text
          multilabel: labels are independent if True, scores are normalized to sum to 1 per text item if False, raw scores returned if None

      Returns:
          list of (id, score)
      """

      if self.crossencoder:
          # pylint: disable=E1102
          return self.crossencoder(query, texts, multilabel)

      # Call Labels pipeline for texts using input query as the candidate label
      scores = super().__call__(texts, [query] if isinstance(query, str) else query, multilabel)

      # Sort on query index id
      scores = [[score for _, score in sorted(row)] for row in scores]

      # Transpose axes to get a list of text scores for each query
      scores = np.array(scores).T.tolist()

      # Build list of (id, score) per query sorted by highest score
      scores = [sorted(enumerate(row), key=lambda x: x[1], reverse=True) for row in scores]

      return scores[0] if isinstance(query, str) else scores