Query guide

This section covers how to query data with txtai. The simplest way to search for data is building a natural language string with the desired content to find. txtai also supports querying with SQL. We’ll cover both methods here.

Natural language queries

In the simplest case, the query is text and the results are index text that is most similar to the query text.

    embeddings.search("feel good story")
    embeddings.search("wildlife")

The queries above search the index for similarity matches on feel good story and wildlife. If content storage is enabled, a list of dictionaries of the form {**query columns} is returned. Otherwise, a list of (id, score) tuples is returned.
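Since the return shape depends on whether content storage is enabled, calling code may need to handle both. The following is a minimal sketch that runs on sample data rather than a live index; `normalize_results` is a hypothetical helper, not part of txtai.

```python
def normalize_results(results):
    """Convert search results into a list of dictionaries.

    Hypothetical helper (not part of txtai). Accepts either a list of
    (id, score) tuples or a list of dictionaries, as described above.
    """
    normalized = []
    for row in results:
        if isinstance(row, dict):
            # Content storage enabled: rows are already dictionaries
            normalized.append(dict(row))
        else:
            # Content storage disabled: rows are (id, score) tuples
            uid, score = row
            normalized.append({"id": uid, "score": score})
    return normalized


# Sample data standing in for embeddings.search(...) output
without_content = [(0, 0.61), (3, 0.42)]
with_content = [{"id": "0", "text": "feel good story", "score": 0.61}]

print(normalize_results(without_content))
print(normalize_results(with_content))
```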

SQL

txtai supports more complex queries with SQL. This is only supported if content storage is enabled. txtai has a translation layer that analyzes input SQL statements and combines similarity results with content stored in a relational database.

SQL queries are run through embeddings.search like natural language queries but the examples below only show the SQL query for conciseness.

    embeddings.search("SQL query")

Similar clause

The similar clause is a txtai function that enables similarity searches with SQL.

    SELECT id, text, score FROM txtai WHERE similar('feel good story')

The similar clause takes the following arguments:

    similar("query", "number of candidates", "index", "weights")

Argument              Description
query                 natural language query to run
number of candidates  number of candidate results to return
index                 target index name
weights               hybrid score weights

The txtai query layer joins results from two separate components, a relational store and a similarity index. With a similar clause, a similarity search is run and those ids are fed to the underlying database query.

The number of candidates should be larger than the desired number of results when applying additional filter clauses. This ensures that enough results remain to satisfy the query limit after the additional filters are applied. If the number of candidates is not specified, it defaults as follows:

  • For a single query filter clause, the default is the query limit
  • With multiple filtering clauses, the default is 10x the query limit
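The effect of the candidate count can be illustrated without a live index. The sketch below is not txtai code: `default_candidates` and `search_with_filter` are hypothetical names, the rows are toy data, and counting similar() as one of the filter clauses is an assumption about how the rule above is applied.

```python
def default_candidates(limit, filter_clauses):
    """Illustrative version of the defaulting rule described above.

    Assumption: filter_clauses counts every WHERE condition, including
    the similar() clause itself.
    """
    return limit if filter_clauses <= 1 else limit * 10


def search_with_filter(scored_rows, limit, candidates, predicate):
    """Take the top `candidates` rows by score, filter them, then apply the limit."""
    top = sorted(scored_rows, key=lambda r: r["score"], reverse=True)[:candidates]
    return [r for r in top if predicate(r)][:limit]


# Toy data: 20 rows, only every fourth row has the flag set
rows = [{"id": i, "score": 1.0 - i / 100, "flag": i % 4 == 0} for i in range(20)]

limit = 3

# Candidates equal to the limit: the filter leaves fewer than `limit` rows
few = search_with_filter(rows, limit, candidates=limit, predicate=lambda r: r["flag"])

# Candidates at 10x the limit: enough rows survive the filter
enough = search_with_filter(rows, limit, candidates=default_candidates(limit, 2),
                            predicate=lambda r: r["flag"])

print(len(few), len(enough))  # prints "1 3"
```

With only three candidates, two of them are discarded by the flag filter and a single row is returned, which is why a larger candidate pool is used when more than one filter clause is present.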

The index name is only applicable when subindexes are enabled. This specifies the index to use for the query.

The weights argument sets the hybrid score weights when an embeddings database has both a sparse and a dense index.

Dynamic columns

Content can be indexed in multiple ways when content storage is enabled. Remember that input documents take the form of (id, data, tags) tuples. If data is a string or binary content, it’s indexed and searchable with similar() clauses.

If data is a dictionary, then all fields in the dictionary are stored and available via SQL. The text field or field specified in the index configuration is indexed and searchable with similar() clauses.

For example:

    embeddings.index([{"text": "text to index", "flag": True,
                       "actiondate": "2022-01-01"}])

With the above input data, queries can now have more complex filters.

    SELECT text, flag, actiondate FROM txtai WHERE similar('query') AND flag = 1
    AND actiondate >= '2022-01-01'

txtai’s query layer automatically detects columns and translates queries into a format that can be understood by the underlying database.

Nested dictionaries/JSON are supported, and nested column names can be escaped with bracket statements.

    embeddings.index([{"text": "text to index",
                       "parent": {"child element": "abc"}}])

    SELECT text FROM txtai WHERE [parent.child element] = 'abc'

Note the bracket statement escaping the nested column with spaces in the name.

Bind parameters

txtai has support for SQL bind parameters.

    # Query with a bind parameter for similar clause
    query = "SELECT id, text, score FROM txtai WHERE similar(:x)"
    results = embeddings.search(query, parameters={"x": "feel good story"})

    # Query with a bind parameter for column filter
    query = "SELECT text, flag, actiondate FROM txtai WHERE flag = :x"
    results = embeddings.search(query, parameters={"x": 1})
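To show why bind parameters are useful, here is a small sketch using Python's built-in sqlite3 module directly rather than txtai. The `docs` table and its rows are made up for illustration. The named `:x` placeholder works the same way as in the queries above: the value is passed separately from the SQL text, so quotes inside the value need no escaping.

```python
import sqlite3

# In-memory database with a made-up table (illustration only, not txtai's schema)
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE docs (id INTEGER, text TEXT, flag INTEGER)")
connection.executemany(
    "INSERT INTO docs VALUES (?, ?, ?)",
    [(0, "it's a feel good story", 1), (1, "wildlife report", 0)],
)

# The value contains an apostrophe, which would break a naively
# concatenated SQL string but is safe as a bind parameter
rows = connection.execute(
    "SELECT id FROM docs WHERE text = :x", {"x": "it's a feel good story"}
).fetchall()

print(rows)  # prints "[(0,)]"
```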

Aggregation queries

The goal of txtai’s query language is to closely support all functions in the underlying database engine. The main challenge is ensuring dynamic columns are properly escaped into the engine's native query function.

Examples of aggregation queries:

    SELECT count(*) FROM txtai WHERE similar('feel good story') AND score >= 0.15

    SELECT max(length(text)) FROM txtai WHERE similar('feel good story')
    AND score >= 0.15

    SELECT count(*), flag FROM txtai GROUP BY flag ORDER BY count(*) DESC

Binary objects

txtai has support for storing and retrieving binary objects. Binary objects can be retrieved as shown in the example below.

    from txtai.embeddings import Embeddings

    # Create embeddings index with content and object storage enabled
    embeddings = Embeddings(content=True, objects=True)

    # Read the image once so the same bytes can be reused below
    with open("demo.gif", "rb") as f:
        data = f.read()

    # Insert record
    embeddings.index([("txtai", {"text": "txtai executes machine-learning workflows.",
                                 "object": data})])

    # Query txtai and get associated object
    query = "SELECT object FROM txtai WHERE similar('machine learning') LIMIT 1"
    result = embeddings.search(query)[0]["object"]

    # Query binary content with a bind parameter
    query = "SELECT object FROM txtai WHERE similar(:x) LIMIT 1"
    results = embeddings.search(query, parameters={"x": data})

Custom SQL functions

Custom, user-defined SQL functions extend selection, filtering and ordering clauses with additional logic. For example, the following snippet defines a function that translates text using a translation pipeline.

    from txtai.embeddings import Embeddings
    from txtai.pipeline import Translation

    # Translation pipeline
    translate = Translation()

    # Create embeddings index
    embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2",
                            content=True,
                            functions=[translate])

    # Run a search using a custom SQL function
    embeddings.search("""
    SELECT
      text,
      translation(text, 'de', null) 'text (DE)',
      translation(text, 'es', null) 'text (ES)',
      translation(text, 'fr', null) 'text (FR)'
    FROM txtai WHERE similar('feel good story')
    LIMIT 1
    """)
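The general idea of a user-defined SQL function can be sketched with Python's built-in sqlite3 module, which lets a Python callable be registered under a SQL name. This is an illustration of the concept only and makes no claim about how txtai registers functions internally; `shout` and the `docs` table are hypothetical.

```python
import sqlite3


def shout(text):
    """Hypothetical custom function: upper-case the input text."""
    return text.upper()


connection = sqlite3.connect(":memory:")

# Register the Python callable as a one-argument SQL function named "shout"
connection.create_function("shout", 1, shout)

connection.execute("CREATE TABLE docs (text TEXT)")
connection.execute("INSERT INTO docs VALUES ('feel good story')")

# The custom function can now be used in a selection clause
rows = connection.execute("SELECT text, shout(text) AS loud FROM docs").fetchall()

print(rows)  # prints "[('feel good story', 'FEEL GOOD STORY')]"
```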

Query translation

Natural language queries with filters can be converted to txtai-compatible SQL statements with query translation. For example:

    embeddings.search("feel good story since yesterday")

can be converted to a SQL statement with a similar clause and date filter.

    select id, text, score from txtai where similar('feel good story') and
    entry >= date('now', '-1 day')

This requires setting a query translation model. The default query translation model is t5-small-txtsql, but it can be fine-tuned to handle different use cases.

Hybrid search

When an embeddings database has both a sparse and dense index, both indexes will be queried and the results will be equally weighted unless otherwise specified.

    embeddings.search("query", weights=0.5)
    embeddings.search("SELECT id, text, score FROM txtai WHERE similar('query', 0.5)")
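One way to picture a weighted combination of sparse and dense results is a convex combination of the two scores. The sketch below is an assumption for illustration, not txtai's actual scoring code: `hybrid_scores` is a hypothetical function, both score sets are assumed to be normalized to the same range, and applying the weight to the dense score (with the remainder going to the sparse score) is also an assumption.

```python
def hybrid_scores(dense, sparse, weight=0.5):
    """Combine two {id: score} maps into a ranked list of (id, score).

    Assumption: `weight` applies to the dense score and (1 - weight) to
    the sparse score. An id missing from one map scores 0.0 there.
    """
    ids = set(dense) | set(sparse)
    combined = {
        uid: weight * dense.get(uid, 0.0) + (1 - weight) * sparse.get(uid, 0.0)
        for uid in ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Toy scores standing in for dense (vector) and sparse (keyword) results
dense = {"a": 0.8, "b": 0.4}
sparse = {"b": 1.0, "c": 0.5}

# Equal weighting: "b" ranks first because it matches in both indexes
for uid, score in hybrid_scores(dense, sparse, weight=0.5):
    print(uid, round(score, 2))
```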

Graph search

If an embeddings database has an associated graph network, graph searches can be run. The search syntax below uses openCypher. See the openCypher documentation to learn more about this syntax.

Additionally, standard embeddings searches can be returned as graphs.

    # Find all paths between id: 0 and id: 5 between 1 and 3 hops away
    embeddings.graph.search("""
    MATCH P=({id: 0})-[*1..3]->({id: 5})
    RETURN P
    """)

    # Standard embeddings search as graph
    embeddings.search("query", graph=True)

Subindexes

Subindexes can be queried as follows:

    # Query with index parameter
    embeddings.search("query", index="subindex1")

    # Specify with SQL
    embeddings.search("""
    SELECT id, text, score FROM txtai
    WHERE similar('query', 'subindex1')
    """)

Combined index architecture

txtai has multiple storage and indexing components. Content is stored in an underlying database along with an approximate nearest neighbor (ANN) index, keyword index and graph network. These components combine to deliver similarity search alongside traditional structured search.

The ANN index stores ids and vectors for each input element. When a natural language query is run, the query is translated into a vector and a similarity query finds the best matching ids. When a database is added into the mix, an additional step is executed. This step takes those ids and effectively inserts them as part of the underlying database query. The same steps apply with keyword indexes except a term frequency index is used to find the best matching ids.
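The two-step flow described above can be sketched with toy components. This is an illustration under stated assumptions, not txtai's implementation: a dictionary of vectors with brute-force cosine similarity stands in for the ANN index, sqlite3 stands in for the content database, and the `sections` table, `search` function and sample rows are all hypothetical.

```python
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Stand-in for the ANN index: id -> vector
vectors = {0: [1.0, 0.0], 1: [0.9, 0.1], 2: [0.0, 1.0]}

# Stand-in for the content database
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, text TEXT, flag INTEGER)")
connection.executemany(
    "INSERT INTO sections VALUES (?, ?, ?)",
    [(0, "feel good story", 0), (1, "heartwarming rescue", 1), (2, "stock market report", 1)],
)


def search(query_vector, limit, where="1 = 1"):
    # Step 1: similarity query finds the best matching ids
    scored = sorted(((cosine(query_vector, v), uid) for uid, v in vectors.items()),
                    reverse=True)[:limit]
    scores = {uid: score for score, uid in scored}

    # Step 2: the ids are inserted into the database query.
    # `where` is trusted, hard-coded SQL in this sketch only.
    placeholders = ", ".join("?" for _ in scores)
    sql = f"SELECT id, text FROM sections WHERE id IN ({placeholders}) AND {where}"
    rows = connection.execute(sql, list(scores)).fetchall()

    # Step 3: attach similarity scores and order by them
    results = [{"id": uid, "text": text, "score": scores[uid]} for uid, text in rows]
    return sorted(results, key=lambda r: r["score"], reverse=True)


results = search([1.0, 0.0], limit=2, where="flag = 1")
print([r["id"] for r in results])  # prints "[1]"
```

Only one row comes back even though the limit is 2, because the other candidate is removed by the flag filter. This is the situation the number of candidates argument of the similar clause is meant to address.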

Dynamic columns are supported via the underlying engine. For SQLite, data is stored as JSON and dynamic columns are converted into json_extract clauses. Client-server databases are supported via SQLAlchemy and dynamic columns are supported provided the underlying engine has JSON support.
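The json_extract approach can be tried directly with Python's built-in sqlite3 module. The sketch below is a simplified illustration, not txtai's query translation layer: the `documents` table and the `column` helper are hypothetical, and it assumes the SQLite build bundled with Python includes the JSON functions.

```python
import json
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, data JSON)")

documents = [
    {"text": "text to index", "flag": True, "actiondate": "2022-01-01",
     "parent": {"child element": "abc"}},
    {"text": "other text", "flag": False, "actiondate": "2021-06-30",
     "parent": {"child element": "xyz"}},
]

# Each document is stored as a JSON string
connection.executemany("INSERT INTO documents (data) VALUES (?)",
                       [(json.dumps(d),) for d in documents])


def column(name):
    """Build a json_extract clause for a (possibly nested) dynamic column.

    Hypothetical helper. Each path element is quoted so that names
    containing spaces remain valid in the JSON path.
    """
    path = ".".join(f'"{part}"' for part in name.split("."))
    return f"json_extract(data, '$.{path}')"


# Filter on dynamic columns; JSON true is returned as integer 1
sql = (f"SELECT {column('text')} FROM documents "
       f"WHERE {column('flag')} = 1 AND {column('actiondate')} >= '2022-01-01'")
rows = connection.execute(sql).fetchall()

# Filter on a nested column with a space in its name
sql = f"SELECT {column('text')} FROM documents WHERE {column('parent.child element')} = 'abc'"
nested = connection.execute(sql).fetchall()

print(rows, nested)
```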