Fulltext indices

Fulltext indices take the contents of one or more fields and split them into tokens that are used for fulltext search. The transformation from text to separate tokens is done by an analyzer. To run fulltext search queries on a column, a fulltext index with an analyzer must be defined for it.

Index definition

In CrateDB, every column’s data is indexed using the plain index method by default. Currently there are three choices related to index definition:

Warning

Adding an index to a column after the table has been created is currently not supported, so think carefully while designing your table definition.

Disable indexing

Indexing can be turned off by using the INDEX OFF column definition. Note that a column without an index can only be used as a result column and will never produce a hit when queried.

  cr> create table table_a (
  ...   first_column text INDEX OFF
  ... );
  CREATE OK, 1 row affected (... sec)

  cr> insert into table_a (first_column) values ('hello');
  INSERT OK, 1 row affected (... sec)

Querying a column that is not indexed will return an error:

  cr> select * from table_a where first_column = 'hello';
  SQLParseException[Cannot search on field [first_column] since it is not indexed.]
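
The column's contents can still be retrieved as part of a result set, though. A minimal sketch using the table above (the REFRESH makes the previously inserted row visible; the output shown is illustrative):

  cr> refresh table table_a;
  REFRESH OK, 1 row affected (... sec)

  cr> select first_column from table_a;
  +--------------+
  | first_column |
  +--------------+
  | hello        |
  +--------------+
  SELECT 1 row in set (... sec)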

Plain index (default)

An index of type plain indexes the input data as-is, without analyzing it. Using the plain index method is the default behaviour, but it can also be declared explicitly:

  cr> create table table_b1 (
  ...   first_column text INDEX using plain
  ... );
  CREATE OK, 1 row affected (... sec)

This results in the same behaviour as without any index declaration:

  cr> create table table_b2 (
  ...   first_column text
  ... );
  CREATE OK, 1 row affected (... sec)
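
In contrast to the INDEX OFF example above, exact-match queries on a plain-indexed column do produce hits. A minimal sketch (the sample row and the output are illustrative):

  cr> insert into table_b2 (first_column) values ('hello');
  INSERT OK, 1 row affected (... sec)

  cr> refresh table table_b2;
  REFRESH OK, 1 row affected (... sec)

  cr> select first_column from table_b2 where first_column = 'hello';
  +--------------+
  | first_column |
  +--------------+
  | hello        |
  +--------------+
  SELECT 1 row in set (... sec)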

Fulltext index with analyzer

By defining a fulltext index on a column, its analyzed data is indexed instead of the raw data. Thus, depending on the analyzer used, querying for the exact data may no longer work. See Built-in analyzers for details about the available built-in analyzers, or see Creating a custom analyzer.

If no analyzer is specified when using a fulltext index, the standard analyzer is used:

  cr> create table table_c (
  ...   first_column text INDEX using fulltext
  ... );
  CREATE OK, 1 row affected (... sec)

Defining the usage of a concrete analyzer is straightforward: the analyzer is passed as a parameter in the WITH clause:

  cr> create table table_d (
  ...   first_column text INDEX using fulltext with (analyzer = 'english')
  ... );
  CREATE OK, 1 row affected (... sec)
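
Fulltext indices are queried with the MATCH predicate rather than with equality operators. A minimal sketch against the table above (the sample row and the output are illustrative); note how the english analyzer lowercases the data, so the lowercase query term matches the capitalized input:

  cr> insert into table_d (first_column) values ('Hello World');
  INSERT OK, 1 row affected (... sec)

  cr> refresh table table_d;
  REFRESH OK, 1 row affected (... sec)

  cr> select first_column from table_d where match(first_column, 'hello');
  +--------------+
  | first_column |
  +--------------+
  | Hello World  |
  +--------------+
  SELECT 1 row in set (... sec)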

Defining a named index column

It’s also possible to define a named index column that treats the data of a given column as input. This is especially useful if you want to search both the exact and the analyzed data:

  cr> create table table_e (
  ...   first_column text,
  ...   INDEX first_column_ft using fulltext (first_column)
  ... );
  CREATE OK, 1 row affected (... sec)
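
With this definition, both query styles are available: an exact match against first_column and a fulltext MATCH against first_column_ft (which uses the default analyzer here). A minimal sketch (data and output are illustrative):

  cr> insert into table_e (first_column) values ('Hello World');
  INSERT OK, 1 row affected (... sec)

  cr> refresh table table_e;
  REFRESH OK, 1 row affected (... sec)

  cr> select first_column from table_e where first_column = 'Hello World';
  +--------------+
  | first_column |
  +--------------+
  | Hello World  |
  +--------------+
  SELECT 1 row in set (... sec)

  cr> select first_column from table_e where match(first_column_ft, 'hello');
  +--------------+
  | first_column |
  +--------------+
  | Hello World  |
  +--------------+
  SELECT 1 row in set (... sec)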

Of course, a custom analyzer can be specified for a named index column too:

  cr> create table table_f (
  ...   first_column text,
  ...   INDEX first_column_ft
  ...     using fulltext(first_column) with (analyzer = 'english')
  ... );
  CREATE OK, 1 row affected (... sec)

Defining a composite index

Defining a composite (or combined) index uses the same syntax as above, except that multiple columns are passed to the fulltext index method:

  cr> create table documents_a (
  ...   title text,
  ...   body text,
  ...   INDEX title_body_ft
  ...     using fulltext(title, body) with (analyzer = 'english')
  ... );
  CREATE OK, 1 row affected (... sec)
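
A MATCH on the composite index searches the analyzed contents of both columns at once. A minimal sketch (data and output are illustrative); the english analyzer's stemming lets the query term 'analyzer' match the plural 'analyzers' in the body:

  cr> insert into documents_a (title, body) values
  ... ('Fulltext indices', 'A short introduction to analyzers.');
  INSERT OK, 1 row affected (... sec)

  cr> refresh table documents_a;
  REFRESH OK, 1 row affected (... sec)

  cr> select title from documents_a where match(title_body_ft, 'analyzer');
  +------------------+
  | title            |
  +------------------+
  | Fulltext indices |
  +------------------+
  SELECT 1 row in set (... sec)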

Composite indices can include nested columns within object columns as well:

  cr> create table documents_b (
  ...   title text,
  ...   author object(dynamic) as (
  ...     name text,
  ...     birthday timestamp with time zone
  ...   ),
  ...   INDEX author_title_ft using fulltext(title, author['name'])
  ... );
  CREATE OK, 1 row affected (... sec)

Creating a custom analyzer

An analyzer consists of one tokenizer, zero or more token-filters, and zero or more char-filters.

When a field's content is analyzed into a stream of tokens, the char-filters are applied first. They are used to filter out or replace special characters in the stream of characters that makes up the content.

Tokenizers split the possibly filtered stream of characters into tokens.

Token-filters can add, delete, or transform tokens to finally produce the desired stream of tokens.

With these elements in place, analyzers provide fine-grained control over building the token stream used for fulltext search. For example, you can use language-specific analyzers, tokenizers, and token-filters to get proper search results for data provided in a certain language.

Here is a simple example:

  cr> CREATE ANALYZER myanalyzer (
  ...   TOKENIZER whitespace,
  ...   TOKEN_FILTERS (
  ...     lowercase,
  ...     kstem
  ...   ),
  ...   CHAR_FILTERS (
  ...     html_strip
  ...   )
  ... );
  CREATE OK, 1 row affected (... sec)

This creates a custom analyzer called myanalyzer. It uses the built-in whitespace tokenizer, two built-in token-filters (lowercase and kstem), and the built-in html_strip char-filter. It is possible to further customize built-in token-filters, char-filters, or tokenizers:

  cr> create ANALYZER myanalyzer_customized (
  ...   TOKENIZER whitespace,
  ...   TOKEN_FILTERS (
  ...     lowercase,
  ...     kstem
  ...   ),
  ...   CHAR_FILTERS (
  ...     mymapping WITH (
  ...       type = 'mapping',
  ...       mappings = ['ph=>f', 'qu=>q', 'foo=>bar']
  ...     )
  ...   )
  ... );
  CREATE OK, 1 row affected (... sec)

This example creates another analyzer, this time called myanalyzer_customized. It uses the same tokenizer and token-filters as the previous example, but specifies custom options for the mapping char-filter. The name (mymapping) is a custom name that must not conflict with built-in char-filters or other custom char-filters.

The type property is required, as it specifies which built-in char-filter should be customized. The other option, mappings, is specific to the used char-filter type.

Tokenizers and token-filters can be customized in the same way.
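
Once created, a custom analyzer is referenced by name in a fulltext index definition, just like a built-in one. A minimal sketch using a hypothetical table:

  cr> create table table_g (
  ...   content text INDEX using fulltext with (analyzer = 'myanalyzer')
  ... );
  CREATE OK, 1 row affected (... sec)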

Note

Altering analyzers is not supported yet.

See also

CREATE ANALYZER for the syntax reference.

Built-in tokenizers for a list of built-in tokenizers.

Built-in token filters for a list of built-in token-filters.

Built-in char filter for a list of built-in char-filters.

Extending a built-in analyzer

Existing analyzers can be used to create custom analyzers by extending them.

You can extend and parameterize Built-in analyzers like this:

  cr> create ANALYZER "german_snowball" extends snowball WITH (
  ...   language = 'german'
  ... );
  CREATE OK, 1 row affected (... sec)

If you extend a built-in analyzer, no tokenizer, char-filters, or token-filters can be defined. In this case, use the parameters available for the extended Built-in analyzers.
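
The extended analyzer is then available under its new name. A minimal sketch using a hypothetical table:

  cr> create table table_h (
  ...   first_column text INDEX using fulltext with (analyzer = 'german_snowball')
  ... );
  CREATE OK, 1 row affected (... sec)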

If you extend a custom analyzer, every part of the analyzer that is omitted will be taken from the extended one. Example:

  cr> create ANALYZER e2 EXTENDS myanalyzer (
  ...   TOKENIZER mypattern WITH (
  ...     type = 'pattern',
  ...     pattern = '.*'
  ...   )
  ... );
  CREATE OK, 1 row affected (... sec)

This analyzer will use the char-filters and token-filters from myanalyzer and will override the tokenizer with mypattern.

See also

See the reference documentation of the Built-in analyzers for detailed information on the available analyzers.