Index and search analysis
Text analysis occurs at two times:
Index time
When a document is indexed, any text field values are analyzed.
Search time
When running a full-text search on a text field, the query string (the text the user is searching for) is analyzed. Search time is also called query time.
The analyzer, or set of analysis rules, used at each time is called the index analyzer or search analyzer respectively.
How the index and search analyzer work together
In most cases, the same analyzer should be used at index and search time. This ensures the values and query strings for a field are changed into the same form of tokens. In turn, this ensures the tokens match as expected during a search.
Example
A document is indexed with the following value in a text field:
The QUICK brown foxes jumped over the dog!
The index analyzer for the field converts the value into tokens and normalizes them. In this case, each of the tokens represents a word:
[ quick, brown, fox, jump, over, dog ]
These tokens are then indexed.
Later, a user searches the same text field for:
"Quick fox"
The user expects this search to match the sentence indexed earlier, "The QUICK brown foxes jumped over the dog!"
However, the query string does not contain the exact words used in the document’s original text:
Quick vs QUICK
fox vs foxes
To account for this, the query string is analyzed using the same analyzer. This analyzer produces the following tokens:
[ quick, fox ]
To execute the search, Elasticsearch compares these query string tokens to the tokens indexed in the text field.
Token | Query string | text field |
---|---|---|
quick | X | X |
brown |  | X |
fox | X | X |
jump |  | X |
over |  | X |
dog |  | X |
Because the field value and query string were analyzed in the same way, they created similar tokens. The tokens quick and fox are exact matches. This means the search matches the document containing "The QUICK brown foxes jumped over the dog!", just as the user expects.
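The walkthrough above can be sketched in a few lines of Python. This is a toy stand-in, not Elasticsearch's actual analyzer: the stop-word list and suffix rules are assumptions chosen only so that the output reproduces the tokens shown in this example.

```python
import re

# Hypothetical stop-word list and suffix rules, chosen only to reproduce
# the tokens shown in the example; real analyzers are far more thorough.
STOP_WORDS = {"the"}
SUFFIXES = ("es", "ed")


def analyze(text):
    """Toy analyzer: lowercase, split on non-letters, drop stop words, strip suffixes."""
    tokens = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            # Strip at most one suffix, and only from words long enough to keep a stem.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


# The same function plays the role of both the index and the search analyzer.
indexed_tokens = analyze("The QUICK brown foxes jumped over the dog!")
query_tokens = analyze("Quick fox")

print(indexed_tokens)  # ['quick', 'brown', 'fox', 'jump', 'over', 'dog']
print(query_tokens)    # ['quick', 'fox']

# Every query token also appears among the indexed tokens.
print(all(token in indexed_tokens for token in query_tokens))  # True
```

Because one function handles both the field value and the query string, differences in case and word form disappear before the tokens are compared.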
When to use a different search analyzer
While less common, it sometimes makes sense to use different analyzers at index and search time. To enable this, Elasticsearch allows you to specify a separate search analyzer.
Generally, a separate search analyzer should only be specified when using the same form of tokens for field values and query strings would create unexpected or irrelevant search matches.
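As a rough sketch, a separate search analyzer is set with the search_analyzer mapping parameter alongside analyzer. The request body below is written as a Python dictionary; the field name title is hypothetical, and the built-in whitespace and simple analyzers are picked arbitrarily to show the shape of the mapping rather than a recommended pairing.

```python
import json

# Hypothetical create-index request body. The field name "title" is a
# placeholder; "whitespace" and "simple" are built-in analyzers used here
# only to illustrate the two parameters.
index_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "whitespace",      # used at index time
                "search_analyzer": "simple",   # used at search time
            }
        }
    }
}

# Serialize to the JSON that would be sent when creating the index.
print(json.dumps(index_body, indent=2))
```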
Example
Elasticsearch is used to create a search engine that matches only words that start with a provided prefix. For instance, a search for tr should return tram or trope, but never taxi or bat.
A document is added to the search engine’s index; this document contains one such word in a text field:
"Apple"
The index analyzer for the field converts the value into tokens and normalizes them. In this case, each of the tokens represents a potential prefix for the word:
[ a, ap, app, appl, apple ]
These tokens are then indexed.
Later, a user searches the same text field for:
"appli"
The user expects this search to match only words that start with appli, such as appliance or application. The search should not match apple.
However, if the index analyzer were used to analyze this query string, it would produce the following tokens:
[ a, ap, app, appl, appli ]
When Elasticsearch compares these query string tokens to the ones indexed for apple, it finds several matches.
Token | appli | apple |
---|---|---|
a | X | X |
ap | X | X |
app | X | X |
appl | X | X |
appli | X |  |
This means the search would erroneously match apple. Not only that, it would match any word starting with a.
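The overlap is easy to reproduce. The sketch below uses a hypothetical edge_ngrams helper as a stand-in for the prefix-producing index analyzer; it is not Elasticsearch's implementation, only an illustration of why analyzing the query the same way produces shared tokens.

```python
def edge_ngrams(word):
    """Return every prefix of the lowercased word, shortest first."""
    word = word.lower()
    return [word[:i] for i in range(1, len(word) + 1)]


# Both the field value and the query string go through the same analyzer.
indexed_tokens = edge_ngrams("Apple")
query_tokens = edge_ngrams("appli")

print(indexed_tokens)  # ['a', 'ap', 'app', 'appl', 'apple']
print(query_tokens)    # ['a', 'ap', 'app', 'appl', 'appli']

# Query tokens that also exist in the index for "Apple".
shared = [token for token in query_tokens if token in indexed_tokens]
print(shared)  # ['a', 'ap', 'app', 'appl']
```

Any indexed word beginning with a shares at least the token a with the query, which is why the match set grows far beyond what the user intended.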
To fix this, you can specify a different search analyzer for query strings used on the text field.
In this case, you could specify a search analyzer that produces a single token rather than a set of prefixes:
[ appli ]
This query string token would only match tokens for words that start with appli, which better aligns with the user’s search expectations.
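A minimal sketch of this arrangement, assuming a tiny in-memory inverted index: the function names index_analyzer, search_analyzer, and search, and the sample words, are all hypothetical and only mimic the behavior described here.

```python
from collections import defaultdict


def index_analyzer(word):
    """Hypothetical index analyzer: every prefix of the lowercased word."""
    word = word.lower()
    return [word[:i] for i in range(1, len(word) + 1)]


def search_analyzer(query):
    """Hypothetical search analyzer: the lowercased query as a single token."""
    return [query.lower()]


# Build a toy inverted index mapping each prefix token to the words containing it.
inverted_index = defaultdict(set)
for doc in ["Apple", "Appliance", "Application", "Taxi"]:
    for token in index_analyzer(doc):
        inverted_index[token].add(doc)


def search(query):
    """Return the words indexed under every token of the analyzed query."""
    token_sets = [inverted_index.get(t, set()) for t in search_analyzer(query)]
    return sorted(set.intersection(*token_sets))


print(search("appli"))  # ['Appliance', 'Application']
print(search("a"))      # ['Apple', 'Appliance', 'Application']
```

Since the query is reduced to one token, a word is returned only if that exact prefix was indexed for it, so apple no longer matches appli.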