Tokenization
Tokenization is the act of taking a sentence or phrase and splitting it into smaller units of language, called tokens. It is the first step of document indexing in the Meilisearch engine, and is a critical factor in the quality of search results.
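For instance, a simple tokenizer for Latin-script text might split on whitespace and punctuation. The hypothetical `tokenize` function below is only a sketch of that idea; the actual Meilisearch tokenizer does considerably more (normalization, script detection, and so on).

```rust
// Deliberately simplified tokenizer: split on whitespace and
// punctuation, drop empty pieces, and lowercase what remains.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| c.is_whitespace() || c.is_ascii_punctuation())
        .filter(|piece| !piece.is_empty())
        .map(str::to_lowercase)
        .collect()
}

fn main() {
    let tokens = tokenize("The quick brown fox, jumping over the lazy dog!");
    // ["the", "quick", "brown", "fox", "jumping", "over", "the", "lazy", "dog"]
    println!("{:?}", tokens);
}
```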
Breaking sentences into smaller chunks requires understanding where one word ends and another begins, making tokenization a highly complex and language-dependent task. Meilisearch's solution to this problem is a modular tokenizer that follows different processes, called pipelines, based on the language it detects.
This allows Meilisearch to function in several different languages with zero setup.
Deep dive: The Meilisearch tokenizer
When you add documents to a Meilisearch index, the tokenization process is handled by an abstract interface called an analyzer. The analyzer is responsible for determining the primary language of each field based on the scripts (e.g., Latin alphabet, Chinese hanzi) present in it. Then, it applies the corresponding pipeline to each field.
A field, or key-value pair, is a set of two linked data items: an attribute and its associated value, e.g. "attribute": "value".
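As a rough illustration of script-based detection, the hypothetical helper below classifies a string by counting characters in two well-known Unicode ranges. This is only a sketch under simplified assumptions; the real analyzer relies on dedicated script and language detection rather than a hand-rolled heuristic.

```rust
// Hypothetical heuristic: guess the dominant script of a string by
// counting ASCII letters versus CJK ideographs.
#[derive(Debug, PartialEq, Eq, Hash)]
enum Script {
    Latin,
    Han,
    Other,
}

fn detect_script(text: &str) -> Script {
    let (mut latin, mut han) = (0usize, 0usize);
    for c in text.chars() {
        if c.is_ascii_alphabetic() {
            latin += 1;
        } else if ('\u{4E00}'..='\u{9FFF}').contains(&c) {
            // CJK Unified Ideographs block
            han += 1;
        }
    }
    if han > latin {
        Script::Han
    } else if latin > 0 {
        Script::Latin
    } else {
        Script::Other
    }
}
```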
We can break down the tokenization process like so:
- Crawl the document(s) and determine the primary language for each field.
- Go back over the documents field by field, running the corresponding tokenization pipeline if one exists (see the sketch after this list).
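Here is a minimal sketch of that two-step flow, assuming the `Script` enum and `detect_script` helper from the previous snippet and a hypothetical `Pipeline` trait; it illustrates the idea rather than Meilisearch's actual internals.

```rust
use std::collections::HashMap;

// Hypothetical interface for a tokenization pipeline.
trait Pipeline {
    fn tokenize(&self, field_value: &str) -> Vec<String>;
}

// Step 1: detect the primary script of each field.
// Step 2: run the matching pipeline, falling back to a default one.
fn analyze(
    document: &HashMap<String, String>,
    pipelines: &HashMap<Script, Box<dyn Pipeline>>,
    default: &dyn Pipeline,
) -> HashMap<String, Vec<String>> {
    let mut tokens_per_field = HashMap::new();
    for (attribute, value) in document {
        let script = detect_script(value);
        let pipeline = pipelines
            .get(&script)
            .map(|p| p.as_ref())
            .unwrap_or(default);
        tokens_per_field.insert(attribute.clone(), pipeline.tokenize(value));
    }
    tokens_per_field
}
```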
Pipelines include many language-specific operations. Currently, we have two pipelines:
- A specialized Chinese pipeline using Jieba (see the segmentation example after this list)
- A default Meilisearch pipeline that separates words based on character categories and works with a wide variety of languages
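To give a feel for what the Chinese pipeline builds on, here is a small standalone example using the jieba-rs crate to segment a Chinese sentence; the output shown is illustrative, and this is plain Jieba usage rather than the pipeline itself.

```rust
use jieba_rs::Jieba;

fn main() {
    let jieba = Jieba::new();
    // "I came to Tsinghua University in Beijing"
    let tokens = jieba.cut("我来到北京清华大学", false);
    // e.g. ["我", "来到", "北京", "清华大学"]
    println!("{:?}", tokens);
}
```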
For more details, check out the feature specification.