NLP and tokenization
Data tokenization
Manticore doesn’t store text as is for full-text searching. Instead, it extracts words and builds several structures that allow fast full-text searching. From the extracted words a dictionary is built, which allows a quick lookup to determine whether a word is present in the index. In addition, other structures record the documents and fields in which the word was found, as well as its positions inside a field. All of these are used when a full-text match is performed.
The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and searching time, and it operates at the character and word level.
At the character level, the engine allows only certain characters to pass; this is defined by the charset_table, and anything else is replaced with a whitespace (which is considered the default word separator). The charset_table also allows mappings, for example lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, or defined as a phrase boundary.
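Putting a few of these character-level settings together, a table could, for instance, keep only digits and Latin letters, lowercase them, blend a couple of symbols and ignore the soft hyphen. This is only a sketch, with a hypothetical table name and a deliberately minimal charset_table:

```sql
-- Minimal sketch: only digits, underscore and Latin letters are indexed;
-- uppercase letters are mapped (lowercased) to their lowercase counterparts;
-- '+' and '&' are treated as blended characters; the soft hyphen is ignored.
CREATE TABLE docs(title TEXT)
    charset_table = '0..9, A..Z->a..z, _, a..z'
    blend_chars   = '+, &'
    ignore_chars  = 'U+AD';
```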
At the word level, the base setting is min_word_len, which defines the minimum word length in characters to be accepted into the index. A common request is to match singular and plural forms of words; for this, morphology processors can be used.
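A minimal sketch combining both word-level settings (the table name is a placeholder, and stem_en is just one of the available morphology processors):

```sql
-- Words shorter than 3 characters are not indexed; English stemming maps
-- singular and plural forms (e.g. 'dogs' and 'dog') to the same token.
CREATE TABLE docs(title TEXT)
    min_word_len = '3'
    morphology   = 'stem_en';
```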
Going further, we might want one word to be matched as another because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another word.
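For instance, assuming a hypothetical wordforms file at /usr/local/share/manticore/wordforms.txt containing one mapping per line, such as `walks > walk` or `core 2 duo > c2d`, it can be attached like this:

```sql
-- The wordforms file path is an assumption; each line of the file maps one
-- or more source words to the token that actually gets indexed and searched.
CREATE TABLE docs(title TEXT)
    wordforms = '/usr/local/share/manticore/wordforms.txt';
```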
Very common words can have unwanted effects on searching, mostly because their high frequency requires a lot of computing to process their doc/hit lists. They can be blacklisted with the stop words functionality. This helps not only to speed up queries, but also to decrease the index size.
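The bundled stopword lists can be referenced simply by name (see the table in the next section); the sketch below assumes mixed English and Russian content:

```sql
-- 'en' and 'ru' refer to the stopword files shipped with Manticore;
-- an absolute path to a custom stopwords file can be used instead.
CREATE TABLE docs(title TEXT)
    stopwords = 'en ru';
```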
A more advanced form of blacklisting is bigrams, which creates a special combined token for a common (‘bigram’) word and the uncommon word next to it. This can speed up phrase searches several times when common words are involved.
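A sketch of such a setup, with a deliberately short, illustrative list of frequent words:

```sql
-- With bigram_index = 'first_freq', whenever one of the listed frequent words
-- is immediately followed by another keyword, a combined 'bigram' token is
-- also indexed, which speeds up phrase searches involving those words.
CREATE TABLE docs(title TEXT)
    bigram_freq_words = 'the, a, of, in'
    bigram_index      = 'first_freq';
```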
When indexing HTML content, it’s important not to index the HTML tags themselves, as they can introduce a lot of ‘noise’ into the index. HTML stripping can be used for this; it can be configured to strip tags but still index certain tag attributes, or to completely ignore the content of certain HTML elements.
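A sketch with the stripper enabled; the attribute and element lists are only illustrative:

```sql
-- Tags are stripped from the indexed text; the alt/title attributes of <img>
-- and the title attribute of <a> are still indexed, while the contents of
-- <style> and <script> elements are dropped entirely.
CREATE TABLE docs(content TEXT)
    html_strip           = '1'
    html_index_attrs     = 'img=alt,title; a=title'
    html_remove_elements = 'style, script';
```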
Supported languages
Manticore supports many languages. Basic support for most of them is enabled by default via charset_table = non_cjk, which is the default value.
For many languages we provide a stopwords file (you can also use your own), which you can use to improve search relevance.
For a few languages, advanced morphology is available, which can significantly improve search relevance through better segmentation and normalization using dictionary-based lemmatization or stemming algorithms.
The table below includes a complete list of supported languages. You can use it to find out how to enable (a combined example follows the list):
- basic support (column “Supported”)
- stopwords (column “Stopwords file name”)
- advanced morphology (column “Advanced morphology”)
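For example, putting the columns of the table together for English, a table could be created like this (the table name and fields are placeholders, and lemmatize_en additionally requires the English lemmatizer dictionary to be installed):

```sql
-- English: default basic charset, bundled English stopwords and lemmatization.
CREATE TABLE articles_en(title TEXT, body TEXT)
    charset_table = 'non_cjk'
    stopwords     = 'en'
    morphology    = 'lemmatize_en';
```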
Language | Supported | Stopwords file name | Advanced morphology | Notes |
---|---|---|---|---|
Afrikaans | charset_table=non_cjk | af | - | |
Arabic | charset_table=non_cjk | ar | morphology=stem_ar (Arabic stemmer); morphology=libstemmer_ar | |
Armenian | charset_table=non_cjk | hy | - | |
Assamese | specify charset_table manually | - | - | |
Basque | charset_table=non_cjk | eu | - | |
Bengali | charset_table=non_cjk | bn | - | |
Bishnupriya | specify charset_table manually | - | - | |
Buhid | specify charset_table manually | - | - | |
Bulgarian | charset_table=non_cjk | bg | - | |
Catalan | charset_table=non_cjk | ca | morphology=libstemmer_ca | |
Chinese | charset_table=chinese or ngram_chars=chinese | zh | morphology=icu_chinese or ngram_len=1 correspondingly | ICU dictionary-based segmentation is much more accurate than ngram-based |
Croatian | charset_table=non_cjk | hr | - | |
Kurdish | charset_table=non_cjk | ckb | - | |
Czech | charset_table=non_cjk | cz | morphology=stem_cz (Czech stemmer) | |
Danish | charset_table=non_cjk | da | morphology=libstemmer_da | |
Dutch | charset_table=non_cjk | nl | morphology=libstemmer_nl | |
English | charset_table=non_cjk | en | morphology=lemmatize_en (single root form); morphology=lemmatize_en_all (all root forms); morphology=stem_en (Porter’s English stemmer); morphology=stem_enru (Porter’s English and Russian stemmers); morphology=libstemmer_en (English from libstemmer) | |
Esperanto | charset_table=non_cjk | eo | - | |
Estonian | charset_table=non_cjk | et | - | |
Finnish | charset_table=non_cjk | fi | morphology=libstemmer_fi | |
French | charset_table=non_cjk | fr | morphology=libstemmer_fr | |
Galician | charset_table=non_cjk | gl | - | |
Garo | specify charset_table manually | - | - | |
German | charset_table=non_cjk | de | morphology=lemmatize_de (single root form); morphology=lemmatize_de_all (all root forms); morphology=libstemmer_de | |
Greek | charset_table=non_cjk | el | morphology=libstemmer_el | |
Hebrew | charset_table=non_cjk | he | - | |
Hindi | charset_table=non_cjk | hi | morphology=libstemmer_hi | |
Hmong | specify charset_table manually | - | - | |
Ho | specify charset_table manually | - | - | |
Hungarian | charset_table=non_cjk | hu | morphology=libstemmer_hu | |
Indonesian | charset_table=non_cjk | id | morphology=libstemmer_id | |
Irish | charset_table=non_cjk | ga | morphology=libstemmer_ga | |
Italian | charset_table=non_cjk | it | morphology=libstemmer_it | |
Japanese | ngram_chars=japanese | - | ngram_chars=japanese ngram_len=1 | Requires ngram-based segmentation |
Komi | specify charset_table manually | - | - | |
Korean | ngram_chars=korean | - | ngram_chars=korean ngram_len=1 | Requires ngram-based segmentation |
Large Flowery Miao | specify charset_table manually | - | - | |
Latin | charset_table=non_cjk | la | - | |
Latvian | charset_table=non_cjk | lv | - | |
Lithuanian | charset_table=non_cjk | lt | morphology=libstemmer_lt | |
Maba | specify charset_table manually | - | - | |
Maithili | specify charset_table manually | - | - | |
Marathi | charset_table=non_cjk | mr | - | |
Mende | specify charset_table manually | - | - | |
Mru | specify charset_table manually | - | - | |
Myene | specify charset_table manually | - | - | |
Nepali | specify charset_table manually | - | morphology=libstemmer_ne | |
Ngambay | specify charset_table manually | - | - | |
Norwegian | charset_table=non_cjk | no | morphology=libstemmer_no | |
Odia | specify charset_table manually | - | - | |
Persian | charset_table=non_cjk | fa | - | |
Polish | charset_table=non_cjk | pl | - | |
Portuguese | charset_table=non_cjk | pt | morphology=libstemmer_pt | |
Romanian | charset_table=non_cjk | ro | morphology=libstemmer_ro | |
Russian | charset_table=non_cjk | ru | morphology=lemmatize_ru (single root form); morphology=lemmatize_ru_all (all root forms); morphology=stem_ru (Porter’s Russian stemmer); morphology=stem_enru (Porter’s English and Russian stemmers); morphology=libstemmer_ru (from libstemmer) | |
Santali | specify charset_table manually | - | - | |
Sindhi | specify charset_table manually | - | - | |
Slovak | charset_table=non_cjk | sk | - | |
Slovenian | charset_table=non_cjk | sl | - | |
Somali | charset_table=non_cjk | so | - | |
Sotho | charset_table=non_cjk | st | - | |
Spanish | charset_table=non_cjk | es | morphology=libstemmer_es | |
Swahili | charset_table=non_cjk | sw | - | |
Swedish | charset_table=non_cjk | sv | morphology=libstemmer_sv | |
Sylheti | specify charset_table manually | - | - | |
Tamil | specify charset_table manually | - | morphology=libstemmer_ta | |
Thai | charset_table=non_cjk | th | - | |
Turkish | charset_table=non_cjk | tr | morphology=libstemmer_tr | |
Ukrainian | charset_table=non_cjk,U+0406->U+0456,U+0456,U+0407->U+0457,U+0457,U+0490->U+0491,U+0491 | - | morphology=lemmatize_uk_all | Requires installation of UK lemmatizer |
Yoruba | charset_table=non_cjk | yo | - | |
Zulu | charset_table=non_cjk | zu | - |
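As a sketch of how the CJK-related rows translate into settings, the statements below use ICU dictionary-based segmentation for Chinese and ngram-based segmentation for Japanese (table and field names are placeholders):

```sql
-- Chinese: ICU dictionary-based segmentation plus the bundled zh stopwords.
CREATE TABLE docs_zh(title TEXT)
    charset_table = 'chinese'
    morphology    = 'icu_chinese'
    stopwords     = 'zh';

-- Japanese: ngram-based segmentation (each CJK character becomes a token).
CREATE TABLE docs_ja(title TEXT)
    ngram_chars = 'japanese'
    ngram_len   = '1';
```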