Template table
A template table is a pseudo-table: it stores no data and creates no files on disk. It can, however, have the same NLP settings as a plain or real-time table. Template tables serve a few purposes:
- as a base from which plain/real-time tables inherit settings in the Plain mode, simply to keep the Manticore configuration file short
- keyword generation with the help of CALL KEYWORDS
- highlighting of an arbitrary string using CALL SNIPPETS
Creating a template table via a configuration file:
table template {
type = template
morphology = stem_en
wordforms = wordforms.txt
exceptions = exceptions.txt
stopwords = stopwords.txt
}
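The inheritance use case listed earlier can be sketched as follows. This is a minimal illustration, assuming the Plain mode; the table names `base_template` and `products`, the `path` value, and the field and attribute names are hypothetical and not taken from the original text.

```ini
# Hypothetical template holding the shared NLP settings
table base_template {
    type = template
    morphology = stem_en
    stopwords = stopwords.txt
}

# Hypothetical real-time table inheriting those settings;
# "child : parent" copies the parent's settings, and the child overrides type
table products : base_template {
    type = rt
    path = /var/lib/manticore/products
    rt_field = title
    rt_attr_uint = price
}
```

Only the child table stores data; the template itself still creates no files.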
NLP and tokenization
Data tokenization
Manticore doesn’t store text as is for full-text searching. Instead, it extracts words and builds several structures that allow fast full-text search. The words found form a dictionary, which allows a quick lookup of whether a word is present in the index. Other structures record the documents and fields in which each word was found, as well as its position inside the field. All of these are used when a full-text match is performed.
The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and search time, and it operates at the character and word levels.
On the character level, the engine allows only certain characters to pass; these are defined by charset_table. Anything else is replaced with whitespace (which is considered the default word separator). charset_table also allows mappings, for example lowercasing or simply replacing one character with another. Besides that, characters can be ignored, blended, or defined as a phrase boundary.
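As a rough sketch of the character-level settings, the fragment below would go inside a table definition. The specific character choices are illustrative assumptions, not recommendations from the original text.

```ini
# Accept digits, Latin letters and underscore; map uppercase to lowercase
charset_table = 0..9, A..Z->a..z, a..z, _
# Drop soft hyphens entirely instead of treating them as separators
ignore_chars = U+AD
# Treat these as both separators and valid characters
blend_chars = +, &, U+23
# Characters that mark a phrase boundary
phrase_boundary = ., ?, !, U+2026
```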
At the word level, the base setting is min_word_len, which defines the minimum word length in characters to be accepted in the index. A common request is to match singular with plural forms of words. For this, morphology processors can be used.
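A minimal sketch of these two word-level settings inside a table definition; the length threshold of 2 is an arbitrary value chosen for illustration.

```ini
# Skip single-character words when indexing
min_word_len = 2
# English stemmer, so that singular and plural forms reduce to the same stem
morphology = stem_en
```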
Going further, we might want a word to be matched as another one because they are synonyms. For this, the word forms feature can be used, which allows one or more words to be mapped to another one.
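For illustration, a word forms setup could look like the fragment below. The file name matches the one used in the template example; the mappings shown in the comments are hypothetical sample entries.

```ini
# Inside a table definition: point to the word forms file
wordforms = wordforms.txt

# Sample contents of wordforms.txt (one mapping per line):
#   walks > walk
#   walked > walk
#   core 2 duo > c2d
```

The last sample line shows several source words being mapped to a single destination form.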
Very common words can have unwanted effects on searching, mostly because their frequency means their doc/hit lists take a lot of computation to process. They can be blacklisted with the stop words functionality. This not only speeds up queries but also decreases index size.
A more advanced form of blacklisting is bigrams, which allows creating a special token between a ‘bigram’ (common) word and an uncommon word. This can speed up phrase searches several times over when they contain common words.
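A sketch of the stop words setting inside a table definition; the sample words in the comment are assumptions for illustration only.

```ini
# Plain text file listing the words to exclude from the index
stopwords = stopwords.txt

# Sample contents of stopwords.txt (whitespace-separated words):
#   the a an and of to
```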
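A possible bigram configuration is sketched below, assuming that the first_freq mode is the one closest to the behavior described above; the list of frequent words is illustrative.

```ini
# Index a word pair as a single token when its first word is a frequent one
bigram_index = first_freq
# Words treated as frequent for the purpose of bigram indexing
bigram_freq_words = the, a, you, i
```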
When indexing HTML content, it’s important not to index the HTML tags, as they can introduce a lot of ‘noise’ in the index. HTML stripping can be used, and it can be configured to strip the tags but still index certain tag attributes, or to completely ignore the content of certain HTML elements.
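The HTML options could be combined as in the following sketch, placed inside a table definition; the chosen attributes and elements are examples rather than required values.

```ini
# Strip HTML markup before indexing
html_strip = 1
# Still index the text of these attributes even though the tags are stripped
html_index_attrs = img=alt,title; a=title
# Ignore everything inside these elements, including their content
html_remove_elements = style, script
```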