ArangoSearch Analyzers
Analyzers parse input values and transform them into sets of sub-values, for example by breaking up text into words. If they are used in Views, then the documents’ attribute values of the linked collections are used as input and additional metadata is produced internally. The data can then be used for searching and sorting to provide the most appropriate match for the specified conditions, similar to queries to web search engines.
Analyzers can be used on their own to tokenize and normalize strings in AQL queries with the TOKENS() function.
How Analyzers process values depends on their type and configuration. The configuration consists of type-specific properties and a list of features. The features control the additional metadata to be generated to augment View indexes, for instance to be able to rank results.
Analyzers can be managed via an HTTP API and through a JavaScript module.
Value Handling
While most of the Analyzer functionality is geared towards text processing, there is no restriction to strings as input data type when using them through Views – your documents could have attributes of any data type after all.
Strings are processed according to the Analyzer, whereas other primitive data types (null, true, false, numbers) are added to the index unchanged.
The elements of arrays are unpacked, processed and indexed individually, regardless of the level of nesting. That is, strings are processed by the configured Analyzer(s) and other primitive values are indexed as-is.
Objects, including any nested objects, are indexed as sub-attributes. This applies to sub-objects as well as objects in arrays. Only primitive values are added to the index; arrays and objects cannot be searched for.
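The rules above can be pictured with a small plain-JavaScript sketch. It is only an illustration, not how ArangoSearch is implemented: `collectIndexedValues` is a hypothetical helper, the dotted path notation is an assumption made for readability, and the actual result also depends on the View link settings.

```javascript
// Hypothetical illustration only: lists the primitive values that would be
// considered for indexing, together with the attribute path they belong to.
function collectIndexedValues(value, path = "", out = []) {
  if (Array.isArray(value)) {
    // Array elements are unpacked and handled individually, at any nesting level.
    for (const element of value) {
      collectIndexedValues(element, path, out);
    }
  } else if (value !== null && typeof value === "object") {
    // Objects contribute their sub-attributes, never themselves.
    for (const [key, sub] of Object.entries(value)) {
      collectIndexedValues(sub, path ? `${path}.${key}` : key, out);
    }
  } else {
    // Strings would be passed to the Analyzer; null, booleans and numbers stay as-is.
    out.push({ path, value, analyzed: typeof value === "string" });
  }
  return out;
}

const doc = { title: "Fast DB", tags: ["a", ["b", 1]], meta: { ok: true, n: null } };
console.log(collectIndexedValues(doc));
```

Note that the array `tags` and the object `meta` never appear as values in the result, only the primitives they contain.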
Also see:
- SEARCH operation on how to query indexed values such as numbers and nested values
- ArangoSearch Views for details about how compound data types (arrays, objects) get indexed
Analyzer Names
Each Analyzer has a name for identification with the following naming conventions:
- The name must only consist of the letters a to z (both in lower and upper case), the numbers 0 to 9, underscore (_) and dash (-) symbols. This also means that any non-ASCII names are not allowed.
- It must always start with a letter.
- The maximum allowed length of a name is 254 bytes.
- Analyzer names are case-sensitive.
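As a rough sketch, the conventions above can be checked with a regular expression. `isValidAnalyzerName` is a hypothetical helper, not an ArangoDB function, and it assumes that the length limit applies to the name without a database prefix.

```javascript
// Hypothetical helper, not part of ArangoDB: checks a name against the
// naming conventions listed above.
function isValidAnalyzerName(name) {
  // Letters, digits, underscore and dash only; the first character must be a letter.
  const pattern = /^[A-Za-z][A-Za-z0-9_-]*$/;
  // The pattern only admits ASCII, so for matching names the string length
  // equals the length in bytes.
  return typeof name === "string" && name.length <= 254 && pattern.test(name);
}

console.log(isValidAnalyzerName("text_en_nostem")); // true
console.log(isValidAnalyzerName("9lives"));         // false (starts with a digit)
console.log(isValidAnalyzerName("gr\u00f6\u00dfe")); // false (non-ASCII characters)
```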
Custom Analyzers are stored per database, in a system collection _analyzers. The names get prefixed with the database name and two colons, e.g. myDB::customAnalyzer. This does not apply to the globally available built-in Analyzers, which are not stored in an _analyzers collection.
Custom Analyzers stored in the _system database can be referenced in queries against other databases by specifying the prefixed name, e.g. _system::customGlobalAnalyzer. Analyzers stored in databases other than _system cannot be accessed from within another database, however.
Analyzer Types
The currently implemented Analyzer types are:
- identity: treat value as atom (no transformation)
- delimiter: split into tokens at user-defined character
- stem: apply stemming to the value as a whole
- norm: apply normalization to the value as a whole
- ngram: create n-grams from value with user-defined lengths
- text: tokenize into words, optionally with stemming, normalization, stop-word filtering and edge n-gram generation
Available normalizations are case conversion and accent removal (conversion of characters with diacritical marks to the base characters).
Feature / Analyzer | Identity | N-gram | Delimiter | Stem | Norm | Text |
---|---|---|---|---|---|---|
Tokenization | No | No | (Yes) | No | No | Yes |
Stemming | No | No | No | Yes | No | Yes |
Normalization | No | No | No | No | Yes | Yes |
N-grams | No | Yes | No | No | No | (Yes) |
Analyzer Properties
The valid attributes/values for the properties are dependent on what type is used. For example, the delimiter type needs to know the desired delimiting character(s), whereas the text type takes a locale, stop-words and more.
Identity
An Analyzer applying the identity transformation, i.e. returning the input unmodified.
It does not support any properties and will ignore them.
Delimiter
An Analyzer capable of breaking up delimited text into tokens as per RFC 4180 (without starting new records on newlines).
The properties allowed for this Analyzer are an object with the following attributes:
- delimiter (string): the delimiting character(s)
Stem
An Analyzer capable of stemming the text, treated as a single token, for supported languages.
The properties allowed for this Analyzer are an object with the following attributes:
- locale (string): a locale in the format language[_COUNTRY][.encoding][@variant] (square brackets denote optional parts), e.g. "de.utf-8" or "en_US.utf-8". Only UTF-8 encoding is meaningful in ArangoDB.
Norm
An Analyzer capable of normalizing the text, treated as a single token, i.e. case conversion and accent removal.
The properties allowed for this Analyzer are an object with the following attributes:
- locale (string): a locale in the format language[_COUNTRY][.encoding][@variant] (square brackets denote optional parts), e.g. "de.utf-8" or "en_US.utf-8". Only UTF-8 encoding is meaningful in ArangoDB.
- accent (boolean, optional):
  - true to preserve accented characters (default)
  - false to convert accented characters to their base characters
- case (string, optional):
  - "lower" to convert to all lower-case characters
  - "upper" to convert to all upper-case characters
  - "none" to not change character case (default)
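To get a feeling for what the accent and case options do, the following plain-JavaScript sketch approximates them with the built-in Unicode normalization. It is an assumption-laden approximation: the helper `normalizeText` is hypothetical, it ignores the locale, and the actual Analyzer may treat individual characters differently.

```javascript
// Hypothetical approximation of the norm Analyzer options; not the actual
// implementation and not locale-aware.
function normalizeText(text, { accent = true, case: caseMode = "none" } = {}) {
  let result = text;
  if (!accent) {
    // Decompose characters (NFD), then drop the combining marks.
    result = result.normalize("NFD").replace(/\p{M}/gu, "");
  }
  if (caseMode === "lower") {
    result = result.toLowerCase();
  } else if (caseMode === "upper") {
    result = result.toUpperCase();
  }
  return result;
}

console.log(normalizeText("Cr\u00e8me Br\u00fbl\u00e9e", { accent: false, case: "lower" }));
// creme brulee
```

With the defaults (accent: true, case: "none") the input is returned unchanged, which mirrors the defaults listed above.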
N-gram
An Analyzer capable of producing n-grams from a specified input in a range of min..max (inclusive). Can optionally preserve the original input.
This Analyzer type can be used to implement substring matching. Note that it slices the input based on bytes and not characters by default (streamType). The “binary” mode supports single-byte characters only; multi-byte UTF-8 characters raise an Invalid UTF-8 sequence query error.
The properties allowed for this Analyzer are an object with the following attributes:
- min (number): unsigned integer for the minimum n-gram length
- max (number): unsigned integer for the maximum n-gram length
- preserveOriginal (boolean):
  - true to include the original value as well
  - false to produce the n-grams based on min and max only
- startMarker (string, optional): this value will be prepended to n-grams which include the beginning of the input. Can be used for matching prefixes. Choose a character or sequence as marker which does not occur in the input.
- endMarker (string, optional): this value will be appended to n-grams which include the end of the input. Can be used for matching suffixes. Choose a character or sequence as marker which does not occur in the input.
- streamType (string, optional): type of the input stream
  - "binary": one byte is considered as one character (default)
  - "utf8": one Unicode codepoint is treated as one character
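The difference between the two streamType values can be illustrated outside of ArangoDB. The following sketch assumes a Node.js environment (for `Buffer`); the helper names are hypothetical. It shows why a multi-byte character cannot be sliced safely when every byte is treated as one character.

```javascript
// Hypothetical helpers for illustration; they are not part of ArangoDB.
function byteUnits(input) {
  // "binary": every byte of the UTF-8 encoding counts as one unit.
  return Array.from(Buffer.from(input, "utf8"));
}

function codePointUnits(input) {
  // "utf8": every Unicode code point counts as one unit.
  return Array.from(input);
}

const word = "f\u00fcr"; // "für", where "ü" takes two bytes in UTF-8
console.log(byteUnits(word).length);      // 4
console.log(codePointUnits(word).length); // 3
```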
Examples
With min = 4 and max = 5, the Analyzer will produce the following n-grams for the input string "foobar":
- "foob"
- "fooba"
- "foobar" (if preserveOriginal is enabled)
- "ooba"
- "oobar"
- "obar"
An input string "foo" will not produce any n-gram unless preserveOriginal is enabled, because it is shorter than the min length of 4.
The above example but with startMarker = "^" and endMarker = "$" would produce the following:
- "^foob"
- "^fooba"
- "^foobar" (if preserveOriginal is enabled)
- "foobar$" (if preserveOriginal is enabled)
- "ooba"
- "oobar$"
- "obar$"
Text
An Analyzer capable of breaking up strings into individual words while also optionally filtering out stop-words, extracting word stems, applying case conversion and accent removal.
Stemming support is provided by Snowball.
The properties allowed for this Analyzer are an object with the following attributes:
- locale (string): a locale in the format language[_COUNTRY][.encoding][@variant] (square brackets denote optional parts), e.g. "de.utf-8" or "en_US.utf-8". Only UTF-8 encoding is meaningful in ArangoDB.
- accent (boolean, optional):
  - true to preserve accented characters
  - false to convert accented characters to their base characters (default)
- case (string, optional):
  - "lower" to convert to all lower-case characters (default)
  - "upper" to convert to all upper-case characters
  - "none" to not change character case
- stemming (boolean, optional):
  - true to apply stemming on returned words (default)
  - false to leave the tokenized words as-is
- edgeNgram (object, optional): if present, then edge n-grams are generated for each token (word). That is, the start of the n-gram is anchored to the beginning of the token, whereas the ngram Analyzer would produce all possible substrings from a single input token (within the defined length restrictions). Edge n-grams can be used to cover word-based auto-completion queries with an index, for which you should set the following other options: accent: false, case: "lower" and most importantly stemming: false.
  - min (number, optional): minimal n-gram length
  - max (number, optional): maximal n-gram length
  - preserveOriginal (boolean, optional): whether to include the original token even if its length is less than min or greater than max
- stopwords (array, optional): an array of strings with words to omit from result. Default: load words from stopwordsPath. To disable stop-word filtering provide an empty array []. If both stopwords and stopwordsPath are provided then both word sources are combined.
- stopwordsPath (string, optional): path with a language sub-directory (e.g. en for a locale en_US.utf-8) containing files with words to omit. Each word has to be on a separate line. Everything after the first whitespace character on a line will be ignored and can be used for comments. The files can be named arbitrarily and have any file extension (or none).
Default: if no path is provided then the value of the environment variable IRESEARCH_TEXT_STOPWORD_PATH is used to determine the path, or if it is undefined then the current working directory is assumed. If the stopwords attribute is provided then no stop-words are loaded from files, unless an explicit stopwordsPath is also provided.
Note that if the stopwordsPath cannot be accessed, is missing language sub-directories or has no files for a language required by an Analyzer, then the creation of a new Analyzer is refused. If such an issue is discovered for an existing Analyzer during startup, then the server will abort with a fatal error.
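The stop-word file format described above (one word per line, everything after the first whitespace character ignored) can be sketched as follows. `parseStopwordFile` is a hypothetical helper for illustration, not the loader used by the server, and skipping lines without a word is an assumption.

```javascript
// Hypothetical parser for the content of a single stop-word file.
function parseStopwordFile(content) {
  const words = [];
  for (const line of content.split(/\r?\n/)) {
    // Keep only the text before the first whitespace character;
    // the remainder of the line can hold a comment.
    const word = line.split(/\s/, 1)[0];
    // Assumption: lines that yield no word are skipped.
    if (word) words.push(word);
  }
  return words;
}

const fileContent = "the definite article\nand\n\nof\tpreposition\n";
console.log(parseStopwordFile(fileContent)); // [ 'the', 'and', 'of' ]
```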
Examples
The built-in text_en Analyzer has stemming enabled (note the word endings):

    arangosh> db._query(`RETURN TOKENS("Crazy fast NoSQL-database!", "text_en")`)
    [
      [
        "crazi",
        "fast",
        "nosql",
        "databas"
      ]
    ]
    [object ArangoQueryCursor, count: 1, cached: false, hasMore: false]

You may create a custom Analyzer with the same configuration but with stemming disabled like this:

    arangosh> var analyzers = require("@arangodb/analyzers")
    arangosh> analyzers.save("text_en_nostem", "text", {
    ........>   locale: "en.utf-8",
    ........>   case: "lower",
    ........>   accent: false,
    ........>   stemming: false,
    ........>   stopwords: []
    ........> }, ["frequency","norm","position"])
    arangosh> db._query(`RETURN TOKENS("Crazy fast NoSQL-database!", "text_en_nostem")`)
    {
      "name" : "_system::text_en_nostem",
      "type" : "text",
      "properties" : {
        "locale" : "en.utf-8",
        "case" : "lower",
        "stopwords" : [ ],
        "accent" : false,
        "stemming" : false
      },
      "features" : [
        "position",
        "norm",
        "frequency"
      ]
    }
    [
      [
        "crazy",
        "fast",
        "nosql",
        "database"
      ]
    ]
    [object ArangoQueryCursor, count: 1, cached: false, hasMore: false]

Custom text Analyzer with the edge n-grams feature and normalization enabled, stemming disabled and "the" defined as stop-word to exclude it:

    arangosh> analyzers.save("text_edge_ngrams", "text", {
    ........>   edgeNgram: { min: 3, max: 8, preserveOriginal: true },
    ........>   locale: "en.utf-8",
    ........>   case: "lower",
    ........>   accent: false,
    ........>   stemming: false,
    ........>   stopwords: [ "the" ]
    ........> }, ["frequency","norm","position"])
    arangosh> db._query(`RETURN TOKENS("The quick brown fox jumps over the dogWithAVeryLongName", "text_edge_ngrams")`)
    {
      "name" : "_system::text_edge_ngrams",
      "type" : "text",
      "properties" : {
        "locale" : "en.utf-8",
        "case" : "lower",
        "stopwords" : [
          "the"
        ],
        "accent" : false,
        "stemming" : false,
        "edgeNgram" : {
          "min" : 3,
          "max" : 8,
          "preserveOriginal" : true
        }
      },
      "features" : [
        "position",
        "norm",
        "frequency"
      ]
    }
    [
      [
        "qui",
        "quic",
        "quick",
        "bro",
        "brow",
        "brown",
        "fox",
        "jum",
        "jump",
        "jumps",
        "ove",
        "over",
        "dog",
        "dogw",
        "dogwi",
        "dogwit",
        "dogwith",
        "dogwitha",
        "dogwithaverylongname"
      ]
    ]
    [object ArangoQueryCursor, count: 1, cached: false, hasMore: false]

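The edge n-gram output of the last example can be reasoned about with a small plain-JavaScript sketch. `edgeNgrams` is a hypothetical helper, not the server's implementation, and it operates on tokens that are assumed to be already lower-cased and stop-word filtered.

```javascript
// Hypothetical sketch: edge n-grams are anchored to the beginning of a token.
function edgeNgrams(token, { min, max, preserveOriginal = false }) {
  const out = [];
  for (let len = min; len <= Math.min(max, token.length); len++) {
    out.push(token.slice(0, len));
  }
  // Add the original token if it is shorter than min or longer than max.
  if (preserveOriginal && (token.length < min || token.length > max)) {
    out.push(token);
  }
  return out;
}

const options = { min: 3, max: 8, preserveOriginal: true };
console.log(edgeNgrams("quick", options)); // [ 'qui', 'quic', 'quick' ]
console.log(edgeNgrams("fox", options));   // [ 'fox' ]
console.log(edgeNgrams("dogwithaverylongname", options));
// [ 'dog', 'dogw', 'dogwi', 'dogwit', 'dogwith', 'dogwitha', 'dogwithaverylongname' ]
```

Applied to the tokens quick, brown, fox, jumps, over and dogwithaverylongname in order, the sketch yields the same 19 entries as the query result shown above.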
Analyzer Features
The features of an Analyzer determine what term matching capabilities will be available and as such are only applicable in the context of ArangoSearch Views.
The valid values for the features are dependent on both the capabilities of the underlying type and the query filtering and sorting functions that the result can be used with. For example, the text type will produce frequency + norm + position, and the PHRASE() AQL function requires frequency + position to be available.
Currently the following features are supported:
- frequency: how often a term is seen, required for PHRASE()
- norm: the field normalization factor
- position: sequentially increasing term position, required for PHRASE(). If present then the frequency feature is also required
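The dependencies between the features can be expressed as a small check. `checkFeatures` is a hypothetical helper written for illustration; ArangoDB performs its own validation, whose messages and behavior may differ.

```javascript
// Hypothetical validation of a features list, following the rules above.
const SUPPORTED_FEATURES = ["frequency", "norm", "position"];

function checkFeatures(features) {
  const unknown = features.filter((f) => !SUPPORTED_FEATURES.includes(f));
  if (unknown.length > 0) {
    return { valid: false, reason: `unsupported feature(s): ${unknown.join(", ")}` };
  }
  // position may only be used together with frequency.
  if (features.includes("position") && !features.includes("frequency")) {
    return { valid: false, reason: "position requires frequency" };
  }
  // PHRASE() needs both frequency and position.
  const supportsPhrase = features.includes("frequency") && features.includes("position");
  return { valid: true, supportsPhrase };
}

console.log(checkFeatures(["frequency", "norm", "position"])); // valid, usable with PHRASE()
console.log(checkFeatures(["frequency", "norm"]));             // valid, but not usable with PHRASE()
console.log(checkFeatures(["position"]));                      // invalid: position requires frequency
```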
Built-in Analyzers
There is a set of built-in Analyzers which are available by default for convenience and backward compatibility. They cannot be removed.
The identity Analyzer has the features frequency and norm. The Analyzers of type text all tokenize strings with stemming enabled, no stopwords configured, case conversion set to lower, accent removal turned on and the features frequency, norm and position:
Name | Type | Language |
---|---|---|
identity | identity | none |
text_de | text | German |
text_en | text | English |
text_es | text | Spanish |
text_fi | text | Finnish |
text_fr | text | French |
text_it | text | Italian |
text_nl | text | Dutch |
text_no | text | Norwegian |
text_pt | text | Portuguese |
text_ru | text | Russian |
text_sv | text | Swedish |
text_zh | text | Chinese |