Index basics
Indexes allow fast access to documents, provided the indexed attribute(s)are used in a query. While ArangoDB automatically indexes some systemattributes, users are free to create extra indexes on non-system attributesof documents.
User-defined indexes can be created on collection level. Most user-defined indexes can be created by specifying the names of the index attributes.Some index types allow indexing just one attribute (e.g. fulltext index) whereas other index types allow indexing multiple attributes at the same time.
Learn how to use different indexes efficiently by going through theArangoDB Performance Course.
The system attributes _id
, _key
, _from
and _to
are automatically indexedby ArangoDB, without the user being required to create extra indexes for them._id
and _key
are covered by a collection’s primary key, and _from
and _to
are covered by an edge collection’s edge index automatically.
Using the system attribute _id
in user-defined indexes is not possible, but indexing _key
, _rev
, _from
, and _to
is.
Creating new indexes is by default done under an exclusive collection lock. The collection is notavailable while the index is being created. This “foreground” index creation can be undesirable, if you have to perform it on a live system without a dedicated maintenance window.
For potentially long running index creation operations the RocksDB storage-engine also supports creating indexes in “background”. The collection remains (mostly) available during the index creation, see the section Creating Indexes in Background for more information.
ArangoDB provides the following index types:
Primary Index
For each collection there will always be a primary index which is a hash index for the document keys (_key
attribute)of all documents in the collection. The primary index allows quick selectionof documents in the collection using either the _key
or _id
attributes. It willbe used from within AQL queries automatically when performing equality lookups on_key
or _id
.
There are also dedicated functions to find a document given its _key
or _id
that will always make use of the primary index:
db.collection.document("<document-key>");
db._document("<document-id>");
As the primary index is an unsorted hash index, it cannot be used for non-equality range queries or for sorting.
The primary index of a collection cannot be dropped or changed, and there is nomechanism to create user-defined primary indexes.
Edge Index
Every edge collection also has an automatically created edge index. The edge index provides quick access todocuments by either their _from
or _to
attributes. It can therefore beused to quickly find connections between vertex documents and is invoked when the connecting edges of a vertex are queried.
Edge indexes are used from within AQL when performing equality lookups on _from
or _to
values in an edge collections. There are also dedicated functions to find edges given their _from
or _to
values that will always make use of the edge index:
db.collection.edges("<from-value>");
db.collection.edges("<to-value>");
db.collection.outEdges("<from-value>");
db.collection.outEdges("<to-value>");
db.collection.inEdges("<from-value>");
db.collection.inEdges("<to-value>");
Internally, the edge index is implemented as a hash index, which stores the unionof all _from
and _to
attributes. It can be used for equality lookups, but not for range queries or for sorting. Edge indexes are automatically created for edge collections. It is not possible to create user-defined edge indexes.However, it is possible to freely use the _from
and _to
attributes in user-definedindexes.
An edge index cannot be dropped or changed.
Hash Index
The hash index type is deprecated for the RocksDB storage engine.It is the same as the persistent type when using RocksDB. The type _hash_is still allowed for backward compatibility in the APIs, but the web interfacedoes not offer this type anymore.
A hash index can be used to quickly find documents with specific attribute values.The hash index is unsorted, so it supports equality lookups but no range queries or sorting.
A hash index can be created on one or multiple document attributes. A hash index will only be used by a query if all index attributes are present in the search condition,and if all attributes are compared using the equality (==
) operator. Hash indexes are used from within AQL and several query functions, e.g. byExample
, firstExample
etc.
Hash indexes can optionally be declared unique, then disallowing saving the samevalue(s) in the indexed attribute(s). Hash indexes can optionally be sparse.
The different types of hash indexes have the following characteristics:
- unique hash index: all documents in the collection must have different values for the attributes covered by the unique index. Trying to insert a document with the same key value as an already existing document will lead to a unique constraint violation.
This type of index is not sparse. Documents that do not contain the index attributes or that have a value of null
in the index attribute(s) will still be indexed. A key value of null
may only occur once in the index, so this type of index cannot be used for optional attributes.
The unique option can also be used to ensure thatno duplicate edges arecreated, by adding a combined index for the fields _from
and _to
to an edge collection.
unique, sparse hash index: all documents in the collection must have different values for the attributes covered by the unique index. Documents in which at least oneof the index attributes is not set or has a value of
null
are not included in the index. This type of index can be used to ensure that there are no duplicate keys inthe collection for documents which have the indexed attributes set. As the index willexclude documents for which the indexed attributes arenull
or not set, it can beused for optional attributes.non-unique hash index: all documents in the collection will be indexed. This typeof index is not sparse. Documents that do not contain the index attributes or that have a value of
null
in the index attribute(s) will still be indexed. Duplicate key values can occur and do not lead to unique constraint violations.non-unique, sparse hash index: only those documents will be indexed that have allthe indexed attributes set to a value other than
null
. It can be used for optionalattributes.
The amortized complexity of lookup, insert, update, and removal operations in unique hash indexes is O(1).
Non-unique hash indexes have an amortized complexity of O(1) for insert, update, andremoval operations. That means non-unique hash indexes can be used on attributes with low cardinality.
If a hash index is created on an attribute that is missing in all or many of the documents,the behavior is as follows:
if the index is sparse, the documents missing the attribute will not be indexed and notuse index memory. These documents will not influence the update or removal performancefor the index.
if the index is non-sparse, the documents missing the attribute will be contained in theindex with a key value of
null
.
Hash indexes support indexing array values if the indexattribute name is extended with a [*]
.
Skiplist Index
The skiplist index type is deprecated for the RocksDB storage engine.It is the same as the persistent type when using RocksDB. The type _skiplist_is still allowed for backward compatibility in the APIs, but the web interfacedoes not offer this type anymore.
A skiplist is a sorted index structure. It can be used to quickly find documents with specific attribute values, for range queries and for returning documents fromthe index in sorted order. Skiplists will be used from within AQL and several query functions, e.g. byExample
, firstExample
etc.
Skiplist indexes will be used for lookups, range queries and sorting only if either allindex attributes are provided in a query, or if a leftmost prefix of the index attributesis specified.
For example, if a skiplist index is created on attributes value1
and value2
, the following filter conditions can use the index (note: the <=
and >=
operators are intentionally omitted here for the sake of brevity):
FILTER doc.value1 == ...
FILTER doc.value1 < ...
FILTER doc.value1 > ...
FILTER doc.value1 > ... && doc.value1 < ...
FILTER doc.value1 == ... && doc.value2 == ...
FILTER doc.value1 == ... && doc.value2 > ...
FILTER doc.value1 == ... && doc.value2 > ... && doc.value2 < ...
In order to use a skiplist index for sorting, the index attributes must be specified inthe SORT
clause of the query in the same order as they appear in the index definition.Skiplist indexes are always created in ascending order, but they can be used to accessthe indexed elements in both ascending or descending order. However, for a combined index(an index on multiple attributes) this requires that the sort orders in a single queryas specified in the SORT
clause must be either all ascending (optionally omitted as ascending is the default) or all descending.
For example, if the skiplist index is created on attributes value1
and value2
(in this order), then the following sorts clauses can use the index for sorting:
SORT value1 ASC, value2 ASC
(and its equivalentSORT value1, value2
)SORT value1 DESC, value2 DESC
SORT value1 ASC
(and its equivalentSORT value1
)SORT value1 DESC
The following sort clauses cannot make use of the index order, and require an extrasort step:
SORT value1 ASC, value2 DESC
SORT value1 DESC, value2 ASC
SORT value2
(and its equivalentSORT value2 ASC
)SORT value2 DESC
(because first indexed attributevalue1
is not used in sort clause)
Note: the latter two sort clauses cannot use the index because the sort clause does notrefer to a leftmost prefix of the index attributes.
Skiplists can optionally be declared unique, disallowing saving the same value in the indexed attribute. They can be sparse or non-sparse.
The different types of skiplist indexes have the following characteristics:
- unique skiplist index: all documents in the collection must have different values for the attributes covered by the unique index. Trying to insert a document with the same key value as an already existing document will lead to a unique constraint violation.
This type of index is not sparse. Documents that do not contain the index attributes or that have a value of null
in the index attribute(s) will still be indexed. A key value of null
may only occur once in the index, so this type of index cannot be used for optional attributes.
unique, sparse skiplist index: all documents in the collection must have different values for the attributes covered by the unique index. Documents in which at least oneof the index attributes is not set or has a value of
null
are not included in the index. This type of index can be used to ensure that there are no duplicate keys inthe collection for documents which have the indexed attributes set. As the index willexclude documents for which the indexed attributes arenull
or not set, it can beused for optional attributes.non-unique skiplist index: all documents in the collection will be indexed. This typeof index is not sparse. Documents that do not contain the index attributes or that have a value of
null
in the index attribute(s) will still be indexed. Duplicate key values can occur and do not lead to unique constraint violations.non-unique, sparse skiplist index: only those documents will be indexed that have allthe indexed attributes set to a value other than
null
. It can be used for optionalattributes.
The operational amortized complexity for skiplist indexes is logarithmically correlatedwith the number of documents in the index.
Skiplist indexes support indexing array values if the indexattribute name is extended with a [*]
`.
Persistent Index
The persistent index type is deprecated for the MMFiles storage engine.Use the RocksDB storage engine instead, where all indexes are persistent.
The index types hash, skiplist and persistent are equivalent when usingthe RocksDB storage engine. The types hash and skiplist are still allowedfor backward compatibility in the APIs, but the web interface does not offerthese types anymore.
The persistent index is a sorted index with persistence. The index entries are written todisk when documents are stored or updated. That means the index entries do not need to berebuilt from the collection data when the server is restarted or the indexed collectionis initially loaded. Thus using persistent indexes may reduce collection loading times.
The persistent index type can be used for secondary indexes at the moment. That means thepersistent index currently cannot be made the only index for a collection, because therewill always be the in-memory primary index for the collection in addition, and potentiallymore indexes (such as the edges index for an edge collection).
The index implementation is using the RocksDB engine, and it provides logarithmic complexityfor insert, update, and remove operations. As the persistent index is not an in-memoryindex, it does not store pointers into the primary index as all the in-memory indexes do,but instead it stores a document’s primary key. To retrieve a document via a persistentindex via an index value lookup, there will therefore be an additional O(1) lookup into the primary index to fetch the actual document.
As the persistent index is sorted, it can be used for point lookups, range queries and sortingoperations, but only if either all index attributes are provided in a query, or if a leftmost prefix of the index attributes is specified.
TTL (time-to-live) Index
The TTL index type provided by ArangoDB can be used for automatically removing expired documents from a collection.
A TTL index is set up by setting an expireAfter
value and by picking a single document attribute which contains the documents’ creation date and time. Documents are expired after expireAfter
seconds after their creation time. The creation timeis specified as either a numeric timestamp (Unix timestamp) or a date string in formatYYYY-MM-DDTHH:MM:SS
with optional milliseconds. All date strings will be interpretedas UTC dates.
For example, if expireAfter
is set to 600 seconds (10 minutes) and the indexattribute is “creationDate” and there is the following document:
{ "creationDate" : 1550165973 }
This document will be indexed with a creation date time value of 1550165973
,which translates to the human-readable date 2019-02-14T17:39:33.000Z
. The documentwill expire 600 seconds afterwards, which is at timestamp 1550166573
(or2019-02-14T17:49:33.000Z
in the human-readable version).
The actual removal of expired documents will not necessarily happen immediately. Expired documents will eventually removed by a background thread that is periodicallygoing through all TTL indexes and removing the expired documents. The frequency forinvoking this background thread can be configured using the —ttl.frequency
startup option.
There is no guarantee when exactly the removal of expired documents will be carriedout, so queries may still find and return documents that have already expired. Thesewill eventually be removed when the background thread kicks in and has capacity toremove the expired documents. It is guaranteed however that only documents which are past their expiration time will actually be removed.
Please note that the numeric date time values for the index attribute should be specified in milliseconds since January 1st 1970 (Unix timestamp). To calculate the current timestamp from JavaScript in this format, there is Date.now() / 1000
, to calculate it from an arbitrary Date instance, there is Date.getTime() / 1000
.
Alternatively, the index attribute values can be specified as a date string in formatYYYY-MM-DDTHH:MM:SS
with optional milliseconds. All date strings will be interpreted as UTC dates.
The above example document using a datestring attribute value would be
{ "creationDate" : "2019-02-14T17:39:33.000Z" }
In case the index attribute does not contain a numeric value nor a proper date string,the document will not be stored in the TTL index and thus will not become a candidate for expiration and removal. Providing either a non-numeric value or even no value for the index attribute is a supported way of keeping documents from being expired and removed.
Geo Index
Users can create additional geo indexes on one or multiple attributes in collections. A geo index is used to find places on the surface of the earth fast.
The geo index stores two-dimensional coordinates. It can be created on either two separate document attributes (latitude and longitude) or a single array attribute thatcontains both latitude and longitude. Latitude and longitude must be numeric values.
The geo index provides operations to find documents with coordinates nearest to a given comparison coordinate, and to find documents with coordinates that are within a specifiableradius around a comparison coordinate.
The geo index is used via dedicated functions in AQL, the simple queriesfunctions and it is implicitly applied when in AQL a SORT or FILTER is used withthe distance function. Otherwise it will not be used for other types of queriesor conditions.
Fulltext Index
A fulltext index can be used to find words, or prefixes of words inside documents. A fulltext index can be created on a single attribute only, and will index all words contained in documents that have a textual value in that attribute. Only words with a (specifiable) minimum length are indexed. Word tokenization is done using the word boundary analysis provided by libicu, which is taking into account the selected language provided at server start. Words are indexed in their lower-cased form. The index supports complete match queries (full words) and prefix queries, plus basic logical operations such as and
, or
and not
for combining partial results.
The fulltext index is sparse, meaning it will only index documents for which the indexattribute is set and contains a string value. Additionally, only words with a configurableminimum length will be included in the index.
The fulltext index is used via dedicated functions in AQL or the simple queries, but willnot be enabled for other types of queries or conditions.
For advanced full-text search capabilities consider ArangoSearch.
Indexing attributes and sub-attributes
Top-level as well as nested attributes can be indexed. For attributes at the top level,the attribute names alone are required. To index a single field, pass an array with asingle element (string of the attribute key) to the fields parameter of theensureIndex() method. To create acombined index over multiple fields, simply add more members to the fields array:
// { name: "Smith", age: 35 }
db.posts.ensureIndex({ type: "hash", fields: [ "name" ] })
db.posts.ensureIndex({ type: "hash", fields: [ "name", "age" ] })
To index sub-attributes, specify the attribute path using the dot notation:
// { name: {last: "Smith", first: "John" } }
db.posts.ensureIndex({ type: "hash", fields: [ "name.last" ] })
db.posts.ensureIndex({ type: "hash", fields: [ "name.last", "name.first" ] })
Indexing array values
If an index attribute contains an array, ArangoDB will store the entire array as the index valueby default. Accessing individual members of the array via the index is not possible thisway.
To make an index insert the individual array members into the index instead of the entire arrayvalue, a special array index needs to be created for the attribute. Array indexes can be set up like regular hash or skiplist indexes using the collection.ensureIndex()
function. To make a hash or skiplist index an array index, the index attribute name needs to be extended with [*]
when creating the index and when filtering in an AQL query using the IN
operator.
The following example creates an array hash index on the tags
attribute in a collection namedposts
:
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
db.posts.insert({ tags: [ "foobar", "baz", "quux" ] });
This array index can then be used for looking up individual tags
values from AQL queries via the IN
operator:
FOR doc IN posts
FILTER 'foobar' IN doc.tags
RETURN doc
It is possible to add the array expansion operator[*]
, but it is not mandatory. You may use it to indicate that an array index is used,it is purely cosmetic however:
FOR doc IN posts
FILTER 'foobar' IN doc.tags[*]
RETURN doc
The following FILTER conditions will not use the array index:
FILTER doc.tags ANY == 'foobar'
FILTER doc.tags ANY IN 'foobar'
FILTER doc.tags IN 'foobar'
FILTER doc.tags == 'foobar'
FILTER 'foobar' == doc.tags
It is also possible to create an index on subattributes of array values. This makes senseif the index attribute is an array of objects, e.g.
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*].name" ] });
db.posts.insert({ tags: [ { name: "foobar" }, { name: "baz" }, { name: "quux" } ] });
The following query will then use the array index (this does require thearray expansion operator):
FOR doc IN posts
FILTER 'foobar' IN doc.tags[*].name
RETURN doc
If you store a document having the array which does contain elements not havingthe subattributes this document will also be indexed with the value null
, whichin ArangoDB is equal to attribute not existing.
ArangoDB supports creating array indexes with a single []
operator per index attribute. For example, creating an index as follows is *not supported:
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*].name[*].value" ] });
Array values will automatically be de-duplicated before being inserted into an array index.For example, if the following document is inserted into the collection, the duplicate arrayvalue bar
will be inserted only once:
db.posts.insert({ tags: [ "foobar", "bar", "bar" ] });
This is done to avoid redundant storage of the same index value for the same document, whichwould not provide any benefit.
If an array index is declared unique, the de-duplication of array values will happen before inserting the values into the index, so the above insert operation with two identical valuesbar
will not necessarily fail
It will always fail if the index already contains an instance of the bar
value. However, ifthe value bar
is not already present in the index, then the de-duplication of the array values willeffectively lead to bar
being inserted only once.
To turn off the deduplication of array values, it is possible to set the deduplicate attributeon the array index to false
. The default value for deduplicate is true
however, so de-duplication will take place if not explicitly turned off.
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ], deduplicate: false });
// will fail now
db.posts.insert({ tags: [ "foobar", "bar", "bar" ] });
If an array index is declared and you store documents that do not have an array at the specified attributethis document will not be inserted in the index. Hence the following objects will not be indexed:
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
db.posts.insert({ something: "else" });
db.posts.insert({ tags: null });
db.posts.insert({ tags: "this is no array" });
db.posts.insert({ tags: { content: [1, 2, 3] } });
An array index is able to index explicit null
values. When queried for null
values, it will only return those documents having explicitly null
stored in the array, it will not return any documents that do not have the array at all.
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]" ] });
db.posts.insert({tags: null}) // Will not be indexed
db.posts.insert({tags: []}) // Will not be indexed
db.posts.insert({tags: [null]}); // Will be indexed for null
db.posts.insert({tags: [null, 1, 2]}); // Will be indexed for null, 1 and 2
Declaring an array index as sparse does not have an effect on the array part of the index,this in particular means that explicit null
values are also indexed in the sparse version.If an index is combined from an array and a normal attribute the sparsity will apply for the attribute e.g.:
db.posts.ensureIndex({ type: "hash", fields: [ "tags[*]", "name" ], sparse: true });
db.posts.insert({tags: null, name: "alice"}) // Will not be indexed
db.posts.insert({tags: [], name: "alice"}) // Will not be indexed
db.posts.insert({tags: [1, 2, 3]}) // Will not be indexed
db.posts.insert({tags: [1, 2, 3], name: null}) // Will not be indexed
db.posts.insert({tags: [1, 2, 3], name: "alice"})
// Will be indexed for [1, "alice"], [2, "alice"], [3, "alice"]
db.posts.insert({tags: [null], name: "bob"})
// Will be indexed for [null, "bob"]
Please note that filtering using array indexes only works from within AQL queries andonly if the query filters on the indexed attribute using the IN
operator. The othercomparison operators (==
, !=
, >
, >=
, <
, <=
, ANY
, ALL
, NONE
) currentlycannot use array indexes.
Vertex centric indexes
As mentioned above, the most important indexes for graphs are the edgeindexes, indexing the _from
and _to
attributes of edge collections.They provide very quick access to all edges originating in or arrivingat a given vertex, which allows to quickly find all neighbors of a vertexin a graph.
In many cases one would like to run more specific queries, for examplefinding amongst the edges originating from a given vertex only thosewith a timestamp greater than or equal to some date and time. Exactly thisis achieved with “vertex centric indexes”. In a sense these are localizedindexes for an edge collection, which sit at every single vertex.
Technically, they are implemented in ArangoDB as indexes, which sort the complete edge collection first by from
and then by other attributesfor _OUTBOUND traversals, or first by to
and then by other attributesfor _INBOUND traversals. For traversals in ANY direction two indexesare needed, one with _from
and the other with _to
as first indexed field.
If we for example have a skiplist index on the attributes _from
and timestamp
of an edge collection, we can answer the above questionvery quickly with a single range lookup in the index.
Since ArangoDB 3.0 one can create sorted indexes (type “skiplist” and “persistent”) that index the special edge attributes _from
or _to
and additionally other attributes. Since ArangoDB 3.1, these are usedin graph traversals, when appropriate FILTER
statements are foundby the optimizer.
For example, to create a vertex centric index of the above type, you would simply do
db.edges.ensureIndex({"type":"skiplist", "fields": ["_from", "timestamp"]});
in arangosh. Then, queries like
FOR v, e, p IN 1..1 OUTBOUND "V/1" edges
FILTER e.timestamp >= "2018-07-09"
RETURN p
will be considerably faster in case there are many edges originatingfrom vertex "V/1"
but only few with a recent time stamp. Note that theoptimizer may prefer the default edge index over vertex centric indexesbased on the costs it estimates, even if a vertex centric index mightin fact be faster. Vertex centric indexes are more likely to be chosenfor highly connected graphs and with RocksDB storage engine.
Creating Indexes in Background
Introduced in: v3.5.0
Background indexing is available for the RocksDB storage engine only.
Creating new indexes is by default done under an exclusive collection lock. This meansthat the collection (or the respective shards) are not available for write operationsas long as the index is created. This “foreground” index creation can be undesirable, if you have to perform it on a live system without a dedicated maintenance window.
Indexes can also be created in “background”, not using an exclusive lock during the entire index creation. The collection remains basically available, so that other CRUD operations can run on the collection while the index is being created.This can be achieved by setting the inBackground attribute when creating an index.
To create an index in the background in arangosh just specify inBackground: true
, like in the following examples:
// create the hash index in the background
db.collection.ensureIndex({ type: "hash", fields: [ "value" ], unique: false, inBackground: true });
db.collection.ensureIndex({ type: "hash", fields: [ "email" ], unique: true, inBackground: true });
// skiplist indexes work also of course
db.collection.ensureIndex({ type :"skiplist", fields: ["abc", "cdef"], unique: true, inBackground: true });
db.collection.ensureIndex({ type :"skiplist", fields: ["abc", "cdef"], sparse: true, inBackground: true });
// also supported on fulltext indexes
db.collection.ensureIndex({ type: "geo", fields: [ "latitude", "longitude"], inBackground: true });
db.collection.ensureIndex({ type: "geo", fields: [ "latitude", "longitude"], inBackground: true });
db.collection.ensureIndex({ type: "fulltext", fields: [ "text" ], minLength: 4, inBackground: true })
Behavior
Indexes that are still in the build process will not be visible via the ArangoDB APIs. Nevertheless it is not possible to create the same index twice via the ensureIndex API while an index is still begin created. AQL queries also will not use these indexes untilthe index reports back as fully created. Note that the initial ensureIndex call or HTTP request will still block until the index is completely ready. Existing single-threaded client programs can thus safely set the inBackground option to true and continue to work as before.
Should you be building an index in the background you cannot rename or drop the collection.These operations will block until the index creation is finished. This is equally the casewith foreground indexing.
After an interrupted index build (i.e. due to a server crash) the partially built indexwill the removed. In the ArangoDB cluster the index might then be automatically recreated on affected shards.
Performance
Background index creation might be slower than the “foreground” index creation and require more RAM. Under a write heavy load (specifically many remove, update or replace operations), the background index creation needs to keep a list of removed documents in RAM. This might become unsustainable if this list grows to tens of millions of entries.
Building an index is always a write heavy operation (internally), it is always a good idea to build indexesduring times with less load.