Data Modeling and Operational Factors

Designing the data model of your application is a crucial task that can make or break the performance of your application. A well-designed data model will allow you to write efficient AQL queries, increase the throughput of CRUD operations and will make sure your data is distributed in the most effective way.

Whether you design a new application with ArangoDB or port an existing one to use ArangoDB, you should always analyze the (expected) data access patterns of your application in conjunction with several factors:

Operation Atomicity

All insert / update / replace / remove operations in ArangoDB are atomic on a single document. Using a single instance of ArangoDB, multi-document / multi-collection queries are guaranteed to be fully ACID; in cluster mode, however, only single-document operations are fully ACID. This has implications if you try to ensure consistency across multiple operations.
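
On a single instance you can therefore group multiple document operations into one server-side transaction. A minimal arangosh sketch, assuming a hypothetical accounts collection with documents alice and bob (in cluster mode these multi-document guarantees do not hold, as described above):

  db._executeTransaction({
    collections: { write: [ "accounts" ] },
    action: function () {
      // executed on the server: either both updates are applied or neither
      var db = require("@arangodb").db;
      db.accounts.update("alice", { balance: 50 });
      db.accounts.update("bob", { balance: 150 });
    }
  });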

Denormalizing Data

In traditional SQL databases it is considered a good practice to normalize all your data across multiple tables to avoid duplicated data and ensure consistency.

ArangoDB is a schema-less NoSQL multi-model database, so a good data model is not necessarily normalized. On the contrary, to avoid extra joins it is often an advantage to deliberately denormalize your data model.

To denormalize your data model you essentially combine all related entities into a single document instead of spreading them over multiple documents and collections. The advantage of this is that it allows you to atomically update all of your connected data; the downside is that your documents become larger (see below for more considerations on large documents).

As a simple example, let's say you want to maintain the total amount of a shopping basket (from an online shop) together with a list of all included items and prices. If the total balance of all items in the shopping basket should stay in sync with the contained items, you may put all contained items inside the shopping basket document and only update them together:

  {
    "_id": "basket/123",
    "_key": "123",
    "_rev": "_Xv0TA0O--_",
    "user": "some_user",
    "balance": "100",
    "items": [ { "price": 10, "title": "Harry Potter and the Philosopher’s Stone" },
               { "price": 90, "title": "Vacuum XYZ" } ]
  }

This allows you to avoid making lookups via the document keys in multiple collections.
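
A sketch of such a combined update in arangosh (the basketCollection name, the added item and the new balance are illustrative only):

  // read the basket, modify the embedded items and the balance, write both back in one operation
  var basket = db.basketCollection.document("123");
  basket.items.push({ "price": 25, "title": "Board game" });
  basket.balance = "125";
  db.basketCollection.update(basket._key, { items: basket.items, balance: basket.balance });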

Ensuring Consistent Atomic Updates

There are ways to ensure atomicity and consistency when performing updates in your application. ArangoDB allows you to specify the revision ID (_rev) value of the existing document you want to update. The update or replace operation is only able to succeed if the values match. This way you can ensure that if your application has read a document with a certain _rev value, the modifications to it are only allowed to pass if and only if the document was not changed by someone else in the meantime. By specifying a document’s previous revision ID you can avoid losing updates on these documents without noticing it.

You can specify the revision via the _rev field inside the document or via the If-Match: <revision> HTTP header in the document REST API. In arangosh you can perform such an operation like this:

  db.basketCollection.update({"_key": "123", "_rev": "_Xv0TA0O--_"}, data)
  // or replace
  db.basketCollection.replace({"_key": "123", "_rev": "_Xv0TA0O--_"}, data)
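
If the stored revision no longer matches, the operation fails with a conflict error that your application can catch and handle, for example by re-reading the document and retrying. A minimal sketch (collection name and handling strategy are illustrative):

  try {
    db.basketCollection.update({ "_key": "123", "_rev": "_Xv0TA0O--_" }, data);
  } catch (err) {
    // the document was modified in the meantime (precondition failed / conflict):
    // re-read it, re-apply the changes and retry the update
  }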

An AQL query with the same effect can be written by using the ignoreRevs option together with a modification operation. Either let ArangoDB compare the _rev value and only succeed if they still match, or let ArangoDB ignore them (default):

  FOR i IN 1..1000
    UPDATE { _key: CONCAT('test', i), _rev: "1287623" }
    WITH { foobar: true } IN users
    OPTIONS { ignoreRevs: false }

Indexes

Indexes can improve the performance of AQL queries drastically. Queries that frequently filter on one or more fields can be made faster by creating an index (in arangosh via the ensureIndex command, the Web UI or your specific client driver). There is already an automatic (and non-deletable) primary index in every collection on the _key and _id fields, as well as the edge index on _from and _to (for edge collections).
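
For example, a sketch of creating indexes in arangosh (the orders collection and field names are illustrative only):

  // persistent index on two fields that are used together in FILTER conditions
  db.orders.ensureIndex({ type: "persistent", fields: [ "customerId", "status" ] });

  // sparse index: documents without a discountCode attribute create no index entry
  db.orders.ensureIndex({ type: "persistent", fields: [ "discountCode" ], sparse: true });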

Should you decide to create an index you should consider a few things:

  • Indexes are a trade-off between storage space, maintenance cost and query speed.
  • Each new index will increase the amount of RAM and (for the RocksDB storage engine) the amount of disk space needed.
  • Indexes with indexed array values need an extra index entry per array entry.
  • Adding indexes increases the write amplification, i.e. it negatively affects the write performance (how much depends on the storage engine).
  • Each index needs to add at least one index entry per document. You can use sparse indexes to avoid adding null index entries for rarely used attributes.
  • Sparse indexes can be smaller than non-sparse indexes, but they can only be used if the optimizer determines that the null value cannot be in the result range, e.g. by an explicit FILTER doc.attribute != null in AQL (also see Type and value order).
  • Collections that are more frequently read benefit the most from added indexes, provided the indexes can actually be utilized.
  • Indexes on collections with a high rate of inserts or updates compared to reads may hurt overall performance.

Generally it is best to design your indexes with your queries in mind. Use the query profiler to understand the bottlenecks in your queries.
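
A small sketch of invoking the profiler from arangosh, assuming the profiling helper is available in your version (query, collection and bind parameter are illustrative):

  db._profileQuery("FOR o IN orders FILTER o.customerId == @c RETURN o", { c: "customer42" });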

Always consider the additional space requirements of extra indexes when planning server capacities. For more information on indexes see Index Basics.

Number of Databases and Collections

Sometimes you may consider splitting up data over multiple collections. For example, one could create a new set of collections for each new customer instead of having a customer field on each document. Having a few thousand collections has no significant performance penalty for most operations and results in good performance.

Grouping documents into collections by type (i.e. a session collection ‘sessions_dev’, ‘sessions_prod’) allows you to avoid an extra index on a type field. Similarly you may consider splitting edge collections instead of specifying the type of the connection inside the edge document.

A few things to consider:

  • Adding an extra collection always incurs a small amount of overhead for the collection metadata and indexes.
  • You cannot use more than 2048 collections/shards per AQL query.
  • Uniqueness constraints on certain attributes (via a unique index) can only be enforced by ArangoDB within one collection.
  • Only with the MMFiles storage engine: creating extra databases will require two compaction and cleanup threads per database. This might lead to undesirable effects should you decide to create many databases compared to the number of available CPU cores.

Cluster Sharding

The ArangoDB cluster partitions your collections into one or more shards across multiple DBServers. This enables efficient horizontal scaling: it allows you to store much more data, since ArangoDB distributes the data automatically to the different servers. In many situations one can also reap a benefit in data throughput, again because the load can be distributed to multiple machines.

ArangoDB uses the specified shard keys to determine in which shard a given document is stored. Choosing the right shard key can have a significant impact on your performance: it can reduce network traffic and thereby increase performance.

ArangoDB uses consistent hashing to compute the target shard from the given values (as specified via ‘shardKeys’). The ideal set of shard keys allows ArangoDB to distribute documents evenly across your shards and your DBServers. By default ArangoDB uses the _key field as a shard key. For a custom shard key you should consider a few different properties:

  • Cardinality: The cardinality of a set is the number of distinct values that it contains. A shard key with only N distinct values cannot be hashed onto more than N shards. Consider using multiple shard keys, if one of your values has a low cardinality.
  • Frequency: Consider how often a given shard key value may appear in your data. Having a lot of documents with identical shard keys will lead to unevenly distributed data. Consider using multiple shard keys or a different one that is more suitable.

The default sharding should randomly distribute your documents across your cluster machines. This may be good enough for you, but depending on the kind of AQL queries and other operations an application performs, it may leave a lot of performance on the table.
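
As a sketch, a collection with a custom shard key can be created in arangosh like this (the orders collection, the shard count and the key are illustrative, not recommendations):

  // only meaningful in a cluster: distribute documents by customerId across 8 shards
  db._create("orders", { numberOfShards: 8, shardKeys: [ "customerId" ] });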

See Cluster Sharding for more information.

Smart Graphs

Smart Graphs are an Enterprise Edition feature of ArangoDB. They enable you to manage graphs at scale and give a vast performance benefit for all graphs sharded in an ArangoDB cluster.

To add a Smart Graph you need a smart graph attribute that partitions your graph into several smaller sub-graphs. Ideally these sub-graphs follow a “natural” structure in your data: they have a large number of edges that only connect vertices in the same sub-graph and only few edges connecting vertices from other sub-graphs.

All the usual considerations for sharding keys also apply for smart attributes. For more information see SmartGraphs.
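
As a sketch, a SmartGraph can be created from arangosh (Enterprise Edition only; the graph name, collection names and the region attribute are illustrative):

  var smartGraph = require("@arangodb/smart-graph");
  // vertices are grouped by their "region" attribute, keeping intra-region edges local
  smartGraph._create("customerGraph",
    [ smartGraph._relation("interactions", "customers", "customers") ],
    [],
    { smartGraphAttribute: "region", numberOfShards: 9 });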

Document and Transaction Sizes

When designing your data model you should keep in mind that the size of documents affects the performance and storage requirements of your system. Very large numbers of very small documents may have an unexpectedly big overhead: each document needs a certain amount of extra storage space, depending on the storage engine and the indexes you added to the collection. The overhead may become significant if you store a large amount of very small documents.

Very large documents may reduce your write throughput: this is due to the extra time needed to send larger documents over the network as well as more copying work required inside the storage engines.

Consider some ways to minimize the required amount of storage space:

  • Explicitly set the _key field to a custom unique value (see the sketch after this list). This enables you to store information in the _key field instead of another field inside the document. The _key value is always indexed; setting a custom value means you can use a shorter value than what would have been generated automatically.
  • Shorter field names will reduce the amount of space needed to store documents (this has no effect on index size). ArangoDB is schemaless and needs to store the document structure inside each document. Usually this is a small overhead compared to the overall document size.
  • Combining many small related documents into one larger one can also reduce overhead. Common fields can be stored once and indexes just need to store one entry. This will only be beneficial if the combined documents are regularly retrieved together and not just subsets.
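
A minimal arangosh sketch of the first point (the collection and key scheme are illustrative only):

  // the natural identifier doubles as the primary key, so no separate indexed field is needed
  db.orders.insert({ _key: "order-2018-000123", total: 100 });
  db.orders.document("order-2018-000123");   // direct lookup via the primary index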

RocksDB Storage Engine

Especially for the RocksDB storage engine, large documents and transactions may negatively impact the write performance:

  • Consider a maximum size of 50-75 kB per document as a good rule of thumb. This will allow you to maintain steady write throughput even under very high load.
  • Transactions are held in-memory before they are committed. This means that transactions have to be split if they become too big, see the limitations section.

Improving Update Query Performance

You may use the exclusive query option for modifying AQL queries to improve the performance drastically. This has the downside that no concurrent writes may occur on the collection, but ArangoDB is able to use a special fast-path which should improve the performance by up to 50% for large collections.

  FOR doc IN mycollection
    UPDATE doc._key
    WITH { foobar: true } IN mycollection
    OPTIONS { exclusive: true }

The same naturally also applies for queries using REPLACE or INSERT. Additionally you may be able to use the intermediateCommitCount option in the API to subdivide the AQL transaction into smaller batches.

Read / Write Load Balance

Depending on whether your data model has a higher read- or higher write-rate, you may want to adjust some of the RocksDB specific options. Some of the most critical options to adjust the performance and memory usage are listed below:

--rocksdb.block-cache-size

This is the size of the block cache in bytes. This cache is used for read operations. Increasing the size of this may improve the performance of read heavy workloads. You may wish to adjust this parameter to control memory usage.

--rocksdb.write-buffer-size

Amount of data to build up in memory before converting to a file on disk. Larger values increase performance, especially during bulk loads.

--rocksdb.max-write-buffer-number

Maximum number of write buffers that are built up in memory, per internal column family. The default and the minimum number is 2, so that when one write buffer is being flushed to storage, new writes can continue to the other write buffer.

--rocksdb.total-write-buffer-size

The total amount of data to build up in all in-memory buffers when writing into ArangoDB. You may wish to adjust this parameter to control memory usage.

Setting this to a low value may limit the RAM that ArangoDB will use, but may slow down write heavy workloads. Setting this to 0 will not limit the size of the write-buffers.

--rocksdb.level0-stop-trigger

When this many files accumulate in level-0, writes will be stopped to allow compaction to catch up. Setting this value very high may improve write throughput, but may lead to temporarily bad read performance.
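
As a sketch, these options can be passed to arangod on the command line (e.g. --rocksdb.block-cache-size) or placed in the [rocksdb] section of the configuration file; the numeric values below are placeholders for illustration, not tuning recommendations:

  # arangod.conf (values are illustrative only)
  [rocksdb]
  block-cache-size = 268435456
  write-buffer-size = 67108864
  max-write-buffer-number = 4
  total-write-buffer-size = 536870912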