Features and Improvements in ArangoDB 2.7

The following list shows in detail which features have been added or improved in ArangoDB 2.7. ArangoDB 2.7 also contains several bugfixes that are not listed here. For a list of bugfixes, please consult the CHANGELOG.

Performance improvements

Index buckets

The primary indexes and hash indexes of collections can now be split into multiple index buckets. This option was available for edge indexes only in ArangoDB 2.6.

A bucket can be considered a container for a specific range of index values. For primary, hash and edge indexes, determining the responsible bucket for an index value is done by hashing the actual index value and applying a simple arithmetic operation on the hash.

Because an index value will be present in at most one bucket and buckets are independent, using multiple buckets provides the following benefits:

  • initially building the in-memory index data can be parallelized even for a single index, with one thread per bucket (or with threads being responsible for more than one bucket at a time). This can help reduce the loading time for collections.

  • resizing an index when it is about to run out of reserve space is performed per bucket. As each bucket only contains a fraction of the entire index, resizing and rehashing a bucket is much faster and less intrusive than resizing and rehashing the entire index.

When creating new collections, the default number of index buckets is 8 since ArangoDB 2.7. In previous versions, the default value was 1. The number of buckets can also be adjusted for existing collections so they can benefit from the optimizations. The number of index buckets can be set for a collection at any time by using a collection’s properties function:

    db.collection.properties({ indexBuckets: 16 });

The number of index buckets must be a power of 2.
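
The text above does not name the hash function or the arithmetic operation used to pick a bucket. As a minimal sketch, assuming a generic 32-bit string hash (FNV-1a, chosen here only for illustration) and a bit mask as the arithmetic step, bucket selection could look as follows. The names `fnv1a`, `isPowerOfTwo` and `bucketFor` are hypothetical and not part of ArangoDB:

```javascript
// Illustrative only: FNV-1a is an assumption, not ArangoDB's actual hash.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

function isPowerOfTwo(n) {
  return Number.isInteger(n) && n > 0 && (n & (n - 1)) === 0;
}

// Maps an index value to exactly one bucket in the range 0..numBuckets-1.
function bucketFor(value, numBuckets) {
  if (!isPowerOfTwo(numBuckets)) {
    throw new Error("number of buckets must be a power of 2");
  }
  // For a power-of-two count, masking equals hash % numBuckets.
  return fnv1a(String(value)) & (numBuckets - 1);
}
```

In this sketch a power-of-two bucket count lets a cheap bit mask replace the modulo operation. That is one plausible motivation for the power-of-2 requirement, though the text does not state the actual reason.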

Please note that for building the index data for multiple buckets in parallel, a collection must contain a significant number of documents, because for a low number of documents the overhead of parallelization will outweigh its benefits. The current threshold value is 256k documents, but this value may change in future versions of ArangoDB. Additionally, the configuration option --database.index-threads will determine how many parallel threads may be used for building the index data.

Faster update and remove operations in non-unique hash indexes

The unique hash indexes in ArangoDB provided an amortized O(1) lookup, insert, update and remove performance. Non-unique hash indexes provided amortized O(1) insert performance, but had worse performance for update and remove operations for non-unique values. For documents with the same index value, they maintained a list of collisions. When a document was updated or removed, that exact document had to be found in the collisions list for the index value. While getting to the start of the collisions list was O(1), scanning the list had O(n) performance in the worst case (with n being the number of documents with the same index value). Overall, this made update and remove operations in non-unique hash indexes slow if the index contained many duplicate values.

This has been changed in ArangoDB 2.7 so that non-unique hash indexes now also provide update and remove operations with an amortized complexity of O(1), even if there are many duplicates.

Resizing non-unique hash indexes now also doesn’t require looking into the document data (which may involve a disk access) because the index maintains some internal cache value per document. When resizing and rehashing the index (or an index bucket), the index will first compare only the cache values before peeking into the actual documents. This change can also lead to reduced index resizing times.

Throughput enhancements

The ArangoDB-internal implementations for dispatching requests, keeping statistics and assigning V8 contexts to threads have been improved in order to use fewer locks. These changes allow higher concurrency and throughput in these components, which can also make the server handle more requests in a given period of time.

What gains can be expected depends on which operations are executed, but there are real-world cases in which throughput increased by between 25 % and 70 % when compared to 2.6 (see blog post).

Madvise hints

The Linux variant for ArangoDB provides the OS with madvise hints about index memory and datafile memory. These hints can speed up things when memory is tight, in particular at collection load time but also for random accesses later. There is no formal guarantee that the OS actually uses the madvise hints provided by ArangoDB, but actual measurements have shown improvements for loading bigger collections.

AQL improvements

Additional date functions

ArangoDB 2.7 provides several extra AQL functions for date and time calculation and manipulation. These functions were contributed by GitHub users @CoDEmanX and @friday. A big thanks for their work!

The following extra date functions are available from 2.7 on:

  • DATE_DAYOFYEAR(date): Returns the day of year number of date. The return values range from 1 to 365, or 366 in a leap year respectively.

  • DATE_ISOWEEK(date): Returns the ISO week date of date. The return values range from 1 to 53. Monday is considered the first day of the week. There are no fractional weeks, thus the last days in December may belong to the first week of the next year, and the first days in January may be part of the previous year’s last week.

  • DATE_LEAPYEAR(date): Returns whether the year of date is a leap year.

  • DATE_QUARTER(date): Returns the quarter of the given date (1-based):

    • 1: January, February, March
    • 2: April, May, June
    • 3: July, August, September
    • 4: October, November, December
  • DATE_DAYS_IN_MONTH(date): Returns the number of days in date’s month (28..31).

  • DATE_ADD(date, amount, unit): Adds amount given in unit to date and returns the calculated date.

unit can be either of the following to specify the time unit to add or subtract (case-insensitive):

  • y, year, years
  • m, month, months
  • w, week, weeks
  • d, day, days
  • h, hour, hours
  • i, minute, minutes
  • s, second, seconds
  • f, millisecond, milliseconds

amount is the number of units to add (positive value) or subtract (negative value).

  • DATE_SUBTRACT(date, amount, unit): Subtracts amount given in unit from date and returns the calculated date.

It works the same as DATE_ADD(), except that it subtracts. It is equivalent to calling DATE_ADD() with a negative amount, except that DATE_SUBTRACT() can also subtract ISO durations. Note that negative ISO durations are not supported (i.e. starting with -P, like -P1Y).

  • DATE_DIFF(date1, date2, unit, asFloat): Calculate the difference between two dates in the given time unit, optionally with decimal places. Returns a negative value if date1 is greater than date2.

  • DATE_COMPARE(date1, date2, unitRangeStart, unitRangeEnd): Compare two partial dates and return true if they match, false otherwise. The parts to compare are defined by a range of time units.

The full range is: years, months, days, hours, minutes, seconds, milliseconds. Pass the unit to start from as unitRangeStart, and the unit to end with as unitRangeEnd. All units in between will be compared. Leave out unitRangeEnd to only compare unitRangeStart.

  • DATE_FORMAT(date, format): Format a date according to the given format string. It supports the following placeholders (case-insensitive):
    • %t: timestamp, in milliseconds since midnight 1970-01-01
    • %z: ISO date (0000-00-00T00:00:00.000Z)
    • %w: day of week (0..6)
    • %y: year (0..9999)
    • %yy: year (00..99), abbreviated (last two digits)
    • %yyyy: year (0000..9999), padded to length of 4
    • %yyyyyy: year (-009999 .. +009999), with sign prefix and padded to length of 6
    • %m: month (1..12)
    • %mm: month (01..12), padded to length of 2
    • %d: day (1..31)
    • %dd: day (01..31), padded to length of 2
    • %h: hour (0..23)
    • %hh: hour (00..23), padded to length of 2
    • %i: minute (0..59)
    • %ii: minute (00..59), padded to length of 2
    • %s: second (0..59)
    • %ss: second (00..59), padded to length of 2
    • %f: millisecond (0..999)
    • %fff: millisecond (000..999), padded to length of 3
    • %x: day of year (1..366)
    • %xxx: day of year (001..366), padded to length of 3
    • %k: ISO week date (1..53)
    • %kk: ISO week date (01..53), padded to length of 2
    • %l: leap year (0 or 1)
    • %q: quarter (1..4)
    • %a: days in month (28..31)
    • %mmm: abbreviated English name of month (Jan..Dec)
    • %mmmm: English name of month (January..December)
    • %www: abbreviated English name of weekday (Sun..Sat)
    • %wwww: English name of weekday (Sunday..Saturday)
    • %&: special escape sequence for rare occasions
    • %%: literal %
    • %: ignored
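
To make the calendar rules behind the day-of-year, ISO week, leap year and quarter functions more concrete, the following plain-JavaScript sketch reimplements them for UTC dates. It illustrates the rules described above and is not ArangoDB's implementation. The function names `isLeapYear`, `dayOfYear`, `isoWeek` and `quarter` are hypothetical.

```javascript
// Plain-JavaScript sketch of the calendar rules; not ArangoDB code.
const MS_PER_DAY = 86400000;

function isLeapYear(year) {
  return (year % 4 === 0 && year % 100 !== 0) || year % 400 === 0;
}

// Day of year, 1-based (1..365, or 1..366 in a leap year).
function dayOfYear(date) {
  const startOfYear = Date.UTC(date.getUTCFullYear(), 0, 1);
  const startOfDay = Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate());
  return Math.floor((startOfDay - startOfYear) / MS_PER_DAY) + 1;
}

// ISO week number: weeks start on Monday, and a week belongs to the
// year that contains its Thursday.
function isoWeek(date) {
  const d = new Date(Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()));
  const weekday = d.getUTCDay() || 7; // Monday = 1 ... Sunday = 7
  d.setUTCDate(d.getUTCDate() + 4 - weekday); // move to that week's Thursday
  const yearStart = Date.UTC(d.getUTCFullYear(), 0, 1);
  return Math.ceil(((d - yearStart) / MS_PER_DAY + 1) / 7);
}

// Quarter of the year, 1-based.
function quarter(date) {
  return Math.floor(date.getUTCMonth() / 3) + 1;
}
```

Under these rules, 2014-12-29 (a Monday) falls into week 1 of the following year, while 2016-01-01 (a Friday) still belongs to week 53 of the previous year.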

RETURN DISTINCT

To return unique values from a query, AQL now provides the DISTINCT keyword. It can be used as a modifier for RETURN statements, as a shorter alternative to the already existing COLLECT statement.

For example, the following query only returns distinct (unique) status attribute values from the collection:

    FOR doc IN collection
      RETURN DISTINCT doc.status

RETURN DISTINCT is not allowed on the top-level of a query if there is no FOR loop in front of it. RETURN DISTINCT is allowed in subqueries.

RETURN DISTINCT ensures that the values returned are distinct (unique), but does not guarantee any order of results. To obtain a specific result order, an additional SORT statement must be added to the query.
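
As a rough analogy in plain JavaScript (not AQL, and with invented sample data), the sketch below separates the two concerns: removing duplicates and establishing an order. Unlike RETURN DISTINCT, a JavaScript Set happens to preserve insertion order, so the explicit sort stands in for the SORT statement a query would need.

```javascript
// Invented sample documents, for illustration only.
const docs = [
  { status: "active" },
  { status: "inactive" },
  { status: "active" },
  { status: "pending" }
];

// Analogous to: FOR doc IN collection RETURN DISTINCT doc.status
const distinctStatuses = [...new Set(docs.map((doc) => doc.status))];

// Analogous to adding a SORT: callers that need an order must ask for it.
const sortedStatuses = [...distinctStatuses].sort();
```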

Shorthand object notation

AQL now provides a shorthand notation for object literals in the style of ES6 object literals:

    LET name = "Peter"
    LET age = 42
    RETURN { name, age }

This is equivalent to the previously available canonical form, which is still available and supported:

    LET name = "Peter"
    LET age = 42
    RETURN { name : name, age : age }

Array expansion improvements

The already existing [*] operator has been improved with optional filtering, projection and limit capabilities.

For example, consider the following query that filters values from an array attribute:

    FOR u IN users
      RETURN {
        name: u.name,
        friends: (
          FOR f IN u.friends
            FILTER f.age > u.age
            RETURN f.name
        )
      }

With the [*] operator, this query can be simplified to

    FOR u IN users
      RETURN { name: u.name, friends: u.friends[* FILTER CURRENT.age > u.age].name }

The pseudo-variable CURRENT can be used to access the current array element. The FILTER condition can refer to CURRENT or any variables valid in the outer scope.

To return a projection of the current element, there can now be an inline RETURN:

    FOR u IN users
      RETURN u.friends[* RETURN CONCAT(CURRENT.name, " is a friend of ", u.name)]

which is the simplified variant for:

    FOR u IN users
      RETURN (
        FOR friend IN u.friends
          RETURN CONCAT(friend.name, " is a friend of ", u.name)
      )
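
For readers more familiar with JavaScript, the inline FILTER and RETURN of the [*] operator correspond roughly to filter and map on arrays. The following sketch uses invented sample data and plain JavaScript. It is only an analogy to the AQL queries above, with the callback parameter `current` playing the role of CURRENT:

```javascript
// Invented sample data, for illustration only.
const users = [
  { name: "Anna", age: 30, friends: [{ name: "Bob", age: 35 }, { name: "Cleo", age: 25 }] },
  { name: "Dave", age: 20, friends: [{ name: "Eve", age: 22 }] }
];

// Analogous to: u.friends[* FILTER CURRENT.age > u.age].name
const olderFriends = users.map((u) => ({
  name: u.name,
  friends: u.friends.filter((current) => current.age > u.age).map((f) => f.name)
}));

// Analogous to: u.friends[* RETURN CONCAT(CURRENT.name, " is a friend of ", u.name)]
const sentences = users.map((u) =>
  u.friends.map((current) => `${current.name} is a friend of ${u.name}`)
);
```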

Array contraction

In order to collapse (or flatten) results in nested arrays, AQL now provides the [**] operator. It works similarly to the [*] operator, but additionally collapses nested arrays. How many levels are collapsed is determined by the number of * characters used.

For example, consider the following query that produces a nested result:

    FOR u IN users
      RETURN u.friends[*].name

The [**] operator can now be applied to get rid of the nested array and turn it into a flat array. We simply apply the [**] on the previous query result:

    RETURN (
      FOR u IN users RETURN u.friends[*].name
    )[**]
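
In JavaScript terms, the contraction resembles Array.prototype.flat. The sketch below uses invented sample data and assumes that [**] collapses exactly one level of nesting, with each further * collapsing one more. It is an analogy, not AQL:

```javascript
// Invented sample data, for illustration only.
const people = [
  { friends: [{ name: "Bob" }, { name: "Cleo" }] },
  { friends: [{ name: "Eve" }] }
];

// Analogous to: FOR u IN users RETURN u.friends[*].name
const nested = people.map((p) => p.friends.map((f) => f.name));

// Analogous to applying [**] on the result: collapse one level of nesting.
const flattened = nested.flat(1);
```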

Template query strings

Assembling query strings in JavaScript has been error-prone when using simple string concatenation, especially because plain JavaScript strings do not have multiline-support, and because of potential parameter injection issues. While multiline query strings can be assembled with ES6 template strings since ArangoDB 2.5, and query bind parameters are there since ArangoDB 1.0 to prevent parameter injection, there was no JavaScript-y solution to combine these.

ArangoDB 2.7 now provides an ES6 template string generator function that can be used to easily and safely assemble AQL queries from JavaScript. JavaScript variables and expressions can be used easily using regular ES6 template string substitutions:

    let name = 'test';
    let attributeName = '_key';
    let query = aqlQuery`FOR u IN users
                         FILTER u.name == ${name}
                         RETURN u.${attributeName}`;
    db._query(query);

This is more legible than when using a plain JavaScript string and also does not require defining the bind parameter values separately:

    let name = 'test';
    let attributeName = '_key';
    let query = "FOR u IN users " +
                "FILTER u.name == @name " +
                "RETURN u.@attributeName";
    db._query(query, {
      name,
      attributeName
    });

The aqlQuery template string generator will also handle collection objects automatically:

    db._query(aqlQuery`FOR u IN ${ db.users } RETURN u.name`);

Note that while template strings are available in the JavaScript functions provided to build queries, they aren’t a feature of AQL itself. AQL could always handle multiline query strings and provided bind parameters (@…) for separating the query string and the parameter values. The aqlQuery template string generator function will take care of this separation, too, but will do it behind the scenes.
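
How such a separation can work behind the scenes can be sketched with an ES6 tag function. The following is a minimal illustration of the general technique, not the actual aqlQuery implementation. The function name `aqlSketch`, the generated parameter names (`value0`, `value1`, ...) and the shape of the returned object are assumptions made for this example.

```javascript
// Hypothetical tag function; a sketch of the idea, not ArangoDB's aqlQuery.
function aqlSketch(strings, ...values) {
  const bindVars = {};
  let query = strings[0];
  values.forEach((value, i) => {
    const paramName = `value${i}`;
    bindVars[paramName] = value;               // the value travels separately
    query += `@${paramName}` + strings[i + 1]; // the query only gets a placeholder
  });
  return { query, bindVars };
}

const userName = "test";
const result = aqlSketch`FOR u IN users FILTER u.name == ${userName} RETURN u`;
// result.query    is "FOR u IN users FILTER u.name == @value0 RETURN u"
// result.bindVars is { value0: "test" }
```

Because the substituted value never becomes part of the query string, it cannot change the structure of the query. This simplified sketch treats every substitution as a value bind parameter and does not special-case collection objects.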

AQL query result cache

The AQL query result cache can optionally cache the complete results of all or just selected AQL queries. It can be operated in the following modes:

  • off: the cache is disabled. No query results will be stored
  • on: the cache will store the results of all AQL queries unless their cache attribute flag is set to false
  • demand: the cache will store the results of AQL queries that have their cache attribute set to true, but will ignore all others

The mode can be set at server startup using the --database.query-cache-mode configuration option and later changed at runtime. The default value is off, meaning that the query result cache is disabled. This is because the cache may consume additional memory to keep query results, and also because it must be invalidated when changes happen in collections for which results have been cached.

The query result cache may therefore have positive or negative effects on query execution times, depending on the workload: it will not make much sense turning on the cache in write-only or write-mostly scenarios, but the cache may be very beneficial in case workloads are read-only or read-mostly, and queries are complex.

If the query cache is operated in demand mode, it can be controlled per query if the cache should be checked for a result.
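
The three modes boil down to a small decision rule. The following sketch restates it in JavaScript. The function name `shouldStoreResult` is hypothetical, and the sketch only mirrors the description above rather than the server's actual code.

```javascript
// Sketch of the mode rules described above; the function name is hypothetical.
// cacheFlag is the per-query cache attribute: true, false, or undefined if unset.
function shouldStoreResult(mode, cacheFlag) {
  switch (mode) {
    case "off":
      return false;               // nothing is ever stored
    case "on":
      return cacheFlag !== false; // stored unless the query opts out
    case "demand":
      return cacheFlag === true;  // stored only if the query opts in
    default:
      throw new Error(`unknown cache mode: ${mode}`);
  }
}
```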

Miscellaneous AQL changes

Optimizer

The AQL optimizer rule patch-update-statements has been added. This rule can optimize certain AQL UPDATE queries that update documents in a collection that they also iterate over.

For example, the following query reads documents from a collection in order to update them:

    FOR doc IN collection
      UPDATE doc WITH { newValue: doc.oldValue + 1 } IN collection

In this case, only a single collection is affected by the query, and there is no index lookup involved to find the to-be-updated documents. The UPDATE query therefore does not require taking a full, memory-intensive snapshot of the collection, but can be performed in small chunks. This can lead to memory savings when executing such queries.

Function call arguments optimization

With this optimization, arguments in function calls inside AQL queries are no longer copied but passed by reference. This may speed up calls to functions with bigger argument values, or queries that call AQL functions many times.

Web Admin Interface

The web interface now has a new design.

The “Applications” tab in the web interface has been renamed to “Services”.

The ArangoDB API documentation has been moved from the “Tools” menu to the “Links” menu. The new documentation is based on Swagger 2.0 and opens in a separate web page.

Foxx improvements

ES2015 Classes

All Foxx constructors have been replaced with ES2015 classes and can be extended using the class syntax. The extend method is still supported at the moment but will become deprecated in ArangoDB 2.8 and removed in ArangoDB 2.9.

Before:

    var Foxx = require('org/arangodb/foxx');
    var MyModel = Foxx.Model.extend({
      // ...
      schema: {/* ... */}
    });

After:

    var Foxx = require('org/arangodb/foxx');
    class MyModel extends Foxx.Model {
      // ...
    }
    MyModel.prototype.schema = {/* ... */};

Confidential configuration

It is now possible to specify configuration options with the type password. The password type is equivalent to the text type but will be masked in the web frontend to prevent accidental exposure of confidential options like API keys and passwords when configuring your Foxx application.

Dependencies

The syntax for specifying dependencies in manifests has been extended to allow specifying optional dependencies. Unmet optional dependencies will not prevent an app from being mounted. The traditional shorthand syntax for specifying non-optional dependencies will still be supported in the upcoming versions of ArangoDB.

Before:

    {
      ...
      "dependencies": {
        "notReallyNeeded": "users:^1.0.0",
        "totallyNecessary": "sessions:^1.0.0"
      }
    }

After:

    {
      "dependencies": {
        "notReallyNeeded": {
          "name": "users",
          "version": "^1.0.0",
          "required": false
        },
        "totallyNecessary": {
          "name": "sessions",
          "version": "^1.0.0"
        }
      }
    }
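
The relationship between the shorthand and the extended syntax can be illustrated with a small normalization helper. This is a sketch under assumptions, not part of the Foxx API: the function name `normalizeDependency` is hypothetical, the shorthand is assumed to split at the first colon, and a dependency is assumed to be required unless "required" is explicitly false.

```javascript
// Hypothetical helper; not part of the Foxx API.
function normalizeDependency(definition) {
  if (typeof definition === "string") {
    // Shorthand "name:version", assumed to split at the first colon.
    // Falling back to "*" when no version is given is an assumption.
    const idx = definition.indexOf(":");
    const name = idx === -1 ? definition : definition.slice(0, idx);
    const version = idx === -1 ? "*" : definition.slice(idx + 1);
    return { name, version, required: true }; // shorthand is always non-optional
  }
  return {
    name: definition.name,
    version: definition.version,
    required: definition.required !== false // assumed default: required
  };
}
```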

Replication

The existing replication HTTP API has been extended with methods that replication clients can use to determine whether a given date, identified by a tick value, is still present on a master for replication. By calling these APIs, clients can make an informed decision about whether the master can still provide all missing data starting from the point up to which the client had already synchronized. This can be helpful in case a replication client is re-started after a pause.

Master servers now also keep track of the point up to which they have sent changes to clients for replication. This information can be used to determine the point up to which replication clients have received data from the master, and if and approximately how far they lag behind.

Finally, restarting the replication applier on a slave server has been made more robust in case the applier was stopped while there were pending transactions on the master server, and re-starting the replication applier needs to restore the state of these transactions.

Client tools

The filenames in dumps created by arangodump now contain not only the name of the dumped collection, but also an additional 32-digit hash value. This is done to prevent overwriting dump files in case-insensitive file systems when there exist multiple collections with the same name (but with different cases).

For example, if a database had two collections test and Test, previous versions of arangodump created the following files:

  • test.structure.json and test.data.json for collection test
  • Test.structure.json and Test.data.json for collection Test

This did not work in case-insensitive filesystems, because the files for the second collection would have overwritten the files of the first. arangodump in 2.7 will create unique files in this case, by appending the 32-digit hash value to the collection name in all cases. These filenames will be unambiguous even in case-insensitive filesystems.
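
The idea can be sketched in a few lines of Node.js. The text above does not say which hash function arangodump uses or how the hash is joined to the collection name. In this sketch, MD5 (which yields 32 hexadecimal digits) and an underscore separator are assumptions, and `dumpFileNames` is a hypothetical helper.

```javascript
const crypto = require("crypto");

// Assumptions: MD5 as the 32-digit hash and "_" as the separator are
// illustrative guesses, not confirmed arangodump behavior.
function dumpFileNames(collectionName) {
  const hash = crypto.createHash("md5").update(collectionName).digest("hex");
  const base = `${collectionName}_${hash}`;
  return {
    structure: `${base}.structure.json`,
    data: `${base}.data.json`
  };
}
```

Because the hash is computed from the case-sensitive collection name, the file names for test and Test still differ after case folding.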

Miscellaneous changes

Better control-C support in arangosh

When CTRL-C is pressed in arangosh, it will now abort the locally running command (if any). If no command was running, pressing CTRL-C will print a ^C first. Pressing CTRL-C again will then quit arangosh.

CTRL-C can also be used to reset the current prompt while entering complex nested objects which span multiple input lines.

CTRL-C support has been added to the ArangoShell versions built with Readline-support (Linux and macOS only). The Windows version of ArangoDB uses a different library for handling input, and support for CTRL-C has not been added there yet.

Start / stop

Linux startup scripts and systemd configuration for arangod now try to adjust the NOFILE (number of open files) limits for the process. The limit value is set to 131072 (128k) when ArangoDB is started via start/stop commands.

This will prevent arangod running out of available file descriptors in case of many parallel HTTP connections or large collections with many datafiles.

Additionally, when ArangoDB is started/stopped manually via the start/stop commands, the main process will wait for up to 10 seconds after it forks the supervisor and arangod child processes. If the startup fails within that period, the start/stop script will fail with a non-zero exit code, allowing any invoking scripts to handle this error. Previous versions always returned an exit code of 0, even when arangod couldn’t be started.

If the startup of the supervisor or arangod is still ongoing after 10 seconds, the main program will still return with exit code 0 in order to not block any scripts. The limit of 10 seconds is arbitrary because the time required for an arangod startup is not known in advance.

Non-sparse logfiles

WAL logfiles and datafiles created by arangod are now non-sparse. This prevents SIGBUS signals being raised when a memory-mapped region backed by a sparse datafile was accessed and the memory region was not actually backed by disk, for example because the disk ran out of space.

arangod now always fully allocates the disk space required for a logfile or datafile when it creates one, so the memory region can always be backed by disk, and memory can be accessed without SIGBUS being raised.