Arangoimport Details
The most convenient method to import a lot of data into ArangoDB is to use the arangoimport command-line tool. It allows you to bulk import data records from a file into a database collection. Multiple files can be imported into the same or different collections by invoking it multiple times.
Importing into an Edge Collection
Arangoimport can also be used to import data into an existing edge collection. The import data must, for each edge to import, contain at least the _from and _to attributes. These indicate which two documents the edge should connect. It is necessary that these attributes are set for all records, and point to valid document IDs in existing collections.
Example
{ "_from" : "users/1234", "_to" : "users/4321", "desc" : "1234 is connected to 4321" }
Note: The edge collection must already exist when the import is started. Using the --create-collection flag will not work because arangoimport will always try to create a regular document collection if the target collection does not exist.
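Because every edge record needs both _from and _to, it can help to check an import file's records before running arangoimport. The following is a minimal sketch in Python, not part of arangoimport itself; the function name check_edge_record is hypothetical, and the check only covers the syntactic form collection/key. It cannot tell whether the referenced documents actually exist.

```python
import json

def check_edge_record(record):
    """Return a list of problems found in one edge record (empty if none)."""
    problems = []
    for attr in ("_from", "_to"):
        value = record.get(attr)
        if not isinstance(value, str):
            problems.append(f"{attr} is missing or not a string")
            continue
        # A document ID is expected to look like "collection/key".
        collection, sep, key = value.partition("/")
        if not (collection and sep and key) or "/" in key:
            problems.append(f"{attr} is not of the form collection/key: {value!r}")
    return problems

edges = [
    {"_from": "users/1234", "_to": "users/4321", "desc": "1234 is connected to 4321"},
    {"_from": "users/1234", "desc": "no _to attribute"},
]

# Serialize only the records that pass the check, one JSON document each.
valid_lines = [json.dumps(e) for e in edges if not check_edge_record(e)]
```

Here only the first record passes; the second is reported as lacking a usable _to value.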
Attribute Naming and Special Attributes
Attributes whose names start with an underscore are treated in a special way by ArangoDB:
- _key: this optional attribute contains the document’s key. If specified, the value must be formally valid (e.g. must be a string and conform to the naming conventions). Additionally, the key value must be unique within the collection the import is run for.
- _from: when importing into an edge collection, this attribute contains the id of one of the documents connected by the edge. The value of _from must be a syntactically valid document id and the referred collection must exist.
- _to: when importing into an edge collection, this attribute contains the id of the other document connected by the edge. The value of _to must be a syntactically valid document id and the referred collection must exist.
- _rev: this attribute contains the revision number of a document. However, the revision numbers are managed by ArangoDB and cannot be specified on import. Thus any value in this attribute is ignored on import.
If you import values into _key, you should make sure they are valid and unique.
When importing data into an edge collection, you should make sure that all import documents contain _from and _to and that their values point to existing documents.
To avoid specifying complete document ids (consisting of collection names and document keys) for _from and _to values, there are the options --from-collection-prefix and --to-collection-prefix. If specified, these values will be automatically prepended to each value in _from (or _to, respectively). This allows specifying only document keys inside _from and/or _to.
Example
arangoimport --from-collection-prefix users --to-collection-prefix products ...
Importing the following document will then create an edge between users/1234 and products/4321:
{ "_from" : "1234", "_to" : "4321", "desc" : "users/1234 is connected to products/4321" }
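The effect of the two prefix options on a record can be illustrated with a small Python sketch. This is an illustration of the behavior described above, not arangoimport's actual code; the function name apply_prefixes is hypothetical, and the sketch assumes the values in _from and _to are bare document keys. How values that already contain a collection name are treated is not covered here.

```python
def apply_prefixes(record, from_prefix=None, to_prefix=None):
    """Return a copy of record with collection prefixes prepended to _from/_to."""
    result = dict(record)
    if from_prefix is not None and "_from" in result:
        result["_from"] = f"{from_prefix}/{result['_from']}"
    if to_prefix is not None and "_to" in result:
        result["_to"] = f"{to_prefix}/{result['_to']}"
    return result

edge = {"_from": "1234", "_to": "4321"}
expanded = apply_prefixes(edge, from_prefix="users", to_prefix="products")
# expanded is {"_from": "users/1234", "_to": "products/4321"}
```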
Updating existing documents
By default, arangoimport will try to insert all documents from the import file into the specified collection. In case the import file contains documents that are already present in the target collection (matching is done via the _key attributes), then a default arangoimport run will not import these documents and complain about unique key constraint violations.
However, arangoimport can be used to update or replace existing documents in case they already exist in the target collection. It provides the command-line option --on-duplicate to control the behavior in case a document is already present in the database.
The default value of --on-duplicate is error. This means that when the import file contains a document that is present in the target collection already, then trying to re-insert a document with the same _key value is considered an error, and the document in the database will not be modified.
Other possible values for --on-duplicate are:
- update: each document present in the import file that is also present in the target collection already will be updated by arangoimport. update will perform a partial update of the existing document, modifying only the attributes that are present in the import file and leaving all other attributes untouched.
  The values of system attributes _id, _key, _rev, _from and _to cannot be updated or replaced in existing documents.
- replace: each document present in the import file that is also present in the target collection already will be replaced by arangoimport. replace will replace the existing document entirely, resulting in a document with only the attributes specified in the import file.
  The values of system attributes _id, _key, _rev, _from and _to cannot be updated or replaced in existing documents.
- ignore: each document present in the import file that is also present in the target collection already will be ignored and not modified in the target collection.
When --on-duplicate is set to either update or replace, arangoimport will return the number of documents updated/replaced in the updated return value. When set to another value, the value of updated will always be zero. When --on-duplicate is set to ignore, arangoimport will return the number of ignored documents in the ignored return value. When set to another value, ignored will always be zero.
It is possible to perform a combination of inserts and updates/replaces with a single arangoimport run. When --on-duplicate is set to update or replace, all documents present in the import file will be inserted into the target collection provided they are valid and do not already exist with the specified _key. Documents that are already present in the target collection (identified by _key attribute) will instead be updated/replaced.
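The four --on-duplicate modes can be compared with a small in-memory simulation in Python. This is only a sketch of the semantics described above, not how arangoimport or the server implements them. The function name simulate_import and the counter names are hypothetical, the collection is modeled as a plain dict keyed by _key, every incoming document is assumed to carry a _key, and the partial update is simplified to a top-level attribute merge.

```python
SYSTEM_ATTRIBUTES = ("_id", "_key", "_rev", "_from", "_to")

def simulate_import(collection, documents, on_duplicate="error"):
    """Apply documents to a dict keyed by _key and count the outcomes."""
    if on_duplicate not in ("error", "update", "replace", "ignore"):
        raise ValueError(f"unknown on_duplicate mode: {on_duplicate}")
    counts = {"created": 0, "errors": 0, "updated": 0, "ignored": 0}
    for doc in documents:
        key = doc["_key"]
        if key not in collection:
            # New key: a plain insert, regardless of the mode.
            collection[key] = dict(doc)
            counts["created"] += 1
            continue
        existing = collection[key]
        # System attributes of an existing document are never overwritten.
        user_attrs = {k: v for k, v in doc.items() if k not in SYSTEM_ATTRIBUTES}
        if on_duplicate == "error":
            counts["errors"] += 1
        elif on_duplicate == "ignore":
            counts["ignored"] += 1
        elif on_duplicate == "update":
            # Partial update: attributes absent from the import stay untouched.
            existing.update(user_attrs)
            counts["updated"] += 1
        else:
            # Replace: keep system attributes, drop all other old attributes.
            kept = {k: v for k, v in existing.items() if k in SYSTEM_ATTRIBUTES}
            collection[key] = {**kept, **user_attrs}
            counts["updated"] += 1
    return counts

collection = {"1": {"_key": "1", "name": "Ada", "city": "London"}}
incoming = [{"_key": "1", "name": "Ada L."}, {"_key": "2", "name": "Bob"}]
counts = simulate_import(collection, incoming, on_duplicate="update")
```

With update, document 1 keeps its city attribute and gets the new name, while document 2 is inserted, so a single run mixes one update with one insert. With replace, document 1 would end up without the city attribute.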
Result output
An arangoimport import run will print out the final results on the command line. It will show the
- number of documents created (created)
- number of documents updated/replaced (updated/replaced, only non-zero if --on-duplicate was set to update or replace, see above)
- number of warnings or errors that occurred on the server side (warnings/errors)
- number of ignored documents (ignored, only non-zero if --on-duplicate was set to ignore).
Example
created: 2
warnings/errors: 0
updated/replaced: 0
ignored: 0
For CSV and TSV imports, the total number of input file lines read will also be printed (lines read).
arangoimport will also print out details about warnings and errors that happened on the server side (if any).
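When arangoimport is run from a script, the summary can be turned into numbers for further checks. The following Python sketch assumes the output consists of "name: number" lines exactly as in the example above; the real output may contain additional lines or differ between versions, and the function name parse_import_summary is hypothetical.

```python
def parse_import_summary(output):
    """Collect 'name: number' lines from the printed summary into a dict."""
    summary = {}
    for line in output.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip anything that is not a simple counter line.
        if sep and value.isdigit():
            summary[name.strip()] = int(value)
    return summary

sample = """created: 2
warnings/errors: 0
updated/replaced: 0
ignored: 0"""

summary = parse_import_summary(sample)
# summary["created"] is 2 and summary["warnings/errors"] is 0
```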
Automatic pacing with busy or low throughput disk subsystems
Arangoimport has an automatic pacing algorithm that limits how fast data is sent to the ArangoDB servers. This pacing algorithm exists to prevent the import operation from failing due to slow responses.
Google Compute and other VM providers limit the throughput of disk devices. Google’s limit is stricter for smaller disk rentals than for larger ones. Specifically, a user could choose the smallest disk space and be limited to 3 Mbytes per second. Similarly, other users’ processes on the shared VM can limit the available throughput of the disk devices.
The automatic pacing algorithm adjusts the transmit block size dynamically based upon the actual throughput of the server over the last 20 seconds. Further, each thread delivers its portion of the data in mostly non-overlapping chunks. The thread timing creates intentional windows of non-import activity to allow the server extra time for meta operations.
Automatic pacing intentionally does not use the full throughput of a disk device. An unlimited (really fast) disk device might not need pacing. Raising the number of threads via the --threads X command-line option to any value of X greater than 2 will increase the total throughput used.
Automatic pacing frees the user from adjusting the throughput used to match available resources. It is disabled by manually specifying any --batch-size. 16777216 was the previous default for --batch-size. Having --batch-size too large can lead to transmitted data piling up on the server, resulting in a TimeoutError.
The pacing algorithm works successfully with MMFiles with disks limited to read and write throughput as small as 1 Mbyte per second. The algorithm works successfully with RocksDB with disks limited to read and write throughput as small as 3 Mbytes per second.
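When arangoimport is launched from a script, the two options discussed in this section can be added to the command line only when they are needed. The Python sketch below merely assembles an argument list; the function name build_arangoimport_args is hypothetical, and the --file and --collection options are taken from general arangoimport usage rather than from this section.

```python
def build_arangoimport_args(file, collection, threads=None, batch_size=None):
    """Assemble an arangoimport argument list; nothing is executed here."""
    args = ["arangoimport", "--file", file, "--collection", collection]
    if threads is not None:
        args += ["--threads", str(threads)]
    if batch_size is not None:
        # Specifying any batch size disables automatic pacing.
        args += ["--batch-size", str(batch_size)]
    return args

# Leave pacing enabled and only raise the thread count.
paced = build_arangoimport_args("data.json", "users", threads=4)
```

Leaving batch_size unset keeps automatic pacing active. The resulting list could be passed to subprocess.run to start the import.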