Arangoimport Examples: JSON

With JSON as the data format, records are represented as JSON objects, which are called documents in ArangoDB. They are self-contained, so the records in a collection do not need to share the same attribute names or types. Documents can be inhomogeneous, and data types are fully preserved.

Input file formats

arangoimport supports two formats when importing JSON data:

  • JSON – JavaScript Object Notation
  • JSON Lines – also known as JSONL or newline-delimited JSON

Multiple documents can be stored in standard JSON format in a top-level array with objects as members:

    [
      { "_key": "one", "value": 1 },
      { "_key": "two", "value": 2 },
      { "_key": "foo", "value": "bar" },
      ...
    ]

This format allows line breaks for formatting (i.e. pretty printing):

    [
      {
        "_key": "one",
        "value": 1
      },
      {
        "_key": "two",
        "value": 2
      },
      {
        "_key": "foo",
        "value": "bar"
      },
      ...
    ]

This format requires parsers to read the entire input in order to verify that the array is properly closed at the very end. arangoimport therefore needs to read the whole input before it can send the first batch to the server. By default, it allows importing such files up to a size of about 16 MB. To let arangoimport use more memory, increase the maximum file size with the command-line option --batch-size. For example, to set the batch size to 32 MB, use the following command:

    arangoimport --file "data.json" --type json --collection "users" --batch-size 33554432
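The value passed to --batch-size is a byte count, and 32 MB corresponds to 32 × 1024 × 1024 = 33554432. The following is a minimal Python sketch of that arithmetic, together with a check whether a given file is larger than a chosen limit. The helper names megabytes_to_bytes and exceeds_limit are hypothetical and not part of arangoimport.

```python
import os


def megabytes_to_bytes(megabytes: int) -> int:
    """Convert a size in MB (1 MB = 1024 * 1024 bytes) to a byte count."""
    return megabytes * 1024 * 1024


def exceeds_limit(path: str, limit_bytes: int) -> bool:
    """Return True if the file at `path` is larger than `limit_bytes`."""
    return os.path.getsize(path) > limit_bytes
```

For instance, megabytes_to_bytes(32) yields the value used in the command above, and exceeds_limit("data.json", megabytes_to_bytes(16)) indicates whether a file is larger than the approximate default limit mentioned in the text.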

JSON Lines formatted data allows processing each line individually:

    { "_key": "one", "value": 1 }
    { "_key": "two", "value": 2 }
    { "_key": "foo", "value": "bar" }
    ...

The above format can be imported sequentially by arangoimport. It reads data from the input in chunks and sends it in batches to the server. Each batch is about as big as specified by the command-line parameter --batch-size.

Note that you may still need to increase the value of --batch-size if a single document in the input file is bigger than that value.

JSON Lines does not allow line breaks for pretty printing. There has to be one complete JSON object on each line. In contrast to the JSON Lines specification, which allows any valid JSON value on a line, arangoimport does not support a JSON array or primitive value per line.
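These constraints can be checked before starting an import. The following Python sketch is one possible pre-flight check, not something arangoimport provides: it reports every line that is not a single JSON object and returns the size in bytes of the largest line, which can be compared against the intended --batch-size. The function name check_jsonl is hypothetical, and blank lines are skipped as an assumption of this sketch.

```python
import json


def check_jsonl(lines):
    """Check an iterable of text lines against the one-object-per-line rule.

    Returns (problems, largest_line_bytes), where problems is a list of
    (line_number, message) tuples.
    """
    problems = []
    largest = 0
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # assumption of this sketch: blank lines are ignored
        largest = max(largest, len(text.encode("utf-8")))
        try:
            value = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append((number, f"not valid JSON: {err.msg}"))
            continue
        if not isinstance(value, dict):
            # Arrays and primitive values are valid JSON but not objects.
            problems.append((number, "not a JSON object"))
    return problems, largest
```

A file can be checked by passing the open file object, for example check_jsonl(open("data.jsonl", encoding="utf-8")).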

Converting JSON to JSON Lines

An input with JSON objects in an array, optionally pretty printed, can be easily converted into JSONL with one JSON object per line using the jq command-line tool:

    jq -c ".[]" inputFile.json > outputFile.jsonl

The -c option enables compact JSON (as opposed to pretty-printed JSON). ".[]" is a filter that unpacks the top-level array. In combination with the compact option, it effectively puts each object in that array on a separate line.

An example inputFile.json can look like this:

    [
      {
        "isActive": true,
        "name": "Evans Wheeler",
        "latitude": -0.119406,
        "longitude": 146.271888,
        "tags": [
          "amet",
          "qui",
          "velit"
        ]
      },
      {
        "isActive": true,
        "name": "Coffey Barron",
        "latitude": -37.78772,
        "longitude": 131.218935,
        "tags": [
          "dolore",
          "exercitation",
          "irure",
          "velit"
        ]
      }
    ]

The conversion produces the following outputFile.jsonl:

    {"isActive":true,"name":"Evans Wheeler","latitude":-0.119406,"longitude":146.271888,"tags":["amet","qui","velit"]}
    {"isActive":true,"name":"Coffey Barron","latitude":-37.78772,"longitude":131.218935,"tags":["dolore","exercitation","irure","velit"]}
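Where jq is not available, the same conversion can be approximated with a few lines of Python. This is a sketch using only the standard library, and the function name json_array_to_jsonl is hypothetical. Like any parser of the array format, it has to load the entire input into memory first. Number formatting may differ from jq in edge cases, so the output is not guaranteed to be byte-identical for every input.

```python
import json


def json_array_to_jsonl(array_text: str) -> str:
    """Convert a JSON array of documents into JSON Lines text.

    Each array member is written as compact JSON on its own line,
    comparable to what the jq filter ".[]" with -c produces.
    """
    documents = json.loads(array_text)
    if not isinstance(documents, list):
        raise ValueError("expected a top-level JSON array")
    return "".join(
        # separators=(",", ":") removes the spaces json.dumps adds by default
        json.dumps(document, separators=(",", ":"), ensure_ascii=False) + "\n"
        for document in documents
    )
```

To convert a file, read inputFile.json as text, pass it to the function, and write the result to outputFile.jsonl.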

Import Example and Common Options

We will be using these example user records to import:

    { "name" : { "first" : "John", "last" : "Connor" }, "active" : true, "age" : 25, "likes" : [ "swimming" ] }
    { "name" : { "first" : "Jim", "last" : "O'Brady" }, "age" : 19, "likes" : [ "hiking", "singing" ] }
    { "name" : { "first" : "Lisa", "last" : "Jones" }, "dob" : "1981-04-09", "likes" : [ "running" ] }

To import these records, put them into a file (with one line for each record to import), save it as data.jsonl, and run the following command:

    arangoimport --file "data.jsonl" --type jsonl --collection users

This will transfer the data to the server, import the records, and print a status summary.

To show the intermediate progress during the import process, the option --progress can be added. This option shows the percentage of the input file that has been sent to the server. It is only useful for big import files.

    arangoimport --file "data.jsonl" --type jsonl --collection users --progress true

It is also possible to use the output of another command as input for arangoimport. For example, the following shell command pipes data from the cat process to arangoimport (Linux/Cygwin only):

    cat data.jsonl | arangoimport --file - --type jsonl --collection users

In the Command Prompt or PowerShell on Windows, the type command can be used instead:

    type data.jsonl | arangoimport --file - --type jsonl --collection users

The option --file - with a hyphen as the file name is special and makes arangoimport read from standard input. No progress can be reported for such imports because the size of the input is unknown to arangoimport.

By default, the endpoint tcp://127.0.0.1:8529 is used. If you want to specify a different endpoint, you can use the --server.endpoint option. You probably want to specify a database user and password as well. You can do so by using the options --server.username and --server.password. If you do not specify a password, you will be prompted for one.

    arangoimport --server.endpoint tcp://127.0.0.1:8529 --server.username root ...

Note that the collection (users in this case) must already exist or the import will fail. If you want to create a new collection with the import data, you need to specify the --create-collection option. It creates a document collection by default and not an edge collection.

    arangoimport --file "data.jsonl" --type jsonl --collection users --create-collection true

To create an edge collection instead, use the --create-collection-type option and set it to edge:

    arangoimport --collection myedges --create-collection true --create-collection-type edge ...

When importing data into an existing collection, it is often convenient to first remove all data from the collection and then start the import. This can be achieved by passing the --overwrite parameter to arangoimport. If it is set to true, any existing data in the collection is removed prior to the import. Note that any existing index definitions for the collection are preserved even if --overwrite is set to true.

    arangoimport --file "data.jsonl" --type jsonl --collection users --overwrite true

Data gets imported into the specified collection in the default database (_system). To specify a different database, use the --server.database option when invoking arangoimport. To import into a database that does not exist yet, pass --create-database true to create it on the fly.
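Because the options described so far can be combined, it can help to assemble the command programmatically, for example in a deployment script. The following Python sketch builds an argument list from the option names used in this section. The function build_import_command and the database name mydb are hypothetical, and the sketch only constructs the command without running it.

```python
def build_import_command(file, collection, file_type="jsonl", **options):
    """Assemble an arangoimport argument list.

    Extra keyword arguments become long options, with underscores turned
    into hyphens (create_collection -> --create-collection). Option names
    containing dots, such as server.database, have to be passed via
    dictionary unpacking. Booleans are rendered as "true" or "false".
    """
    command = [
        "arangoimport",
        "--file", file,
        "--type", file_type,
        "--collection", collection,
    ]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        command += ["--" + name.replace("_", "-"), str(value)]
    return command


command = build_import_command(
    "data.jsonl",
    "users",
    create_collection=True,
    **{"server.database": "mydb", "create-database": True},
)
```

The resulting list can be handed to subprocess.run(command, check=True), which requires arangoimport to be installed and on the PATH.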

The tool also supports parallel imports with multiple threads. Using multiple threads may provide a speedup, especially when using the RocksDB storage engine. To specify the number of parallel threads, use the --threads option:

    arangoimport --threads 4 --file "data.jsonl" --type jsonl --collection users

Using multiple threads may lead to a non-sequential import of the input data. Data that appears later in the input file may be imported earlier than data that appears earlier in the input file. This is normally not a problem but may cause issues when there are data dependencies or duplicates in the import data. In this case, the number of threads should be set to 1.
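Whether an input file contains duplicates can be checked before choosing the number of threads. The following Python sketch assumes that "duplicates" means documents sharing the same _key value, which is one plausible reading and not the only one. It lists the keys that occur more than once in JSONL input. The function name duplicate_keys is hypothetical.

```python
import json
from collections import Counter


def duplicate_keys(lines):
    """Return the sorted _key values that occur more than once.

    `lines` is an iterable of JSONL text lines. Documents without a
    _key attribute are ignored, as are blank lines.
    """
    counts = Counter()
    for line in lines:
        text = line.strip()
        if not text:
            continue
        document = json.loads(text)
        if isinstance(document, dict) and "_key" in document:
            counts[document["_key"]] += 1
    return sorted(key for key, count in counts.items() if count > 1)
```

If the returned list is not empty, importing with --threads 1 keeps the documents in input order.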