ArangoDB Server Compaction Options (MMFiles)
The ArangoDB MMFiles storage engine will run a compaction over data files.
ArangoDB writes documents to the WAL file. Once they have been sealed in the WAL file, the collector may copy them into a per-collection journal file.
Once journal files fill up, they’re sealed to become data files.
One collection may have documents in the WAL logs, its journal file, and an arbitrary number of data files.
If a collection is loaded, each of these files is opened (and thus uses a file handle) and is mmap’ed. Since file handles and memory-mapped files are scarce resources, the number of files should be kept low.
Once you update or remove documents from data files (or already did while it was the journal file), these documents are marked as ‘dead’ with a deletion marker.
Over time the number of dead documents may rise. We don’t want them to keep using the previously mentioned resources, and the disk space they occupy should be given back to the system. Thus several of these files can be combined into one, omitting the dead documents.
Combining several of these data files into one is called compaction. The compaction process reads the alive documents from the original data files and writes them into a new data file.
Once that is done, the memory mappings of the old data files are released and the files are erased.
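The idea can be illustrated with a minimal sketch. This is a toy model only, not ArangoDB's implementation: the `Marker` class, its `dead` flag, and the `compact` function are hypothetical names, and data files are modelled as plain lists.

```python
from dataclasses import dataclass


@dataclass
class Marker:
    """Hypothetical stand-in for one document entry in a data file."""
    key: str
    payload: str
    dead: bool = False  # True once the document was updated or removed


def compact(data_files):
    """Merge several data files into one new file, keeping alive markers only."""
    merged = []
    for data_file in data_files:  # oldest file first
        for marker in data_file:
            if not marker.dead:
                merged.append(marker)
    return merged


old_files = [
    [Marker("a", "v1", dead=True), Marker("b", "v1")],
    [Marker("a", "v2"), Marker("c", "v1", dead=True)],
]
new_file = compact(old_files)
print([(m.key, m.payload) for m in new_file])  # [('b', 'v1'), ('a', 'v2')]
```

After the merge, the two old files could be dropped: the single new file holds every alive document and none of the dead ones.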
Since the compaction locks the collection and also uses I/O resources, it’s carefully configurable under which conditions the system should perform how many of these compaction jobs.
ArangoDB spawns one compactor thread per database. The settings below vary in scope.
Activity control
The activity control parameters alter how frequently the compaction scans for work and executes.
Sleep interval between two compaction runs (in seconds): --compaction.db-sleep-time
The number of seconds the compactor thread will wait between two attempts to search for compactable data files of collections in one database. If the compactor has actually executed work, a subsequent lookup is done.
Scope: Database.
Minimum sleep time between two compaction runs (in seconds): --compaction.min-interval
When an actual compaction was executed for one collection, we wait for this time before we execute the compaction on this collection again. This gives any user load that has piled up in the meantime a chance to be worked off.
Scope: collection.
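How the two timing settings interact can be sketched as follows. This is one reading of the descriptions above, not the server's actual scheduler: the function names are hypothetical, the numbers are illustrative rather than defaults, and the "subsequent lookup" after a productive scan is assumed to happen without sleeping.

```python
def seconds_until_next_scan(did_work, db_sleep_time):
    """Per database: how long the compactor waits before scanning again.

    Assumption: if the last scan compacted something, the next lookup
    follows immediately; otherwise the thread sleeps for db_sleep_time.
    """
    return 0.0 if did_work else db_sleep_time


def collection_is_due(now, last_compaction, min_interval):
    """Per collection: has min_interval passed since its last compaction?

    last_compaction is None if the collection was never compacted.
    """
    if last_compaction is None:
        return True
    return now - last_compaction >= min_interval


print(seconds_until_next_scan(False, 1.0))    # 1.0
print(collection_is_due(105.0, 100.0, 10.0))  # False
print(collection_is_due(110.0, 100.0, 10.0))  # True
```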
Source data files
These parameters control which data files are taken into account for a compaction run. You can specify several criteria, each of which may be sufficient on its own.
The scan over the data files belonging to one collection is executed from the oldest data file to the newest; if files qualify for a compaction, they may be merged with newer files (containing younger documents).
Scope: Collection level, some are influenced by collection settings.
Minimal file size threshold original data files have to be below for a compaction: --compaction.min-small-data-file-size
This threshold controls the total size below which a data file will always be taken into account for the compaction.
Minimum unused count of documents in a data file: --compaction.dead-documents-threshold
Data files will often contain dead documents. This parameter specifies how many dead documents a data file may contain before it qualifies for compaction.
How many bytes of the source data file are allowed to be unused at most: --compaction.dead-size-threshold
The dead data size varies along with the size of your documents. If you have many big documents, this threshold may hit before the document count threshold.
How many percent of the source data file should be unused at least: --compaction.dead-size-percent-threshold
Since the size of the documents may vary, this threshold works on the share of the data file that dead documents occupy. Thus, if you have many huge dead documents, this threshold kicks in earlier.
To give an example with numbers: if the data file contains 800 kbytes of alive and 400 kbytes of dead documents, the share of the dead documents is:
400 / (400 + 800) = 33 %
If this value is higher than the specified threshold, the data file will be compacted.
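The four source-file criteria can be combined in a short sketch. The `DataFileStats` class, the function names, and all threshold values below are hypothetical and chosen for illustration; they are not ArangoDB's defaults. The sketch also assumes the percentage threshold is given as a fraction between 0 and 1, and the exact boundary handling (`>=` versus `>`) is a guess.

```python
from dataclasses import dataclass


@dataclass
class DataFileStats:
    """Hypothetical per-file statistics; all sizes are in bytes."""
    total_size: int
    alive_size: int
    dead_size: int
    dead_count: int


def dead_share(stats):
    """Share of the document data that belongs to dead documents (0..1)."""
    used = stats.alive_size + stats.dead_size
    return stats.dead_size / used if used else 0.0


def qualifies(stats, min_small_size, dead_count_threshold,
              dead_size_threshold, dead_share_threshold):
    """Any single criterion is sufficient to qualify the file."""
    return (
        stats.total_size < min_small_size
        or stats.dead_count >= dead_count_threshold
        or stats.dead_size >= dead_size_threshold
        or dead_share(stats) > dead_share_threshold
    )


KIB = 1024
example = DataFileStats(total_size=1200 * KIB, alive_size=800 * KIB,
                        dead_size=400 * KIB, dead_count=100)
print(round(dead_share(example) * 100))  # 33
print(qualifies(example, min_small_size=128 * KIB,
                dead_count_threshold=16384,
                dead_size_threshold=1024 * KIB,
                dead_share_threshold=0.1))  # True
```

In this example only the share criterion fires: the file is neither small, nor does it exceed the dead count or dead size limits, but a third of its content is dead.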
Compacted target files
Once data files of a collection qualify for a compaction run, these parameters control how many data files are merged into one. A single source data file may also be compacted into one smaller target data file.
Scope: Collection level, some are influenced by collection settings.
Maximum number of files to merge to one file: --compaction.dest-max-files
How many data files (at most) we may merge into one resulting data file during one compaction run.
How large the resulting file may be in comparison to the collection's database.maximal-journal-size setting: --compaction.dest-max-file-size-factor
In ArangoDB you can configure a default journal file size globally and override it on a per-collection level. This value controls the size of compacted data files relative to the configured journal file size of the collection in question.
A factor of 3 means that the compacted file may grow to at most 3 times the maximum journal file size of the collection.
How large may the compaction result file become: --compaction.dest-max-result-file-size
In addition to the factor above, an absolute maximum file size in bytes may be specified. This will overrule all previous parameters.
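The interplay of the factor and the absolute limit can be sketched like this. It is a simplified reading of the two descriptions above, in which the absolute limit simply caps the factor-based size; the function name and the sizes are hypothetical.

```python
def max_target_file_size(journal_size, size_factor, max_result_file_size):
    """Upper bound in bytes for a compacted target file.

    journal_size:          the collection's maximum journal file size
    size_factor:           value of --compaction.dest-max-file-size-factor
    max_result_file_size:  value of --compaction.dest-max-result-file-size
    """
    return min(journal_size * size_factor, max_result_file_size)


MIB = 1024 * 1024
# The factor decides: 3 * 32 MiB stays below the absolute limit.
print(max_target_file_size(32 * MIB, 3, 128 * MIB) // MIB)  # 96
# The absolute limit decides: 3 * 64 MiB would exceed it.
print(max_target_file_size(64 * MIB, 3, 128 * MIB) // MIB)  # 128
```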