crushtool – CRUSH map manipulation tool
Synopsis
crushtool ( -d map | -c map.txt | --build --num_osds numosds layer1 ... | --test ) [ -o outfile ]
Description
crushtool is a utility that lets you create, compile, decompile, and test CRUSH map files.
CRUSH is a pseudo-random data distribution algorithm that efficiently maps input values (which, in the context of Ceph, correspond to Placement Groups) across a heterogeneous, hierarchically structured device map. The algorithm was originally described in detail in the following paper (although it has evolved somewhat since then):
- http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf
The tool has four modes of operation.
--compile|-c
map.txt
- will compile a plaintext map.txt into a binary map file.
--decompile|-d
map
- will take the compiled map and decompile it into a plaintext source file, suitable for editing.
--build
--num_osds {num-osds} layer1 ...
- will create a map with the given layer structure. See below for a detailed explanation.
--test
- will perform a dry run of a CRUSH mapping for a range of input values [--min-x,--max-x] (default [0,1023]) which can be thought of as simulated Placement Groups. See below for a more detailed explanation.
Unlike other Ceph tools, crushtool does not accept generic options such as --debug-crush from the command line. They can, however, be provided via the CEPH_ARGS environment variable. For instance, to silence all output from the CRUSH subsystem:
- CEPH_ARGS="--debug-crush 0" crushtool ...
Running tests with --test
The test mode will use the input crush map (as specified with -i map) and perform a dry run of CRUSH mapping or random placement (if --simulate is set). On completion, two kinds of reports can be created:
1) The --show-... options output human-readable information on stderr.
2) The --output-csv option creates CSV files that are documented by the --help-output option.
Note: Each Placement Group (PG) has an integer ID which can be obtained from ceph pg dump (for example, PG 2.2f means pool id 2, PG id 0x2f, which is 47 in decimal). The pool and PG IDs are combined by a function to get a value which is given to CRUSH to map it to OSDs. crushtool does not know about PGs or pools; it only runs simulations by mapping values in the range [--min-x,--max-x].
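To make the identifier format in the note concrete, here is a minimal Python sketch that splits a PG identifier into its pool and PG parts. The helper name parse_pgid is hypothetical. The sketch only decodes the text form (decimal pool id, hexadecimal PG id) and does not reproduce the function Ceph uses to combine the two into a CRUSH input value.

```python
from typing import Tuple


def parse_pgid(pgid: str) -> Tuple[int, int]:
    """Split a PG identifier such as '2.2f' into (pool id, PG id)."""
    pool, _, pg = pgid.partition(".")
    # The pool id is written in decimal, the PG part in hexadecimal.
    return int(pool), int(pg, 16)


print(parse_pgid("2.2f"))  # (2, 47)
```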
--show-statistics
- Displays a summary of the distribution. For instance:
- rule 1 (metadata) num_rep 5 result size == 5: 1024/1024
shows that rule 1, which is named metadata, successfully mapped 1024 values to result size == 5 devices when trying to map them to num_rep 5 replicas. When it fails to provide the required mapping, presumably because the number of tries must be increased, a breakdown of the failures is displayed. For instance:
- rule 1 (metadata) num_rep 10 result size == 8: 4/1024
- rule 1 (metadata) num_rep 10 result size == 9: 93/1024
- rule 1 (metadata) num_rep 10 result size == 10: 927/1024
shows that although num_rep 10 replicas were required, 4 out of 1024 values ( 4/1024 ) were mapped to result size == 8 devices only.
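The breakdown above can be totalled with a short script. The following is a sketch in Python, assuming the summary lines have exactly the layout shown in the example. The function name short_mappings is hypothetical and not part of crushtool.

```python
import re

# Matches lines such as:
#   rule 1 (metadata) num_rep 10 result size == 8: 4/1024
STAT_RE = re.compile(
    r"rule (\d+) \((.+?)\) num_rep (\d+) result size == (\d+):\s*(\d+)/(\d+)"
)


def short_mappings(lines):
    """Count values per (rule, num_rep) that got fewer devices than requested."""
    short = {}
    for line in lines:
        m = STAT_RE.search(line)
        if m is None:
            continue
        rule, num_rep, size, count = (int(m.group(i)) for i in (1, 3, 4, 5))
        key = (rule, num_rep)
        short.setdefault(key, 0)
        if size < num_rep:
            short[key] += count
    return short


sample = [
    "rule 1 (metadata) num_rep 10 result size == 8: 4/1024",
    "rule 1 (metadata) num_rep 10 result size == 9: 93/1024",
    "rule 1 (metadata) num_rep 10 result size == 10: 927/1024",
]
print(short_mappings(sample))  # {(1, 10): 97}
```

In the example, 97 of the 1024 values (4 + 93) received fewer than the 10 requested devices.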
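--show-mappings
- Displays the mapping of each value in the range [--min-x,--max-x]. For instance:
- CRUSH rule 1 x 24 [11,6]
shows that value 24 is mapped to devices [11,6] by rule 1.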
--show-bad-mappings
- Displays which values failed to be mapped to the required number of devices. For instance:
- bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]
shows that when rule 1 was required to map 7 devices, it could map only six: [8,10,2,11,6,9].
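A line of this form can be picked apart mechanically. The following Python sketch uses a hypothetical helper, parse_bad_mapping, and assumes the result list contains only the devices that were actually mapped, as in the example above.

```python
import re

BAD_RE = re.compile(
    r"bad mapping rule (\d+) x (\d+) num_rep (\d+) result \[([\d,]*)\]"
)


def parse_bad_mapping(line):
    """Return the fields of a 'bad mapping' line, or None if it does not match."""
    m = BAD_RE.search(line)
    if m is None:
        return None
    rule, x, num_rep = (int(m.group(i)) for i in (1, 2, 3))
    devices = [int(d) for d in m.group(4).split(",") if d]
    return {
        "rule": rule,
        "x": x,
        "num_rep": num_rep,
        "devices": devices,
        # How many devices short of the requested replica count.
        "missing": num_rep - len(devices),
    }


info = parse_bad_mapping("bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]")
print(info["x"], info["missing"])  # 781 1
```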
--show-utilization
- Displays the expected and actual utilization for each device, for each number of replicas. For instance:
- device 0: stored : 951 expected : 853.333
- device 1: stored : 963 expected : 853.333
- ...
shows that device 0 stored 951 values and was expected to store 853. Implies --show-statistics.
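One way to read these figures is as a relative deviation from the expected count. The following Python sketch assumes the line layout shown in the example. The function name deviations is hypothetical.

```python
import re

UTIL_RE = re.compile(
    r"device (\d+):\s*stored\s*:\s*(\d+)\s*expected\s*:\s*([\d.]+)"
)


def deviations(lines):
    """Map device id -> (stored - expected) / expected."""
    result = {}
    for line in lines:
        m = UTIL_RE.search(line)
        if m is None:
            continue
        stored, expected = int(m.group(2)), float(m.group(3))
        if expected > 0:  # skip devices that are not expected to store anything
            result[int(m.group(1))] = (stored - expected) / expected
    return result


sample = [
    "device 0: stored : 951 expected : 853.333",
    "device 1: stored : 963 expected : 853.333",
]
for dev, ratio in deviations(sample).items():
    print(f"device {dev}: {ratio:+.1%}")
# device 0: +11.4%
# device 1: +12.9%
```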
--show-utilization-all
- Displays the same as --show-utilization but does not suppress output when the weight of a device is zero. Implies --show-statistics.
--show-choose-tries
- Displays how many attempts were needed to find a device mapping. For instance:
- 0: 95224
- 1: 3745
- 2: 2225
- ..
shows that 95224 mappings succeeded without retries, 3745 mappings succeeded after one retry, etc. There are as many rows as the value of the --set-choose-total-tries option.
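The histogram can be loaded into a dictionary for further analysis. The sketch below is in Python and parse_tries is a hypothetical name. Because the example output is truncated, the totals it computes cover only the three rows shown, not a full run.

```python
def parse_tries(lines):
    """Parse 'retries: count' rows into a {retries: count} dictionary."""
    hist = {}
    for line in lines:
        left, sep, right = line.partition(":")
        if sep and left.strip().isdigit() and right.strip().isdigit():
            hist[int(left)] = int(right)
    return hist


sample = ["0: 95224", "1: 3745", "2: 2225", ".."]
hist = parse_tries(sample)
total = sum(hist.values())
# Share of the listed mappings that needed no retry at all.
print(f"{hist[0]} of {total} listed mappings needed no retry")
# 95224 of 101194 listed mappings needed no retry
```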
--output-csv
- Creates CSV files (in the current directory) containing information documented by --help-output. The files are named after the rule used when collecting the statistics. For instance, if the rule 'metadata' is used, the CSV files will be:
- metadata-absolute_weights.csv
- metadata-device_utilization.csv
- ...
The first line of each file briefly explains the column layout. For instance:
- metadata-absolute_weights.csv
- Device ID, Absolute Weight
- 0,1
- ...
--output-name
NAME
- Prepend NAME to the file names generated when --output-csv is specified. For instance, --output-name FOO will create files:
- FOO-metadata-absolute_weights.csv
- FOO-metadata-device_utilization.csv
- ...
The --set-... options can be used to modify the tunables of the input crush map. The input crush map is modified in memory. For example:
- $ crushtool -i mymap --test --show-bad-mappings
- bad mapping rule 1 x 781 num_rep 7 result [8,10,2,11,6,9]
could be fixed by increasing the choose-total-tries as follows:
- $ crushtool -i mymap --test \
- --show-bad-mappings --set-choose-total-tries 500
Building a map with --build
The build mode will generate hierarchical maps. The first argument specifies the number of devices (leaves) in the CRUSH hierarchy. Each layer describes how the layer (or devices) preceding it should be grouped.
Each layer consists of:
- bucket ( uniform | list | tree | straw | straw2 ) size
The first component, bucket, is the type name of the buckets in the layer (e.g. “rack”). Each bucket name will be built by appending a unique number to the bucket string (e.g. “rack0”, “rack1”…).
The second component is the type of bucket: straw should be used most of the time.
The third component is the maximum size of the bucket. A size of zero means a bucket of infinite capacity.
Example
Suppose we have two rows with two racks each and 20 nodes per rack. Suppose each node contains 4 storage devices for Ceph OSD Daemons. This configuration allows us to deploy 320 Ceph OSD Daemons. Let's assume a 42U rack with 2U nodes, leaving an extra 2U for a rack switch.
To reflect our hierarchy of devices, nodes, racks and rows, we would execute the following:
- $ crushtool -o crushmap --build --num_osds 320 \
- node straw 4 \
- rack straw 20 \
- row straw 2 \
- root straw 0
- # id weight type name reweight
- -87 320 root root
- -85 160 row row0
- -81 80 rack rack0
- -1 4 node node0
- 0 1 osd.0 1
- 1 1 osd.1 1
- 2 1 osd.2 1
- 3 1 osd.3 1
- -2 4 node node1
- 4 1 osd.4 1
- 5 1 osd.5 1
- ...
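The number of buckets in each layer follows from the layer sizes. The Python sketch below works through the arithmetic for this example. The function name layer_counts is hypothetical, and rounding up when a layer does not divide evenly is an assumption (the division is exact here).

```python
import math


def layer_counts(num_osds, layers):
    """Return [(bucket name, number of buckets)] for each --build layer."""
    counts = []
    remaining = num_osds  # items to be grouped by the current layer
    for name, size in layers:
        # A size of zero means a single bucket holding everything below it.
        n = 1 if size == 0 else math.ceil(remaining / size)
        counts.append((name, n))
        remaining = n
    return counts


layers = [("node", 4), ("rack", 20), ("row", 2), ("root", 0)]
counts = layer_counts(320, layers)
print(counts)  # [('node', 80), ('rack', 4), ('row', 2), ('root', 1)]
print(sum(n for _, n in counts))  # 87
```

The total of 87 buckets matches the root ID of -87 in the output above, and rack0 (-81) and row0 (-85) fall right after the 80 nodes and the 4 racks respectively. This suggests that bucket IDs are assigned as consecutive negative numbers, layer by layer. That is an observation drawn from this sample, not a documented guarantee.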
CRUSH rules are created so the generated crushmap can be tested. They are the same rules as the ones created by default when creating a new Ceph cluster. They can be further edited with:
- # decompile
- crushtool -d crushmap -o map.txt
- # edit
- emacs map.txt
- # recompile
- crushtool -c map.txt -o crushmap
Reclassify
The reclassify function allows users to transition from older maps thatmaintain parallel hierarchies for OSDs of different types to a modern CRUSHmap that makes use of the device class feature. For more information,see http://docs.ceph.com/docs/master/rados/operations/crush-map-edits/#migrating-from-a-legacy-ssd-rule-to-device-classes.
Example output from --test
See https://github.com/ceph/ceph/blob/master/src/test/cli/crushtool/set-choose.t for sample crushtool --test commands and the output they produce.
Availability
crushtool is part of Ceph, a massively scalable, open-source, distributed storage system. Please refer to the Ceph documentation at http://ceph.com/docs for more information.
See also
ceph(8), osdmaptool(8)
Authors
John Wilkins, Sage Weil, Loic Dachary