- CRUSH Maps
- CRUSH Location
- CRUSH structure
- Modifying the CRUSH map
- Tunables
- argonaut (legacy)
- bobtail (CRUSH_TUNABLES2)
- firefly (CRUSH_TUNABLES3)
- straw_calc_version tunable (introduced with Firefly too)
- hammer (CRUSH_V4)
- jewel (CRUSH_TUNABLES5)
- Which client versions support CRUSH_TUNABLES
- Which client versions support CRUSH_TUNABLES2
- Which client versions support CRUSH_TUNABLES3
- Which client versions support CRUSH_V4
- Which client versions support CRUSH_TUNABLES5
- Warning when tunables are non-optimal
- A few important points
- Tuning CRUSH
- Primary Affinity
CRUSH Maps
The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster. For a detailed discussion of CRUSH, see CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data.
CRUSH maps contain a list of OSDs, a list of ‘buckets’ for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model, and thereby address, potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.
When you deploy OSDs they are automatically placed within the CRUSH map under a host node named with the hostname for the host they are running on. This, combined with the default CRUSH failure domain, ensures that replicas or erasure code shards are separated across hosts and a single host failure will not affect availability. For larger clusters, however, administrators should carefully consider their choice of failure domain. Separating replicas across racks, for example, is common for mid- to large-sized clusters.
CRUSH Location
The location of an OSD in terms of the CRUSH map’s hierarchy is referred to as a crush location. This location specifier takes the form of a list of key and value pairs describing a position. For example, if an OSD is in a particular row, rack, chassis and host, and is part of the ‘default’ CRUSH tree (this is the case for the vast majority of clusters), its crush location could be described as:
- root=default row=a rack=a2 chassis=a2a host=a2a1
Note:
Note that the order of the keys does not matter.
The key name (left of =) must be a valid CRUSH type. By default these include root, datacenter, room, row, pod, pdu, rack, chassis and host, but those types can be customized to be anything appropriate by modifying the CRUSH map. Not all keys need to be specified. For example, by default, Ceph automatically sets a ceph-osd daemon’s location to be root=default host=HOSTNAME (based on the output from hostname -s).
The crush location for an OSD is normally expressed via the crush location config option being set in the ceph.conf file. Each time the OSD starts, it verifies it is in the correct location in the CRUSH map and, if it is not, it moves itself. To disable this automatic CRUSH map management, add the following to your configuration file in the [osd] section:
- osd crush update on start = false
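For reference, the crush location option mentioned above can also be set explicitly in the same file; a minimal ceph.conf sketch for a single OSD (the daemon id and bucket names are illustrative):
- [osd.0]
- crush location = root=default rack=a2 host=a2a1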
Custom location hooks
A customized location hook can be used to generate a more complete crush location on startup. The crush location is based on, in order of preference:
- A crush location option in ceph.conf.
- A default of root=default host=HOSTNAME where the hostname is generated with the hostname -s command.
This is not useful by itself, as the OSD itself has the exact same behavior. However, a script can be written to provide additional location fields (for example, the rack or datacenter), and then the hook enabled via the config option:
- crush location hook = /path/to/customized-ceph-crush-location
This hook is passed several arguments (below) and should output a single line to stdout with the CRUSH location description:
- --cluster CLUSTER --id ID --type TYPE
where the cluster name is typically ‘ceph’, the id is the daemon identifier (e.g., the OSD number), and the daemon type is osd, mds, or similar.
For example, a simple hook that additionally specifies a rack location based on a hypothetical file /etc/rack might be:
- #!/bin/sh
- echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default"
CRUSH structure
The CRUSH map consists of, loosely speaking, a hierarchy describing the physical topology of the cluster, and a set of rules defining policy about how we place data on those devices. The hierarchy has devices (ceph-osd daemons) at the leaves, and internal nodes corresponding to other physical features or groupings: hosts, racks, rows, datacenters, and so on. The rules describe how replicas are placed in terms of that hierarchy (e.g., ‘three replicas in different racks’).
Devices
Devices are individual ceph-osd daemons that can store data. You will normally have one defined here for each OSD daemon in your cluster. Devices are identified by an id (a non-negative integer) and a name, normally osd.N where N is the device id.
Devices may also have a device class associated with them (e.g., hdd or ssd), allowing them to be conveniently targeted by a crush rule.
Types and Buckets
A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, racks, rows, etc. The CRUSH map defines a series of types that are used to describe these nodes. By default, these types include:
osd (or device)
host
chassis
rack
row
pdu
pod
room
datacenter
zone
region
root
Most clusters make use of only a handful of these types, and others can be defined as needed.
The hierarchy is built with devices (normally type osd) at the leaves, interior nodes with non-device types, and a root node of type root. For example, a typical small cluster has a default root containing one host bucket per server, with that server’s OSDs nested beneath it (see the sample ceph osd crush tree output below).
Each node (device or bucket) in the hierarchy has a weight associated with it, indicating the relative proportion of the total data that device or hierarchy subtree should store. Weights are set at the leaves, indicating the size of the device, and automatically sum up the tree from there, such that the weight of the default node will be the total of all devices contained beneath it. Normally weights are in units of terabytes (TB).
You can get a simple view of the CRUSH hierarchy for your cluster, including the weights, with:
- ceph osd crush tree
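For example, on a hypothetical cluster with two OSDs on one host, the output might look roughly like this (IDs, names, classes, and weights are illustrative, and the exact columns vary by release):
ID CLASS WEIGHT  TYPE NAME
-1       1.81940 root default
-3       1.81940     host node1
 0   hdd 0.90970         osd.0
 1   hdd 0.90970         osd.1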
Rules
Rules define policy about how data is distributed across the devices in the hierarchy.
CRUSH rules define placement and replication strategies or distribution policies that allow you to specify exactly how CRUSH places object replicas. For example, you might create a rule selecting a pair of targets for 2-way mirroring, another rule for selecting three targets in two different data centers for 3-way mirroring, and yet another rule for erasure coding over six storage devices. For a detailed discussion of CRUSH rules, refer to CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data, and more specifically to Section 3.2.
In almost all cases, CRUSH rules can be created via the CLI by specifying the pool type they will be used for (replicated or erasure coded), the failure domain, and optionally a device class. In rare cases rules must be written by hand by manually editing the CRUSH map.
You can see what rules are defined for your cluster with:
- ceph osd crush rule ls
You can view the contents of the rules with:
- ceph osd crush rule dump
Device classes
Each device can optionally have a class associated with it. By default, OSDs automatically set their class on startup to either hdd, ssd, or nvme based on the type of device they are backed by.
The device class for one or more OSDs can be explicitly set with:
- ceph osd crush set-device-class <class> <osd-name> [...]
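For example, to explicitly mark two OSDs as SSDs (the OSD ids are illustrative):
- ceph osd crush set-device-class ssd osd.2 osd.3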
Once a device class is set, it cannot be changed to another class until the old class is unset with:
- ceph osd crush rm-device-class <osd-name> [...]
This allows administrators to set device classes without the class being changed on OSD restart or by some other script.
A placement rule that targets a specific device class can be created with:
- ceph osd crush rule create-replicated <rule-name> <root> <failure-domain> <class>
A pool can then be changed to use the new rule with:
- ceph osd pool set <pool-name> crush_rule <rule-name>
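Putting the two commands together, a minimal sketch that creates an SSD-only rule with a host failure domain and switches an existing pool to it (the rule and pool names are illustrative):
- ceph osd crush rule create-replicated ssd-rule default host ssd
- ceph osd pool set mypool crush_rule ssd-rule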
Device classes are implemented by creating a “shadow” CRUSH hierarchy for each device class in use that contains only devices of that class. Rules can then distribute data over the shadow hierarchy. One nice thing about this approach is that it is fully backward compatible with old Ceph clients. You can view the CRUSH hierarchy with shadow items with:
- ceph osd crush tree --show-shadow
For older clusters created before Luminous that relied on manually crafted CRUSH maps to maintain per-device-type hierarchies, there is a reclassify tool available to help transition to device classes without triggering data movement (see Migrating from a legacy SSD rule to device classes).
Weight sets
A weight set is an alternative set of weights to use when calculating data placement. The normal weights associated with each device in the CRUSH map are set based on the device size and indicate how much data we should be storing where. However, because CRUSH is based on a pseudorandom placement process, there is always some variation from this ideal distribution, in the same way that rolling a die sixty times will not result in exactly ten of each face. Weight sets allow the cluster to do a numerical optimization based on the specifics of your cluster (hierarchy, pools, etc.) to achieve a balanced distribution.
There are two types of weight sets supported:
A compat weight set is a single alternative set of weights for each device and node in the cluster. This is not well-suited for correcting for all anomalies (for example, placement groups for different pools may be different sizes and have different load levels, but will be mostly treated the same by the balancer). However, compat weight sets have the huge advantage that they are backward compatible with previous versions of Ceph, which means that even though weight sets were first introduced in Luminous v12.2.z, older clients (e.g., firefly) can still connect to the cluster when a compat weight set is being used to balance data.
A per-pool weight set is more flexible in that it allows placement to be optimized for each data pool. Additionally, weights can be adjusted for each position of placement, allowing the optimizer to correct for a subtle skew of data toward devices with small weights relative to their peers (an effect that is usually only apparent in very large clusters but which can cause balancing problems).
When weight sets are in use, the weights associated with each node in the hierarchy are visible as a separate column (labeled either (compat) or the pool name) in the output of the command:
- ceph osd crush tree
When both compat and per-pool weight sets are in use, data placement for a particular pool will use its own per-pool weight set if present. If not, it will use the compat weight set if present. If neither is present, it will use the normal CRUSH weights.
Although weight sets can be set up and manipulated by hand, it is recommended that the balancer module be enabled to do so automatically.
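As a sketch, enabling the balancer module so that it maintains a compat weight set automatically might look like this (module and mode names as available since Luminous; check your release):
- ceph mgr module enable balancer
- ceph balancer mode crush-compat
- ceph balancer on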
Modifying the CRUSH map
Add/Move an OSD
To add or move an OSD in the CRUSH map of a running cluster:
- ceph osd crush set {name} {weight} root={root} [{bucket-type}={bucket-name} ...]
Where:
name
- Description: The full name of the OSD.
- Type: String
- Required: Yes
- Example: osd.0

weight
- Description: The CRUSH weight for the OSD, normally its size measured in terabytes (TB).
- Type: Double
- Required: Yes
- Example: 2.0

root
- Description: The root node of the tree in which the OSD resides (normally default).
- Type: Key/value pair.
- Required: Yes
- Example: root=default

bucket-type
- Description: You may specify the OSD’s location in the CRUSH hierarchy.
- Type: Key/value pairs.
- Required: No
- Example: datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
The following example adds osd.0 to the hierarchy, or moves the OSD from a previous location:
- ceph osd crush set osd.0 1.0 root=default datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
Adjust OSD weight
To adjust an OSD’s crush weight in the CRUSH map of a running cluster, execute the following:
- ceph osd crush reweight {name} {weight}
Where:
name
- Description: The full name of the OSD.
- Type: String
- Required: Yes
- Example: osd.0

weight
- Description: The CRUSH weight for the OSD.
- Type: Double
- Required: Yes
- Example: 2.0
Remove an OSD
To remove an OSD from the CRUSH map of a running cluster, execute the following:
- ceph osd crush remove {name}
Where:
name
- Description: The full name of the OSD.
- Type: String
- Required: Yes
- Example: osd.0
Add a Bucket
To add a bucket in the CRUSH map of a running cluster, execute the ceph osd crush add-bucket command:
- ceph osd crush add-bucket {bucket-name} {bucket-type}
Where:
bucket-name
- Description: The full name of the bucket.
- Type: String
- Required: Yes
- Example: rack12

bucket-type
- Description: The type of the bucket. The type must already exist in the hierarchy.
- Type: String
- Required: Yes
- Example: rack
The following example adds the rack12 bucket to the hierarchy:
- ceph osd crush add-bucket rack12 rack
Move a Bucket
To move a bucket to a different location or position in the CRUSH map hierarchy, execute the following:
- ceph osd crush move {bucket-name} {bucket-type}={bucket-name}, [...]
Where:
bucket-name
- Description: The name of the bucket to move/reposition.
- Type: String
- Required: Yes
- Example: foo-bar-1

bucket-type
- Description: You may specify the bucket’s location in the CRUSH hierarchy.
- Type: Key/value pairs.
- Required: No
- Example: datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
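For example, to move a host bucket named foo-bar-1 under an existing rack bucket named bar beneath the default root (all names are illustrative):
- ceph osd crush move foo-bar-1 rack=bar root=default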
Remove a Bucket
To remove a bucket from the CRUSH map hierarchy, execute the following:
- ceph osd crush remove {bucket-name}
Note
A bucket must be empty before removing it from the CRUSH hierarchy.
Where:
bucket-name
- Description: The name of the bucket that you’d like to remove.
- Type: String
- Required: Yes
- Example: rack12
The following example removes the rack12 bucket from the hierarchy:
- ceph osd crush remove rack12
Creating a compat weight set
To create a compat weight set:
- ceph osd crush weight-set create-compat
Weights for the compat weight set can be adjusted with:
- ceph osd crush weight-set reweight-compat {name} {weight}
The compat weight set can be destroyed with:
- ceph osd crush weight-set rm-compat
Creating per-pool weight sets
To create a weight set for a specific pool:
- ceph osd crush weight-set create {pool-name} {mode}
Note
Per-pool weight sets require that all servers and daemons run Luminous v12.2.z or later.
Where:
pool-name
- Description: The name of a RADOS pool.
- Type: String
- Required: Yes
- Example: rbd

mode
- Description: Either flat or positional. A flat weight set has a single weight for each device or bucket. A positional weight set has a potentially different weight for each position in the resulting placement mapping. For example, if a pool has a replica count of 3, then a positional weight set will have three weights for each device and bucket.
- Type: String
- Required: Yes
- Example: flat
To adjust the weight of an item in a weight set:
- ceph osd crush weight-set reweight {pool-name} {item-name} {weight [...]}
To list existing weight sets:
- ceph osd crush weight-set ls
To remove a weight set:
- ceph osd crush weight-set rm {pool-name}
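As an end-to-end sketch for a pool named rbd using a flat mode weight set (the adjusted weight is illustrative):
- ceph osd crush weight-set create rbd flat
- ceph osd crush weight-set reweight rbd osd.0 0.9
- ceph osd crush weight-set ls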
Creating a rule for a replicated pool
For a replicated pool, the primary decision when creating the CRUSH rule is what the failure domain is going to be. For example, if a failure domain of host is selected, then CRUSH will ensure that each replica of the data is stored on a different host. If rack is selected, then each replica will be stored in a different rack. What failure domain you choose primarily depends on the size of your cluster and how your hierarchy is structured.
Normally, the entire cluster hierarchy is nested beneath a root node named default. If you have customized your hierarchy, you may want to create a rule nested at some other node in the hierarchy. It doesn’t matter what type is associated with that node (it doesn’t have to be a root node).
It is also possible to create a rule that restricts data placement to a specific class of device. By default, Ceph OSDs automatically classify themselves as either hdd or ssd, depending on the underlying type of device being used. These classes can also be customized.
To create a replicated rule:
- ceph osd crush rule create-replicated {name} {root} {failure-domain-type} [{class}]
Where:
name
- Description: The name of the rule.
- Type: String
- Required: Yes
- Example: rbd-rule

root
- Description: The name of the node under which data should be placed.
- Type: String
- Required: Yes
- Example: default

failure-domain-type
- Description: The type of CRUSH nodes across which we should separate replicas.
- Type: String
- Required: Yes
- Example: rack

class
- Description: The device class data should be placed on.
- Type: String
- Required: No
- Example: ssd
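For example, using the values from the table above, a rule that separates replicas across racks under the default root (with no device class restriction) could be created with:
- ceph osd crush rule create-replicated rbd-rule default rack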
Creating a rule for an erasure coded pool
For an erasure-coded pool, the same basic decisions need to be made as with a replicated pool: what is the failure domain, what node in the hierarchy will data be placed under (usually default), and will placement be restricted to a specific device class. Erasure code pools are created a bit differently, however, because they need to be constructed carefully based on the erasure code being used. For this reason, you must include this information in the erasure code profile. A CRUSH rule will then be created from that, either explicitly or automatically, when the profile is used to create a pool.
The erasure code profiles can be listed with:
- ceph osd erasure-code-profile ls
An existing profile can be viewed with:
- ceph osd erasure-code-profile get {profile-name}
Normally profiles should never be modified; instead, a new profile should be created and used when creating a new pool or creating a new rule for an existing pool.
An erasure code profile consists of a set of key=value pairs. Most of these control the behavior of the erasure code that is encoding data in the pool. Those that begin with crush-, however, affect the CRUSH rule that is created.
The erasure code profile properties of interest are:
crush-root: the name of the CRUSH node to place data under [default: default].
crush-failure-domain: the CRUSH type to separate erasure-coded shards across [default: host].
crush-device-class: the device class to place data on [default: none, meaning all devices are used].
k and m (and, for the lrc plugin, l): these determine the number of erasure code shards, affecting the resulting CRUSH rule.
Once a profile is defined, you can create a CRUSH rule with:
- ceph osd crush rule create-erasure {name} {profile-name}
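As a sketch (the profile and rule names are illustrative), a 4+2 profile that separates shards across racks and restricts placement to hdd devices, followed by a rule created from it:
- ceph osd erasure-code-profile set myprofile k=4 m=2 crush-failure-domain=rack crush-device-class=hdd
- ceph osd crush rule create-erasure ecrule myprofile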
Deleting rules
Rules that are not in use by pools can be deleted with:
- ceph osd crush rule rm {rule-name}
Tunables
Over time, we have made (and continue to make) improvements to the CRUSH algorithm used to calculate the placement of data. In order to support the change in behavior, we have introduced a series of tunable options that control whether the legacy or improved variation of the algorithm is used.
In order to use newer tunables, both clients and servers must support the new version of CRUSH. For this reason, we have created profiles that are named after the Ceph version in which they were introduced. For example, the firefly tunables are first supported in the firefly release, and will not work with older (e.g., dumpling) clients. Once a given set of tunables are changed from the legacy default behavior, the ceph-mon and ceph-osd daemons will prevent older clients that do not support the new CRUSH features from connecting to the cluster.
argonaut (legacy)
The legacy CRUSH behavior used by argonaut and older releases works fine for most clusters, provided there are not too many OSDs that have been marked out.
bobtail (CRUSH_TUNABLES2)
The bobtail tunable profile fixes a few key misbehaviors:
For hierarchies with a small number of devices in the leaf buckets, some PGs map to fewer than the desired number of replicas. This commonly happens for hierarchies with “host” nodes with a small number (1-3) of OSDs nested beneath each one.
For large clusters, some small percentages of PGs map to fewer than the desired number of OSDs. This is more prevalent when there are several layers of the hierarchy (e.g., row, rack, host, osd).
When some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy.
The new tunables are:
choose_local_tries: Number of local retries. Legacy value is 2, optimal value is 0.
choose_local_fallback_tries: Legacy value is 5, optimal value is 0.
choose_total_tries: Total number of attempts to choose an item. Legacy value was 19; subsequent testing indicates that a value of 50 is more appropriate for typical clusters. For extremely large clusters, a larger value might be necessary.
chooseleaf_descend_once: Whether a recursive chooseleaf attempt will retry, or only try once and allow the original placement to retry. Legacy default is 0, optimal value is 1.
Migration impact:
Moving from argonaut to bobtail tunables triggers a moderate amount of data movement. Use caution on a cluster that is already populated with data.
firefly (CRUSH_TUNABLES3)
The firefly tunable profile fixes a problem with the chooseleaf CRUSH rule behavior that tends to result in PG mappings with too few results when too many OSDs have been marked out.
The new tunable is:
chooseleaf_vary_r: Whether a recursive chooseleaf attempt will start with a non-zero value of r, based on how many attempts the parent has already made. Legacy default is 0, but with this value CRUSH is sometimes unable to find a mapping. The optimal value (in terms of computational cost and correctness) is 1.
Migration impact:
For existing clusters that have lots of existing data, changing from 0 to 1 will cause a lot of data to move; a value of 4 or 5 will allow CRUSH to find a valid mapping but will make less data move.
straw_calc_version tunable (introduced with Firefly too)
There were some problems with the internal weights calculated and stored in the CRUSH map for straw buckets. Specifically, when there were items with a CRUSH weight of 0, or both a mix of weights and some duplicated weights, CRUSH would distribute data incorrectly (i.e., not in proportion to the weights).
The new tunable is:
straw_calc_version: A value of 0 preserves the old, broken internal weight calculation; a value of 1 fixes the behavior.
Migration impact:
Moving to straw_calc_version 1 and then adjusting a straw bucket (by adding, removing, or reweighting an item, or by using the reweight-all command) can trigger a small to moderate amount of data movement if the cluster has hit one of the problematic conditions.
This tunable option is special because it has no impact on the kernel version required on the client side.
hammer (CRUSH_V4)
The hammer tunable profile does not affect the mapping of existing CRUSH maps simply by changing the profile. However:
There is a new bucket type (straw2) supported. The new straw2 bucket type fixes several limitations in the original straw bucket. Specifically, the old straw buckets would change some mappings that should not have changed when a weight was adjusted, while straw2 achieves the original goal of only changing mappings to or from the bucket item whose weight has changed.
straw2 is the default for any newly created buckets.
Migration impact:
Changing a bucket type from straw to straw2 will result in a reasonably small amount of data movement, depending on how much the bucket item weights vary from each other. When the weights are all the same no data will move, and when item weights vary significantly there will be more movement.
jewel (CRUSH_TUNABLES5)
The jewel tunable profile improves the overall behavior of CRUSH such that significantly fewer mappings change when an OSD is marked out of the cluster.
The new tunable is:
chooseleaf_stable: Whether a recursive chooseleaf attempt will use a better value for an inner loop that greatly reduces the number of mapping changes when an OSD is marked out. The legacy value is 0, while the new value of 1 uses the new approach.
Migration impact:
Changing this value on an existing cluster will result in a very large amount of data movement, as almost every PG mapping is likely to change.
Which client versions support CRUSH_TUNABLES
argonaut series, v0.48.1 or later
v0.49 or later
Linux kernel version v3.6 or later (for the file system and RBD kernel clients)
Which client versions support CRUSH_TUNABLES2
v0.55 or later, including bobtail series (v0.56.x)
Linux kernel version v3.9 or later (for the file system and RBD kernel clients)
Which client versions support CRUSH_TUNABLES3
v0.78 (firefly) or later
Linux kernel version v3.15 or later (for the file system and RBD kernel clients)
Which client versions support CRUSH_V4
v0.94 (hammer) or later
Linux kernel version v4.1 or later (for the file system and RBD kernel clients)
Which client versions support CRUSH_TUNABLES5
v10.0.2 (jewel) or later
Linux kernel version v4.5 or later (for the file system and RBD kernel clients)
Warning when tunables are non-optimal
Starting with version v0.74, Ceph will issue a health warning if the current CRUSH tunables don’t include all the optimal values from the default profile (see below for the meaning of the default profile). To make this warning go away, you have two options:
- Adjust the tunables on the existing cluster. Note that this will result in some data movement (possibly as much as 10%). This is the preferred route, but should be taken with care on a production cluster where the data movement may affect performance. You can enable optimal tunables with:
- ceph osd crush tunables optimal
If things go poorly (e.g., too much load) and not very much progress has been made, or there is a client compatibility problem (old kernel cephfs or rbd clients, or pre-bobtail librados clients), you can switch back with:
- ceph osd crush tunables legacy
- You can make the warning go away without making any changes to CRUSH by adding the following option to your ceph.conf [mon] section:
- mon warn on legacy crush tunables = false
For the change to take effect, you will need to restart the monitors, or apply the option to running monitors with:
- ceph tell mon.\* config set mon_warn_on_legacy_crush_tunables false
A few important points
Adjusting these values will result in the shift of some PGs between storage nodes. If the Ceph cluster is already storing a lot of data, be prepared for some fraction of the data to move.
The ceph-osd and ceph-mon daemons will start requiring the feature bits of new connections as soon as they get the updated map. However, already-connected clients are effectively grandfathered in, and will misbehave if they do not support the new feature.
If the CRUSH tunables are set to non-legacy values and then later changed back to the default values, ceph-osd daemons will not be required to support the feature. However, the OSD peering process requires examining and understanding old maps. Therefore, you should not run old versions of the ceph-osd daemon if the cluster has previously used non-legacy CRUSH values, even if the latest version of the map has been switched back to using the legacy defaults.
Tuning CRUSH
The simplest way to adjust the crush tunables is by changing to a known profile. Those are:
legacy: the legacy behavior from argonaut and earlier.
argonaut: the legacy values supported by the original argonaut release.
bobtail: the values supported by the bobtail release.
firefly: the values supported by the firefly release.
hammer: the values supported by the hammer release.
jewel: the values supported by the jewel release.
optimal: the best (i.e., optimal) values of the current version of Ceph.
default: the default values of a new cluster installed from scratch. These values, which depend on the current version of Ceph, are hard coded and are generally a mix of optimal and legacy values. They generally match the optimal profile of the previous LTS release, or the most recent release for which we generally expect most users to have up-to-date clients.
You can select a profile on a running cluster with the command:
- ceph osd crush tunables {PROFILE}
Note that this may result in some data movement.
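To inspect the tunable values currently in effect (for example, before and after switching profiles), you can dump them with:
- ceph osd crush show-tunables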
Primary Affinity
When a Ceph Client reads or writes data, it always contacts the primary OSD in the acting set. For set [2, 3, 4], osd.2 is the primary. Sometimes an OSD is not well suited to act as a primary compared to other OSDs (e.g., it has a slow disk or a slow controller). To prevent performance bottlenecks (especially on read operations) while maximizing utilization of your hardware, you can set a Ceph OSD’s primary affinity so that CRUSH is less likely to use the OSD as a primary in an acting set:
- ceph osd primary-affinity <osd-id> <weight>
Primary affinity is 1 by default (i.e., an OSD may act as a primary). You may set an OSD’s primary affinity to a value in the range 0-1, where 0 means that the OSD may NOT be used as a primary and 1 means that an OSD may be used as a primary. When the weight is < 1, it is less likely that CRUSH will select the Ceph OSD Daemon to act as a primary.