Placement Groups

Autoscaling placement groups

Placement groups (PGs) are an internal implementation detail of how Ceph distributes data. You can allow the cluster to either make recommendations or automatically tune PGs based on how the cluster is used by enabling pg-autoscaling.

Each pool in the system has a pg_autoscale_mode property that can be set to off, on, or warn.

  • off: Disable autoscaling for this pool. It is up to the administrator to choose an appropriate PG number for each pool. Please refer to Choosing the number of Placement Groups for more information.

  • on: Enable automated adjustments of the PG count for the given pool.

  • warn: Raise health alerts when the PG count should be adjusted.

To set the autoscaling mode for existing pools:

  ceph osd pool set <pool-name> pg_autoscale_mode <mode>

For example, to enable autoscaling on pool foo:

  ceph osd pool set foo pg_autoscale_mode on
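
To confirm the mode now in effect, the property can usually be read back from the pool (a minimal check, assuming the pool foo from the example above):

  ceph osd pool get foo pg_autoscale_mode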

You can also configure the default pg_autoscale_mode that is applied to any pools that are created in the future with:

  ceph config set global osd_pool_default_pg_autoscale_mode <mode>

Viewing PG scaling recommendations

You can view each pool, its relative utilization, and any suggested changes to the PG count with this command:

  ceph osd pool autoscale-status

Output will be something like:

  POOL  SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  PG_NUM  NEW PG_NUM  AUTOSCALE
  a     12900M               3.0   82431M        0.4695                8       128         warn
  c     0                    3.0   82431M        0.0000  0.2000        1       64          warn
  b     0       953.6M       3.0   82431M        0.0347                8                   warn

SIZE is the amount of data stored in the pool. TARGET SIZE, if present, is the amount of data the administrator has specified that they expect to eventually be stored in this pool. The system uses the larger of the two values for its calculation.

RATE is the multiplier for the pool that determines how much raw storage capacity is consumed. For example, a 3-replica pool will have a ratio of 3.0, while a k=4,m=2 erasure coded pool will have a ratio of 1.5.

RAW CAPACITY is the total amount of raw storage capacity on the OSDs that are responsible for storing this pool’s (and perhaps other pools’) data. RATIO is the ratio of that total capacity that this pool is consuming (i.e., ratio = size * rate / raw capacity).
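
As a worked example, pool a in the sample output above has ratio = 12900M * 3.0 / 82431M ≈ 0.4695. The figures are taken from the sample output, not from a live cluster; a quick shell check:

  # RATIO = SIZE * RATE / RAW CAPACITY, using pool "a" from the sample output
  awk 'BEGIN { print 12900 * 3.0 / 82431 }'   # ~0.4695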

TARGET RATIO, if present, is the ratio of storage that the administrator has specified that they expect this pool to consume. The system uses the larger of the actual ratio and the target ratio for its calculation. If both target size bytes and ratio are specified, the ratio takes precedence.

PG_NUM is the current number of PGs for the pool (or the current number of PGs that the pool is working towards, if a pg_num change is in progress). NEW PG_NUM, if present, is what the system believes the pool’s pg_num should be changed to. It is always a power of 2, and will only be present if the “ideal” value varies from the current value by more than a factor of 3.

The final column, AUTOSCALE, is the pool’s pg_autoscale_mode, and will be either on, off, or warn.

Automated scaling

Allowing the cluster to automatically scale PGs based on usage is the simplest approach. Ceph will look at the total available storage and the target number of PGs for the whole system, look at how much data is stored in each pool, and try to apportion the PGs accordingly. The system is relatively conservative with its approach, only making changes to a pool when the current number of PGs (pg_num) is more than 3 times off from what it thinks it should be.

The target number of PGs per OSD is based on the mon_target_pg_per_osd configurable (default: 100), which can be adjusted with:

  ceph config set global mon_target_pg_per_osd 100
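
As a rough illustration of what this target implies (a sketch only, assuming a hypothetical 10-OSD cluster with a single 3x-replicated pool; in practice the autoscaler performs this calculation for you):

  # Approximate cluster-wide PG budget implied by mon_target_pg_per_osd
  osds=10; target_pg_per_osd=100; pool_size=3
  echo $(( osds * target_pg_per_osd / pool_size ))   # ~333 PGs to share across all pools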

The autoscaler analyzes pools and adjusts on a per-subtree basis. Because each pool may map to a different CRUSH rule, and each rule may distribute data across different devices, Ceph will consider utilization of each subtree of the hierarchy independently. For example, a pool that maps to OSDs of class ssd and a pool that maps to OSDs of class hdd will each have optimal PG counts that depend on the number of those respective device types.
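
For illustration, pools typically end up in different subtrees by being assigned device-class-specific CRUSH rules, roughly as follows (the rule and pool names here are hypothetical, and exact command syntax may vary between releases):

  # Create one replicated rule per device class and point pools at them
  ceph osd crush rule create-replicated fast-rule default host ssd
  ceph osd crush rule create-replicated slow-rule default host hdd
  ceph osd pool set fast-pool crush_rule fast-rule
  ceph osd pool set slow-pool crush_rule slow-rule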

Specifying expected pool size

When a cluster or pool is first created, it will consume a small fraction of the total cluster capacity and will appear to the system as if it should only need a small number of placement groups. However, in most cases cluster administrators have a good idea which pools are expected to consume most of the system capacity over time. By providing this information to Ceph, a more appropriate number of PGs can be used from the beginning, preventing subsequent changes in pg_num and the overhead associated with moving data around when those adjustments are made.

The target size of a pool can be specified in two ways: either in terms of the absolute size of the pool (i.e., bytes), or as a ratio of the total cluster capacity.

For example:

  ceph osd pool set mypool target_size_bytes 100T

will tell the system that mypool is expected to consume 100 TiB of space. Alternatively:

  ceph osd pool set mypool target_size_ratio .9

will tell the system that mypool is expected to consume 90% of the total cluster capacity.

You can also set the target size of a pool at creation time with the optional --target-size-bytes <bytes> or --target-size-ratio <ratio> arguments to the ceph osd pool create command.
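
For example (the pool name bigpool is hypothetical; the flags are the ones described above, and pg_num is omitted so the autoscaler can choose it):

  # Declare the expected share of the cluster at creation time
  ceph osd pool create bigpool --target-size-ratio .9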

Note that if impossible target size values are specified (for example, a capacity larger than the total cluster, or ratios that sum to more than 1.0) then a health warning (POOL_TARGET_SIZE_RATIO_OVERCOMMITTED or POOL_TARGET_SIZE_BYTES_OVERCOMMITTED) will be raised.

Specifying bounds on a pool’s PGs

It is also possible to specify a minimum number of PGs for a pool. This is useful for establishing a lower bound on the amount of parallelism clients will see when doing IO, even when a pool is mostly empty. Setting the lower bound prevents Ceph from reducing (or recommending you reduce) the PG number below the configured number.

You can set the minimum number of PGs for a pool with:

  ceph osd pool set <pool-name> pg_num_min <num>

You can also specify the minimum PG count at pool creation time with the optional --pg-num-min <num> argument to the ceph osd pool create command.
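
For example (the pool name smallpool is hypothetical), to keep at least 64 PGs either at creation time or on an existing pool:

  # Guarantee a floor of 64 PGs even while the pool holds little data
  ceph osd pool create smallpool --pg-num-min 64
  ceph osd pool set smallpool pg_num_min 64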

A preselection of pg_num

When creating a new pool with:

  ceph osd pool create {pool-name} [pg_num]

it is optional to choose the value of pg_num. If you do not specify pg_num, the cluster can (by default) automatically tune it for you based on how much data is stored in the pool (see above, Autoscaling placement groups).

Alternatively, pg_num can be explicitly provided. However, whether you specify a pg_num value or not does not affect whether the value is automatically tuned by the cluster after the fact. To enable or disable auto-tuning:

  ceph osd pool set {pool-name} pg_autoscale_mode (on|off|warn)

The “rule of thumb” for PGs per OSD has traditionally been 100. With the addition of the balancer (which is also enabled by default), a value of more like 50 PGs per OSD is probably reasonable. The challenge (which the autoscaler normally handles for you) is to:

  • have the PGs per pool proportional to the data in the pool, and

  • end up with 50-100 PGs per OSD, after the replication or erasure-coding fan-out of each PG across OSDs is taken into consideration (see the sketch after this list)
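
A back-of-the-envelope version of that calculation, assuming a hypothetical 20-OSD cluster with a single 3x-replicated pool and aiming for roughly 100 PGs per OSD:

  # PGs for the pool ≈ (OSDs * target PGs per OSD) / replication fan-out
  osds=20; target=100; size=3
  echo $(( osds * target / size ))   # 666 -> round up to the next power of two: 1024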

How are Placement Groups used?

A placement group (PG) aggregates objects within a pool because tracking object placement and object metadata on a per-object basis is computationally expensive; i.e., a system with millions of objects cannot realistically track placement on a per-object basis.

[Placement Groups - Figure 1]

The Ceph client will calculate which placement group an object should be in. It does this by hashing the object ID and applying an operation based on the number of PGs in the defined pool and the ID of the pool. See Mapping PGs to OSDs for details.

The object’s contents within a placement group are stored in a set of OSDs. For instance, in a replicated pool of size two, each placement group will store objects on two OSDs, as shown below.

[Placement Groups - Figure 2]

Should OSD #2 fail, another will be assigned to Placement Group #1 and will be filled with copies of all objects in OSD #1. If the pool size is changed from two to three, an additional OSD will be assigned to the placement group and will receive copies of all objects in the placement group.

Placement groups do not own the OSD; they share it with other placement groups from the same pool or even other pools. If OSD #2 fails, Placement Group #2 will also have to restore copies of its objects, using OSD #3.

When the number of placement groups increases, the new placement groups will be assigned OSDs. The result of the CRUSH function will also change and some objects from the former placement groups will be copied over to the new Placement Groups and removed from the old ones.

Placement Groups Tradeoffs

Data durability and even distribution among all OSDs call for more placement groups, but their number should be kept to the minimum required in order to save CPU and memory.

Data durability

After an OSD fails, the risk of data loss increases until the data it contained is fully recovered. Let’s imagine a scenario that causes permanent data loss in a single placement group:

  • The OSD fails and all copies of the objects it contains are lost. For all objects within the placement group, the number of replicas suddenly drops from three to two.

  • Ceph starts recovery for this placement group by choosing a new OSD to re-create the third copy of all objects.

  • Another OSD, within the same placement group, fails before the new OSD is fully populated with the third copy. Some objects will then have only one surviving copy.

  • Ceph picks yet another OSD and keeps copying objects to restore the desired number of copies.

  • A third OSD, within the same placement group, fails before recovery is complete. If this OSD contained the only remaining copy of an object, it is permanently lost.

In a cluster containing 10 OSDs with 512 placement groups in a three-replica pool, CRUSH will give each placement group three OSDs. In the end, each OSD will end up hosting (512 * 3) / 10 = ~150 placement groups. When the first OSD fails, the above scenario will therefore start recovery for all 150 placement groups at the same time.

The 150 placement groups being recovered are likely to be homogeneously spread over the 9 remaining OSDs. Each remaining OSD is therefore likely to send copies of objects to all others and also receive some new objects to be stored because they became part of a new placement group.

The amount of time it takes for this recovery to complete entirely depends on the architecture of the Ceph cluster. Let’s say each OSD is hosted by a 1TB SSD on a single machine, all of them are connected to a 10Gb/s switch, and the recovery for a single OSD completes within M minutes. If there are two OSDs per machine using spinners with no SSD journal and a 1Gb/s switch, it will be at least an order of magnitude slower.

In a cluster of this size, the number of placement groups has almost no influence on data durability. It could be 128 or 8192 and the recovery would not be slower or faster.

However, growing the same Ceph cluster to 20 OSDs instead of 10 OSDs is likely to speed up recovery and therefore improve data durability significantly. Each OSD now participates in only ~75 placement groups instead of ~150 when there were only 10 OSDs, and it will still require all 19 remaining OSDs to perform the same amount of object copies in order to recover. But where 10 OSDs had to copy approximately 100GB each, they now have to copy 50GB each instead. If the network was the bottleneck, recovery will happen twice as fast. In other words, recovery goes faster when the number of OSDs increases.

If this cluster grows to 40 OSDs, each of them will only host ~35 placement groups. If an OSD dies, recovery will keep going faster unless it is blocked by another bottleneck. However, if this cluster grows to 200 OSDs, each of them will only host ~7 placement groups. If an OSD dies, recovery will happen among at most ~21 (7 * 3) OSDs in these placement groups: recovery will take longer than when there were 40 OSDs, meaning the number of placement groups should be increased.

No matter how short the recovery time is, there is a chance for a second OSD to fail while it is in progress. In the 10-OSD cluster described above, if any of them fails, then ~17 placement groups (i.e. ~150 / 9 placement groups being recovered) will have only one surviving copy. And if any of the 8 remaining OSDs fails, the last objects of two placement groups are likely to be lost (i.e. ~17 / 8 placement groups with only one remaining copy being recovered).

When the size of the cluster grows to 20 OSDs, the number of Placement Groups damaged by the loss of three OSDs drops. The second OSD lost will degrade ~4 (i.e. ~75 / 19 placement groups being recovered) instead of ~17, and the third OSD lost will only lose data if it is one of the four OSDs containing the surviving copy. In other words, if the probability of losing one OSD is 0.0001% during the recovery time frame, it goes from 17 * 10 * 0.0001% in the cluster with 10 OSDs to 4 * 20 * 0.0001% in the cluster with 20 OSDs.

In a nutshell, more OSDs mean faster recovery and a lower risk of cascading failures leading to the permanent loss of a Placement Group. Having 512 or 4096 Placement Groups is roughly equivalent in a cluster with fewer than 50 OSDs as far as data durability is concerned.

Note: It may take a long time for a new OSD added to the cluster to be populated with placement groups that were assigned to it. However, there is no degradation of any object, and it has no impact on the durability of the data contained in the cluster.

Object distribution within a pool

Ideally objects are evenly distributed in each placement group. Since CRUSH computes the placement group for each object, but does not actually know how much data is stored in each OSD within this placement group, the ratio between the number of placement groups and the number of OSDs may influence the distribution of the data significantly.

For instance, if there was a single placement group for ten OSDs in a three-replica pool, only three OSDs would be used because CRUSH would have no other choice. When more placement groups are available, objects are more likely to be evenly spread among them. CRUSH also makes every effort to evenly spread OSDs among all existing Placement Groups.

As long as there are one or two orders of magnitude more Placement Groups than OSDs, the distribution should be even. For instance, 256 placement groups for 3 OSDs, 512 or 1024 placement groups for 10 OSDs, etc.

Uneven data distribution can be caused by factors other than the ratio between OSDs and placement groups. Since CRUSH does not take into account the size of the objects, a few very large objects may create an imbalance. Let’s say one million 4K objects totaling 4GB are evenly spread among 1024 placement groups on 10 OSDs. They will use 4GB / 10 = 400MB on each OSD. If one 400MB object is added to the pool, the three OSDs supporting the placement group in which the object has been placed will be filled with 400MB + 400MB = 800MB while the seven others will remain occupied with only 400MB.

Memory, CPU and network usage

For each placement group, OSDs and MONs need memory, network and CPU at all times, and even more during recovery. Sharing this overhead by clustering objects within a placement group is one of the main reasons they exist.

Minimizing the number of placement groups saves significant amounts of resources.

Choosing the number of Placement Groups

If you have more than 50 OSDs, we recommend approximately 50-100 placement groups per OSD to balance out resource usage, data durability and distribution. If you have fewer than 50 OSDs, choosing among the preselection above is best. For a single pool of objects, you can use the following formula to get a baseline:

               (OSDs * 100)
  Total PGs =  ------------
                pool size

Where pool size is either the number of replicas for replicated pools or the K+M sum for erasure coded pools (as returned by ceph osd erasure-code-profile get).

You should then check whether the result makes sense with the way you designed your Ceph cluster to maximize data durability and object distribution and to minimize resource usage.

The result should always be rounded up to the nearest power of two.

Only a power of two will evenly balance the number of objects among placement groups. Other values will result in an uneven distribution of data across your OSDs. Their use should be limited to incrementally stepping from one power of two to another.

As an example, for a cluster with 200 OSDs and a pool size of 3 replicas, you would estimate your number of PGs as follows:

  (200 * 100)
  ----------- = 6667. Nearest power of 2: 8192
       3
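
The rounding step can be scripted; a small sketch (the 6667 figure comes from the example above):

  # Round a raw PG estimate up to the next power of two
  pgs=6667
  p=1
  while [ "$p" -lt "$pgs" ]; do p=$(( p * 2 )); done
  echo "$p"   # 8192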

When using multiple data pools for storing objects, you need to ensure that you balance the number of placement groups per pool with the number of placement groups per OSD so that you arrive at a reasonable total number of placement groups that provides reasonably low variance per OSD without taxing system resources or making the peering process too slow.

For instance, a cluster of 10 pools, each with 512 placement groups on ten OSDs, has a total of 5,120 placement groups spread over ten OSDs, that is, 512 placement groups per OSD. That does not use too many resources. However, if 1,000 pools were created with 512 placement groups each, the OSDs would handle ~50,000 placement groups each and it would require significantly more resources and time for peering.

You may find the PGCalc tool helpful.

Set the Number of Placement Groups

To set the number of placement groups in a pool, you must specify the number of placement groups at the time you create the pool. See Create a Pool for details. Even after a pool has been created, you can change the number of placement groups with:

  ceph osd pool set {pool-name} pg_num {pg_num}

After you increase the number of placement groups, you must also increase the number of placement groups for placement (pgp_num) before your cluster will rebalance. The pgp_num is the number of placement groups that will be considered for placement by the CRUSH algorithm. Increasing pg_num splits the placement groups, but data will not be migrated to the newer placement groups until the number of placement groups for placement (i.e. pgp_num) is increased. pgp_num should be equal to pg_num. To increase the number of placement groups for placement, execute the following:

  ceph osd pool set {pool-name} pgp_num {pgp_num}

When decreasing the number of PGs, pgp_num is adjusted automatically for you.
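
Putting the two steps together for a hypothetical pool named mypool, an increase would typically look like this:

  # Split the PGs first, then raise pgp_num so CRUSH starts placing data in them
  ceph osd pool set mypool pg_num 128
  ceph osd pool set mypool pgp_num 128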

Get the Number of Placement Groups

To get the number of placement groups in a pool, execute the following:

  ceph osd pool get {pool-name} pg_num

Get a Cluster’s PG Statistics

To get the statistics for the placement groups in your cluster, execute the following:

  ceph pg dump [--format {format}]

Valid formats are plain (default) and json.

Get Statistics for Stuck PGs

To get the statistics for all placement groups stuck in a specified state, execute the following:

  ceph pg dump_stuck inactive|unclean|stale|undersized|degraded [--format <format>] [-t|--threshold <seconds>]

Inactive Placement groups cannot process reads or writes because they are waiting for an OSD with the most up-to-date data to come up and in.

Unclean Placement groups contain objects that are not replicated the desired number of times. They should be recovering.

Stale Placement groups are in an unknown state - the OSDs that host them have not reported to the monitor cluster in a while (configured by mon_osd_report_timeout).

Valid formats are plain (default) and json. The threshold defines the minimum number of seconds the placement group is stuck before including it in the returned statistics (default 300 seconds).
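
For example, to list placement groups that have been stuck in the undersized state for at least ten minutes, as JSON:

  ceph pg dump_stuck undersized --format json --threshold 600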

Get a PG Map

To get the placement group map for a particular placement group, execute the following:

  ceph pg map {pg-id}

For example:

  ceph pg map 1.6c

Ceph will return the placement group map, the placement group, and the OSD status:

  osdmap e13 pg 1.6c (1.6c) -> up [1,0] acting [1,0]

Get a PG’s Statistics

To retrieve statistics for a particular placement group, execute the following:

  ceph pg {pg-id} query

Scrub a Placement Group

To scrub a placement group, execute the following:

  ceph pg scrub {pg-id}

Ceph checks the primary and any replica nodes, generates a catalog of all objects in the placement group, and compares them to ensure that no objects are missing or mismatched and that their contents are consistent. Assuming the replicas all match, a final semantic sweep ensures that all of the snapshot-related object metadata is consistent. Errors are reported via logs.

To scrub all placement groups from a specific pool, execute the following:

  ceph osd pool scrub {pool-name}

Prioritize backfill/recovery of a Placement Group(s)

You may run into a situation where a bunch of placement groups will require recovery and/or backfill, and some particular groups hold data more important than others (for example, those PGs may hold data for images used by running machines while other PGs may be used by inactive machines/less relevant data). In that case, you may want to prioritize recovery of those groups so performance and/or availability of data stored on those groups is restored earlier. To do this (mark particular placement group(s) as prioritized during backfill or recovery), execute the following:

  ceph pg force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will cause Ceph to perform recovery or backfill on the specified placement groups first, before other placement groups. This does not interrupt currently ongoing backfills or recovery, but causes the specified PGs to be processed as soon as possible. If you change your mind or prioritized the wrong groups, use:

  ceph pg cancel-force-recovery {pg-id} [{pg-id #2}] [{pg-id #3} ...]
  ceph pg cancel-force-backfill {pg-id} [{pg-id #2}] [{pg-id #3} ...]

This will remove the “force” flag from those PGs, and they will be processed in the default order. Again, this doesn’t affect currently processed placement groups, only those that are still queued.

The “force” flag is cleared automatically after recovery or backfill of the group is done.

Similarly, you may use the following commands to force Ceph to perform recovery or backfill on all placement groups from a specified pool first:

  ceph osd pool force-recovery {pool-name}
  ceph osd pool force-backfill {pool-name}

or:

  ceph osd pool cancel-force-recovery {pool-name}
  ceph osd pool cancel-force-backfill {pool-name}

to restore the default recovery or backfill priority if you change your mind.

Note that these commands could possibly break the ordering of Ceph’s internal priority computations, so use them with caution! In particular, if you have multiple pools that are currently sharing the same underlying OSDs, and some particular pools hold data more important than others, we recommend you use the following command to re-arrange the recovery/backfill priorities of all pools in a better order:

  ceph osd pool set {pool-name} recovery_priority {value}

For example, if you have 10 pools, you could make the most important one priority 10, the next 9, and so on. Or you could leave most pools alone and have, say, 3 important pools all at priority 1, or at priorities 3, 2 and 1 respectively.
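
For instance, the latter scheme might look like this (the pool names are placeholders; higher values are given preference, consistent with the example above):

  ceph osd pool set vm-images recovery_priority 3
  ceph osd pool set user-data recovery_priority 2
  ceph osd pool set scratch recovery_priority 1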

Revert Lost

If the cluster has lost one or more objects, and you have decided to abandon the search for the lost data, you must mark the unfound objects as lost.

If all possible locations have been queried and objects are still lost, you may have to give up on the lost objects. This is possible given unusual combinations of failures that allow the cluster to learn about writes that were performed before the writes themselves are recovered.

Currently the only supported option is “revert”, which will either roll back to a previous version of the object or (if it was a new object) forget about it entirely. To mark the “unfound” objects as “lost”, execute the following:

  ceph pg {pg-id} mark_unfound_lost revert|delete

Important

Use this feature with caution, because it may confuse applications that expect the object(s) to exist.