Advanced Cluster Configuration

These examples show how to perform advanced configuration tasks on your Rook storage cluster.

Prerequisites

Most of the examples make use of the ceph client command. A quick way to use the Ceph client suite is from a Rook Toolbox container.

The Kubernetes based examples assume Rook OSD pods are in the rook-ceph namespace. If you run them in a different namespace, modify kubectl -n rook-ceph [...] to fit your situation.

Log Collection

All Rook logs can be collected in a Kubernetes environment with the following command:

    (for p in $(kubectl -n rook-ceph get pods -o jsonpath='{.items[*].metadata.name}')
    do
        for c in $(kubectl -n rook-ceph get pod ${p} -o jsonpath='{.spec.containers[*].name}')
        do
            echo "BEGIN logs from pod: ${p} ${c}"
            kubectl -n rook-ceph logs -c ${c} ${p}
            echo "END logs from pod: ${p} ${c}"
        done
    done
    for i in $(kubectl -n rook-ceph-system get pods -o jsonpath='{.items[*].metadata.name}')
    do
        echo "BEGIN logs from pod: ${i}"
        kubectl -n rook-ceph-system logs ${i}
        echo "END logs from pod: ${i}"
    done) | gzip > /tmp/rook-logs.gz

This gets the logs for every container in every Rook pod and then compresses them into a .gz archive for easy sharing. Instead of piping to gzip, you could pipe the output to less or redirect it to a single text file.
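
Once the archive exists, it is often useful to pull out the logs of a single pod. The following is a minimal sketch that relies only on the BEGIN/END marker lines written by the command above. The function name extract_pod_logs is hypothetical and not part of Rook.

```shell
# Print only the log section(s) belonging to one pod from the combined stream.
# Assumes the "BEGIN logs from pod: <pod> ..." and "END logs from pod: <pod> ..."
# marker lines produced by the collection command. extract_pod_logs is an
# illustrative helper name, not a Rook tool.
extract_pod_logs() {
  awk -v pod="$1" '
    /^BEGIN logs from pod: / && $5 == pod { printing = 1 }
    printing { print }
    /^END logs from pod: / && $5 == pod { printing = 0 }
  '
}

# Usage (commented out so the snippet has no side effects):
#   gzip -dc /tmp/rook-logs.gz | extract_pod_logs <pod-name>
```

For a pod with several containers, each container's section is printed in the order it was collected.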

OSD Information

Keeping track of OSDs and their underlying storage devices/directories can be difficult. The following scripts will clear things up quickly.

Kubernetes

    # Get OSD Pods
    # This uses the example/default cluster name "rook-ceph"
    OSD_PODS=$(kubectl -n rook-ceph get pods -l \
      app=rook-ceph-osd,rook_cluster=rook-ceph -o jsonpath='{.items[*].metadata.name}')
    # Find node and drive associations from OSD pods
    for pod in ${OSD_PODS}
    do
      echo "Pod: ${pod}"
      echo "Node: $(kubectl -n rook-ceph get pod ${pod} -o jsonpath='{.spec.nodeName}')"
      kubectl -n rook-ceph exec ${pod} -- sh -c '\
      for i in /var/lib/rook/osd*; do
        [ -f ${i}/ready ] || continue
        echo -ne "-$(basename ${i}) "
        echo $(lsblk -n -o NAME,SIZE ${i}/block 2> /dev/null || \
        findmnt -n -v -o SOURCE,SIZE -T ${i}) $(cat ${i}/type)
      done|sort -V
      echo'
    done

The output should look something like this. Note that OSDs on the same node will show duplicate information.

    Pod: osd-m2fz2
    Node: node1.zbrbdl
    -osd0 sda3 557.3G bluestore
    -osd1 sdf3 110.2G bluestore
    -osd2 sdd3 277.8G bluestore
    -osd3 sdb3 557.3G bluestore
    -osd4 sde3 464.2G bluestore
    -osd5 sdc3 557.3G bluestore
    Pod: osd-nxxnq
    Node: node3.zbrbdl
    -osd6 sda3 110.7G bluestore
    -osd17 sdd3 1.8T bluestore
    -osd18 sdb3 231.8G bluestore
    -osd19 sdc3 231.8G bluestore
    Pod: osd-tww1h
    Node: node2.zbrbdl
    -osd7 sdc3 464.2G bluestore
    -osd8 sdj3 557.3G bluestore
    -osd9 sdf3 66.7G bluestore
    -osd10 sdd3 464.2G bluestore
    -osd11 sdb3 147.4G bluestore
    -osd12 sdi3 557.3G bluestore
    -osd13 sdk3 557.3G bluestore
    -osd14 sde3 66.7G bluestore
    -osd15 sda3 110.2G bluestore
    -osd16 sdh3 135.1G bluestore

Separate Storage Groups

By default Rook/Ceph puts all storage under one replication rule in the CRUSH Map which provides the maximum amount of storage capacity for a cluster. If you would like to use different storage endpoints for different purposes, you’ll have to create separate storage groups.

In the following example we will separate SSD drives from spindle-based drives, a common practice for those looking to target certain workloads onto faster (database) or slower (file archive) storage.

CRUSH Hierarchy

To see the CRUSH hierarchy of all your hosts and OSDs run:

    ceph osd tree

Before we separate our disks into groups, our example cluster looks like this:

    ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -1 7.21828 root default
    -2 0.94529     host node1
     0 0.55730         osd.0       up  1.00000          1.00000
     1 0.11020         osd.1       up  1.00000          1.00000
     2 0.27779         osd.2       up  1.00000          1.00000
    -3 1.22480     host node2
     3 0.55730         osd.3       up  1.00000          1.00000
     4 0.11020         osd.4       up  1.00000          1.00000
     5 0.55730         osd.5       up  1.00000          1.00000
    -4 1.22480     host node3
     6 0.55730         osd.6       up  1.00000          1.00000
     7 0.11020         osd.7       up  1.00000          1.00000
     8 0.55730         osd.8       up  1.00000          1.00000

We have one root bucket default that every host and OSD is under, so all of these storage locations get pooled together for reads/writes/replication.

Let’s say that osd.1, osd.4, and osd.7 (the three OSDs with weight 0.11020 above) are our small SSD drives that we want to use separately.

First we will create a new root bucket called ssd in our CRUSH map. Under this new bucket we will add new host buckets for each node that contains an SSD drive so data can be replicated and used separately from the default HDD group.

    # Create a new tree in the CRUSH Map for SSD hosts and OSDs
    ceph osd crush add-bucket ssd root
    ceph osd crush add-bucket node1-ssd host
    ceph osd crush add-bucket node2-ssd host
    ceph osd crush add-bucket node3-ssd host
    ceph osd crush move node1-ssd root=ssd
    ceph osd crush move node2-ssd root=ssd
    ceph osd crush move node3-ssd root=ssd
    # Create a new rule for replication using the new tree
    ceph osd crush rule create-simple ssd ssd host firstn
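
With more than a handful of nodes, typing these commands by hand gets tedious. The sketch below prints the same command sequence for an arbitrary list of nodes without running anything, so the output can be reviewed first. The helper name print_crush_root_cmds and the "<node>-<root>" bucket naming are assumptions that simply mirror the example above.

```shell
# Print (do not run) the CRUSH commands that build a new root bucket with one
# host bucket per node, plus a replication rule that uses the new root.
# print_crush_root_cmds and the "<node>-<root>" naming are illustrative
# assumptions mirroring the example above.
print_crush_root_cmds() {
  root="$1"
  shift
  echo "ceph osd crush add-bucket ${root} root"
  for node in "$@"; do
    echo "ceph osd crush add-bucket ${node}-${root} host"
    echo "ceph osd crush move ${node}-${root} root=${root}"
  done
  echo "ceph osd crush rule create-simple ${root} ${root} host firstn"
}

# Usage (commented out so the snippet has no side effects):
#   print_crush_root_cmds ssd node1 node2 node3
```

After reviewing the printed commands, they can be pasted into the Rook Toolbox or piped to sh there.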

Secondly we will move the SSD OSDs into the new ssd tree, under their respective host buckets:

    ceph osd crush set osd.1 .1102 root=ssd host=node1-ssd
    ceph osd crush set osd.4 .1102 root=ssd host=node2-ssd
    ceph osd crush set osd.7 .1102 root=ssd host=node3-ssd

It’s important to note that the ceph osd crush set command requires a weight to be specified (our example uses .1102). If you’d like to change their weight you can do that here, otherwise be sure to specify their original weight seen in the ceph osd tree output.
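
To avoid mistyping the original weight, it can be read back from the tree output. This is a rough sketch that parses the plain-text output of ceph osd tree and assumes the WEIGHT column sits directly before the OSD name, as in the listings in this document. The function name osd_crush_weight is hypothetical.

```shell
# Read "ceph osd tree" text output on stdin and print the CRUSH weight of the
# named OSD. Assumes the weight is the field immediately before the OSD name,
# which holds for the listings shown in this document (with or without a
# CLASS column). osd_crush_weight is an illustrative helper name.
osd_crush_weight() {
  awk -v name="$1" '
    {
      for (i = 2; i <= NF; i++) {
        if ($i == name) { print $(i - 1); exit }
      }
    }
  '
}

# Usage (commented out so the snippet has no side effects):
#   ceph osd tree | osd_crush_weight osd.1
```

If the OSD name is not found, nothing is printed, so check for empty output before passing the value to ceph osd crush set.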

So let’s look at our CRUSH tree again with these changes:

    ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
    -8 0.33060 root ssd
    -5 0.11020     host node1-ssd
     1 0.11020         osd.1           up  1.00000          1.00000
    -6 0.11020     host node2-ssd
     4 0.11020         osd.4           up  1.00000          1.00000
    -7 0.11020     host node3-ssd
     7 0.11020         osd.7           up  1.00000          1.00000
    -1 7.21828 root default
    -2 0.83509     host node1
     0 0.55730         osd.0           up  1.00000          1.00000
     2 0.27779         osd.2           up  1.00000          1.00000
    -3 1.11460     host node2
     3 0.55730         osd.3           up  1.00000          1.00000
     5 0.55730         osd.5           up  1.00000          1.00000
    -4 1.11460     host node3
     6 0.55730         osd.6           up  1.00000          1.00000
     8 0.55730         osd.8           up  1.00000          1.00000

Using Disk Groups With Pools

Now we have a separate storage group for our SSDs, but we can’t use that storage until we associate a pool with it. The default group already has a pool called rbd in many cases. If you created a pool via CustomResourceDefinition, it will use the default storage group as well.

Here’s how to create new pools:

    # SSD backed pool with 128 (total) PGs
    ceph osd pool create ssd 128 128 replicated ssd

Now all you need to do is create RBD images or Kubernetes StorageClasses that specify the ssd pool to put it to use.

Configuring Pools

Placement Group Sizing

The general rules for deciding how many PGs your pool(s) should contain are:

  • Fewer than 5 OSDs: set pg_num to 128
  • Between 5 and 10 OSDs: set pg_num to 512
  • Between 10 and 50 OSDs: set pg_num to 1024

If you have more than 50 OSDs, you need to understand the tradeoffs and calculate the pg_num value yourself. The pgcalc tool can help with that calculation.
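
The rules of thumb above can be written as a small lookup. This is only a sketch of those three rules. The function name suggest_pg_num is hypothetical, and because the ranges overlap at exactly 10 OSDs, the sketch assumes that 10 OSDs falls into the 512 tier.

```shell
# Map an OSD count to the pg_num suggested by the rules of thumb above.
# suggest_pg_num is an illustrative helper name. Assumption: exactly 10 OSDs
# is treated as part of the "between 5 and 10" tier.
suggest_pg_num() {
  osds="$1"
  if [ "$osds" -lt 5 ]; then
    echo 128
  elif [ "$osds" -le 10 ]; then
    echo 512
  elif [ "$osds" -le 50 ]; then
    echo 1024
  else
    # No rule of thumb applies here; the value has to be calculated by hand.
    echo "more than 50 OSDs: calculate pg_num yourself (see pgcalc)" >&2
    return 1
  fi
}

# Usage (commented out so the snippet has no side effects):
#   ceph osd pool create mypool "$(suggest_pg_num 8)"
```

The pool name mypool in the usage comment is a placeholder.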

If you’re already using a pool it is generally safe to increase its PG count on the fly. Decreasing the PG count is not recommended on a pool that is in use. The safest way to decrease the PG count is to back up the data, delete the pool, and recreate it. With backups you can try a few potentially unsafe tricks for live pools, documented here.

Deleting A Pool

Be warned that this deletes all data from the pool, so Ceph by default makes it somewhat difficult to do.

First you must inject arguments to the Mon daemons to tell them to allow the deletion of pools. In Rook Tools you can do this:

    ceph tell mon.\* injectargs '--mon-allow-pool-delete=true'

Then to delete a pool, rbd in this example, run:

    ceph osd pool rm rbd rbd --yes-i-really-really-mean-it

Creating A Pool

    # Create a pool called rbd with 1024 total PGs, using the default
    # replication ruleset
    ceph osd pool create rbd 1024 1024 replicated replicated_ruleset

replicated_ruleset is the default CRUSH rule that replicates between the hosts and OSDs in the default root hierarchy.

Setting The Number Of Replicas

The size setting of a pool tells the cluster how many copies of the data should be kept for redundancy. By default the cluster will distribute these copies between host buckets in the CRUSH Map. The size can be set when creating a pool via CustomResourceDefinition or after creation with ceph.

So for example let’s change the size of the rbd pool to three:

    ceph osd pool set rbd size 3

Now if you run ceph -s you may see “recovery” operations and PGs in “undersized” and other “unclean” states. The cluster is essentially fixing itself since the number of replicas has been increased, and should return to the “active/clean” state shortly, once data has been replicated between hosts. When that’s done you can lose two of your storage nodes without losing any data in that pool, since CRUSH places each replica on a different storage node and at least one copy will survive. Note that client I/O to the affected PGs may pause until enough replicas are back to satisfy the pool’s min_size setting. Of course, as a tradeoff you will only have 1/3 of the raw capacity available.
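
The capacity tradeoff is simple arithmetic: usable capacity is roughly raw capacity divided by the pool size. The sketch below shows that calculation. The function name usable_capacity is hypothetical, and the estimate deliberately ignores filesystem overhead and Ceph's full-ratio safety margins.

```shell
# Estimate usable capacity of a replicated pool as raw capacity / replica count.
# usable_capacity is an illustrative helper name. The result is in the same
# unit as the raw value passed in, and ignores overhead and full ratios.
usable_capacity() {
  awk -v raw="$1" -v size="$2" 'BEGIN {
    if (size < 1) { exit 1 }
    printf "%.2f\n", raw / size
  }'
}

# Usage (commented out so the snippet has no side effects):
#   usable_capacity 9 3
```

For example, 9 TiB of raw storage in a pool with size 3 leaves about 3 TiB for data.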

Setting PG Count

Be sure to read the placement group sizing section before changing the number of PGs.

    # Set the number of PGs in the rbd pool to 512
    ceph osd pool set rbd pg_num 512

Custom ceph.conf Settings

With Rook the full range of Ceph settings is available to use on your storage cluster. When we supply Rook with a ceph.conf file, those settings will be propagated to all Mon, OSD, MDS, and RGW daemons.

In this example we will set the default pool size to two, and tell OSD daemons not to change the weight of OSDs on startup.

WARNING: Modify Ceph settings carefully. You are leaving the sandbox tested by Rook. Changing the settings could result in unhealthy daemons or even data loss if used incorrectly.

Kubernetes

When the Rook Operator creates a cluster, a placeholder ConfigMap is created that will allow you to override Ceph configuration settings. When the daemon pods are started, the settings specified in this ConfigMap will be merged with the default settings generated by Rook.

The default override settings are blank. Cutting out the extraneous properties, we would see the following defaults after creating a cluster:

    $ kubectl -n rook-ceph get ConfigMap rook-config-override -o yaml
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: rook-config-override
      namespace: rook-ceph
    data:
      config: ""

To apply your desired configuration, you will need to update this ConfigMap. The next time the daemon pod(s) start, the settings will be merged with the default settings created by Rook.

    kubectl -n rook-ceph edit configmap rook-config-override

Modify the settings and save. Each line you add should be indented from the config property as such:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rook-config-override
      namespace: rook-ceph
    data:
      config: |
        [global]
        osd crush update on start = false
        osd pool default size = 2

Each daemon will need to be restarted where you want the settings applied:

  • Mons: ensure all three mons are online and healthy before restarting each mon pod, one at a time
  • OSDs: restart the pods by deleting them, one at a time, running ceph -s between each restart to ensure the cluster returns to the “active/clean” state
  • RGW: the pods are stateless and can be restarted as needed
  • MDS: the pods are stateless and can be restarted as needed
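
Checking cluster state by hand between restarts is easy to forget, so a small polling helper can gate each step. The sketch below polls ceph health until it reports HEALTH_OK. The function name wait_for_health_ok and its retry and delay defaults are assumptions, and HEALTH_OK is used here as a stricter stand-in for "all PGs active/clean".

```shell
# Poll "ceph health" until it reports HEALTH_OK or the retries run out.
# wait_for_health_ok is an illustrative helper name. Arguments (both optional):
#   $1  maximum number of checks (assumed default: 60)
#   $2  seconds to sleep between checks (assumed default: 5)
# Returns 0 once the cluster is healthy, 1 if it never got there.
wait_for_health_ok() {
  retries="${1:-60}"
  delay="${2:-5}"
  attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    if ceph health 2>/dev/null | grep -q '^HEALTH_OK'; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}

# Usage (commented out so the snippet has no side effects), run where both
# kubectl and the ceph client are available:
#   kubectl -n rook-ceph delete pod <osd-pod-name>
#   wait_for_health_ok || echo "cluster did not become healthy in time"
```

If the cluster carries an unrelated warning, HEALTH_OK will never appear, in which case inspect ceph -s manually as described above.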

After the pod restart, your new settings should be in effect. Note that if you create the ConfigMap in the rook-ceph namespace before the cluster is created, the daemons will pick up the settings at first launch.

The only validation of the settings done by Rook is whether the settings can be merged using the ini file format with the default settings created by Rook. Beyond that, the validity of the settings is your responsibility.
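
Because a malformed override is only caught at merge time, a quick syntax check before editing the ConfigMap can save a restart cycle. The sketch below is a rough approximation of ini syntax (blank lines, comments, section headers, and key = value lines). It is not Rook's own validation, it does not check that a setting name is known to Ceph, and the function name check_ceph_conf_syntax is hypothetical.

```shell
# Rough ini-syntax check for a ceph.conf fragment read from stdin.
# check_ceph_conf_syntax is an illustrative helper name and only approximates
# the ini format; it does NOT reproduce Rook's merge logic or validate
# setting names. Exits non-zero and reports each line it does not recognize.
check_ceph_conf_syntax() {
  awk '
    /^[ \t]*$/ { next }                      # blank line
    /^[ \t]*[#;]/ { next }                   # comment line
    /^[ \t]*\[[^]]+\][ \t]*$/ { next }       # [section] header
    /^[ \t]*[A-Za-z_][^=]*=/ { next }        # key = value
    { printf "line %d: not valid ini syntax: %s\n", NR, $0; bad = 1 }
    END { exit bad }
  '
}

# Usage (commented out so the snippet has no side effects):
#   check_ceph_conf_syntax < my-override.conf
```

The file name my-override.conf in the usage comment is a placeholder for wherever you keep the text you intend to paste into the config property.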

OSD CRUSH Settings

A useful view of the CRUSH Map is generated with the following command:

    ceph osd tree

In this section we will be tweaking some of the values seen in the output.

OSD Weight

The CRUSH weight controls the ratio of data that should be distributed to each OSD. This also means a higher or lower amount of disk I/O operations for an OSD with higher/lower weight, respectively.

By default OSDs get a weight relative to their storage capacity, which maximizes overall cluster capacity by filling all drives at the same rate, even if drive sizes vary. This should work for most use-cases, but the following situations could warrant weight changes:

  • Your cluster has some relatively slow OSDs or nodes. Lowering their weight can reduce the impact of this bottleneck.
  • You’re using bluestore drives provisioned with Rook v0.3.1 or older. In this case you may notice OSD weights did not get set relative to their storage capacity. Changing the weight can fix this and maximize cluster capacity.

This example sets the weight of osd.0, which is 600GiB in size:

    ceph osd crush reweight osd.0 .600

OSD Primary Affinity

When pools are set with a size setting greater than one, data is replicated between nodes and OSDs. For every chunk of data a Primary OSD is selected to be used for reading that data to be sent to clients. You can control how likely it is for an OSD to become a Primary using the Primary Affinity setting. This is similar to the OSD weight setting, except it only affects reads on the storage device, not capacity or writes.

In this example we will make sure osd.0 is only selected as Primary if all other OSDs holding replica data are unavailable:

    ceph osd primary-affinity osd.0 0

Phantom OSD Removal

If you have OSDs that are not showing any disks, you can remove those “Phantom OSDs” by following the instructions below. To check for “Phantom OSDs”, you can run:

    ceph osd tree

An example output looks like this:

    ID  CLASS WEIGHT   TYPE NAME                   STATUS REWEIGHT PRI-AFF
     -1       57.38062 root default
    -13        7.17258     host node1.example.com
      2   hdd  3.61859         osd.2                   up  1.00000 1.00000
     -7              0     host node2.example.com    down        0 1.00000

The host node2.example.com in the output has no disks, so it is most likely a “Phantom OSD”.
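
On a large cluster, spotting such entries by eye is error-prone. The sketch below scans the text output of ceph osd tree and prints every host bucket that has no OSD listed beneath it. The function name list_empty_hosts is hypothetical, and the parsing assumes the plain-text layout shown above, where bucket rows start with a negative ID and OSD rows contain a name of the form osd.N.

```shell
# Read "ceph osd tree" text output on stdin and print the name of every host
# bucket that has no OSD rows beneath it. list_empty_hosts is an illustrative
# helper name. Assumes bucket rows start with a negative ID and OSD rows
# contain a field of the form "osd.N", as in the listing above.
list_empty_hosts() {
  awk '
    function emit_pending() { if (pending != "") print pending; pending = "" }
    {
      is_host = 0; is_osd = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "host" && i < NF) { is_host = 1; name = $(i + 1) }
        if ($i ~ /^osd\.[0-9]+$/) is_osd = 1
      }
      if (is_osd) {
        pending = ""              # the current host has at least one OSD
      } else if ($1 ~ /^-[0-9]+$/) {
        emit_pending()            # a new bucket starts: report an empty host
        if (is_host) pending = name
      }
    }
    END { emit_pending() }
  '
}

# Usage (commented out so the snippet has no side effects):
#   ceph osd tree | list_empty_hosts
```

Treat the output as a list of candidates to inspect, not as confirmation that an entry is safe to remove.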

Now to remove it, use the ID in the first column of the output and replace <ID> with it. In the example output above the ID would be -7. The commands are:

    ceph osd out <ID>
    ceph osd crush remove osd.<ID>
    ceph auth del osd.<ID>
    ceph osd rm <ID>

To verify that the Phantom OSD was removed, re-run the following command and check that the entry with that ID no longer shows up:

    ceph osd tree