RGW Dynamic Bucket Index Resharding

New in version Luminous.

A large bucket index can lead to performance problems. In order to address this problem we introduced bucket index sharding. Until Luminous, changing the number of bucket shards (resharding) needed to be done offline. Starting with Luminous we support online bucket resharding.

Each bucket index shard can handle its entries efficiently up until reaching a certain threshold number of entries. If this threshold is exceeded the system can encounter performance issues. The dynamic resharding feature detects this situation and automatically increases the number of shards used by the bucket index, resulting in a reduction of the number of entries in each bucket index shard. This process is transparent to the user.

By default dynamic bucket index resharding can only increase the number of bucket index shards to 1999, although the upper bound is a configuration parameter (see Configuration below). Furthermore, when possible, the process chooses a prime number of bucket index shards to help spread the bucket index entries across the shards more evenly.

The detection process runs in a background process that periodically scans all the buckets. A bucket that requires resharding is added to the resharding queue and will be scheduled to be resharded later. The reshard thread runs in the background and executes the scheduled resharding tasks, one at a time.
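The trigger condition can be approximated as follows. This is a simplified sketch of the heuristic described above, not RGW's actual implementation; the function and constant names are illustrative only.

```python
# Simplified model of dynamic-resharding detection. Names are
# illustrative; the defaults mirror the configuration options below.
MAX_OBJS_PER_SHARD = 100000   # rgw_max_objs_per_shard default
MAX_DYNAMIC_SHARDS = 1999     # rgw_max_dynamic_shards default

def needs_reshard(num_objects: int, num_shards: int) -> bool:
    """A bucket qualifies once its per-shard entry count exceeds the threshold."""
    return num_objects > num_shards * MAX_OBJS_PER_SHARD

def target_shards(num_objects: int) -> int:
    """Shard count needed to bring each shard back under the threshold,
    capped at the configured maximum."""
    wanted = -(-num_objects // MAX_OBJS_PER_SHARD)  # ceiling division
    return min(wanted, MAX_DYNAMIC_SHARDS)

# A bucket with 350,000 objects on 2 shards exceeds 100,000 entries/shard:
print(needs_reshard(350000, 2))  # True
print(target_shards(350000))     # 4 (before rounding up to a prime)
```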

Multisite

Dynamic resharding is not supported in a multisite environment.

Configuration

Enable/Disable dynamic bucket index resharding:

  • rgw_dynamic_resharding: true/false, default: true

Configuration options that control the resharding process:

  • rgw_max_objs_per_shard: maximum number of objects per bucket index shard before resharding is triggered, default: 100000 objects

  • rgw_max_dynamic_shards: maximum number of shards that dynamic bucket index resharding can increase to, default: 1999

  • rgw_reshard_bucket_lock_duration: duration, in seconds, of the lock held on the bucket object during resharding, default: 360 seconds (i.e., 6 minutes)

  • rgw_reshard_thread_interval: maximum time, in seconds, between rounds of resharding queue processing, default: 600 seconds (i.e., 10 minutes)

  • rgw_reshard_num_logs: number of shards for the resharding queue, default: 16
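These options can be set in ceph.conf; the snippet below shows the defaults. The section name depends on how your RGW instances are named (a generic [client.rgw] section is assumed here for illustration).

```ini
[client.rgw]
rgw_dynamic_resharding = true
rgw_max_objs_per_shard = 100000
rgw_max_dynamic_shards = 1999
rgw_reshard_bucket_lock_duration = 360
rgw_reshard_thread_interval = 600
rgw_reshard_num_logs = 16
```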

Admin commands

Add a bucket to the resharding queue

  # radosgw-admin reshard add --bucket <bucket_name> --num-shards <new number of shards>

List resharding queue

  # radosgw-admin reshard list

Process tasks on the resharding queue

  # radosgw-admin reshard process

Bucket resharding status

  # radosgw-admin reshard status --bucket <bucket_name>

The output is a JSON array with one object per shard; each object contains three fields: reshard_status, new_bucket_instance_id, and num_shards.

For example, the output at different Dynamic Resharding stages is shown below:

1. Before resharding occurred:

  [
      {
          "reshard_status": "not-resharding",
          "new_bucket_instance_id": "",
          "num_shards": -1
      }
  ]

2. During resharding:

  [
      {
          "reshard_status": "in-progress",
          "new_bucket_instance_id": "1179f470-2ebf-4630-8ec3-c9922da887fd.8652.1",
          "num_shards": 2
      },
      {
          "reshard_status": "in-progress",
          "new_bucket_instance_id": "1179f470-2ebf-4630-8ec3-c9922da887fd.8652.1",
          "num_shards": 2
      }
  ]

3. After resharding has completed:

  [
      {
          "reshard_status": "not-resharding",
          "new_bucket_instance_id": "",
          "num_shards": -1
      },
      {
          "reshard_status": "not-resharding",
          "new_bucket_instance_id": "",
          "num_shards": -1
      }
  ]
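Because the status output is plain JSON, it is easy to consume from scripts. A minimal sketch, assuming the output of radosgw-admin reshard status has already been captured into a string (hard-coded below for illustration):

```python
import json

# Example status output captured from `radosgw-admin reshard status`.
status_json = '''
[
    {
        "reshard_status": "in-progress",
        "new_bucket_instance_id": "1179f470-2ebf-4630-8ec3-c9922da887fd.8652.1",
        "num_shards": 2
    },
    {
        "reshard_status": "in-progress",
        "new_bucket_instance_id": "1179f470-2ebf-4630-8ec3-c9922da887fd.8652.1",
        "num_shards": 2
    }
]
'''

shards = json.loads(status_json)
# The bucket is mid-reshard if any index shard reports "in-progress".
resharding = any(s["reshard_status"] == "in-progress" for s in shards)
print(resharding)  # True
```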

Cancel pending bucket resharding

Note: Ongoing bucket resharding operations cannot be cancelled; only pending ones can.

  # radosgw-admin reshard cancel --bucket <bucket_name>

Manual immediate bucket resharding

  # radosgw-admin bucket reshard --bucket <bucket_name> --num-shards <new number of shards>

When choosing a number of shards, the administrator should keep a number of items in mind. Ideally the administrator is aiming for no more than 100000 entries per shard, now and through some future point in time.

Additionally, bucket index shard counts that are prime numbers tend to work better at evenly distributing bucket index entries across the shards. For example, 7001 bucket index shards is better than 7000 since the former is prime. A variety of web sites have lists of prime numbers; search for "list of prime numbers" with your favorite web search engine to locate some.
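Rather than consulting a list, the smallest prime at or above the desired shard count can be computed directly. A sketch; at these magnitudes (a few thousand shards) simple trial division is more than fast enough:

```python
def is_prime(n: int) -> bool:
    """Trial division; fine for shard-count-sized inputs."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n

# Aiming for roughly 7000 shards, prefer the nearby prime:
print(next_prime(7000))  # 7001
```

Note that the default upper bound of 1999 shards is itself prime.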

Troubleshooting

Clusters prior to Luminous 12.2.11 and Mimic 13.2.5 left behind stale bucket instance entries, which were not automatically cleaned up. The issue also affected lifecycle policies, which were no longer applied to resharded buckets. Both of these issues can be worked around using a couple of radosgw-admin commands.

Stale instance management

List the stale instances in a cluster that are ready to be cleaned up.

  # radosgw-admin reshard stale-instances list

Clean up the stale instances in a cluster. Note: cleanup of these instances should only be done on a single-site cluster.

  # radosgw-admin reshard stale-instances rm

Lifecycle fixes

For clusters that had resharded instances, it is highly likely that the old lifecycle processes would have flagged and deleted lifecycle processing as the bucket instance changed during a reshard. While this is fixed for newer clusters (from Mimic 13.2.6 and Luminous 12.2.12), older buckets that had lifecycle policies and that have undergone resharding will have to be manually fixed.

The command to do so is:

  # radosgw-admin lc reshard fix --bucket {bucketname}

As a convenience wrapper, if the --bucket argument is dropped then this command will try to fix lifecycle policies for all the buckets in the cluster.

Object Expirer fixes

Objects subject to Swift object expiration on older clusters may have been dropped from the log pool and never deleted after the bucket was resharded. This would happen if their expiration time was before the cluster was upgraded; if their expiration was after the upgrade, the objects would be correctly handled. To manage these expire-stale objects, radosgw-admin provides two subcommands.

Listing:

  # radosgw-admin objects expire-stale list --bucket {bucketname}

Displays a list of object names and expiration times in JSON format.

Deleting:

  # radosgw-admin objects expire-stale rm --bucket {bucketname}

Initiates deletion of such objects, displaying a list of object names, expiration times, and deletion status in JSON format.