Device Management
Ceph tracks which hardware storage devices (e.g., HDDs, SSDs) are consumed by which daemons, and collects health metrics about those devices in order to provide tools to predict and/or automatically respond to hardware failure.
Device tracking
You can query which storage devices are in use with:
- ceph device ls
You can also list devices by daemon or by host:
- ceph device ls-by-daemon <daemon>
- ceph device ls-by-host <host>
For any individual device, you can query information about its location and how it is being consumed with:
- ceph device info <devid>
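For example, to list the devices backing a particular OSD and then inspect one of them (the daemon name and device id below are purely illustrative; real device ids are typically built from the vendor, model, and serial number):
- ceph device ls-by-daemon osd.0
- ceph device info SEAGATE_ST4000NM0023_ZC11VLBP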
Enabling monitoring
Ceph can also monitor health metrics associated with your device. For example, SATA hard disks implement a standard called SMART that provides a wide range of internal metrics about the device’s usage and health, like the number of hours powered on, number of power cycles, or unrecoverable read errors. Other device types like SAS and NVMe implement a similar set of metrics (via slightly different standards). All of these can be collected by Ceph via the smartctl tool.
You can enable or disable health monitoring with:
- ceph device monitoring on
or:
- ceph device monitoring off
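If you want to confirm whether monitoring is currently enabled, the toggle is stored as a manager module option; assuming the option is named mgr/devicehealth/enable_monitoring, you can query it with:
- ceph config get mgr mgr/devicehealth/enable_monitoring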
Scraping
If monitoring is enabled, metrics will automatically be scraped at regular intervals. That interval can be configured with:
- ceph config set mgr mgr/devicehealth/scrape_frequency <seconds>
The default is to scrape once every 24 hours.
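For example, to scrape every 12 hours instead of the default 24 (the value is a number of seconds):
- ceph config set mgr mgr/devicehealth/scrape_frequency 43200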
You can manually trigger a scrape of all devices with:
- ceph device scrape-health-metrics
A single device can be scraped with:
- ceph device scrape-health-metrics <device-id>
Or a single daemon’s devices can be scraped with:
- ceph device scrape-daemon-health-metrics <who>
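As an illustration, the following scrapes one specific device and then all devices attached to one OSD daemon (both identifiers are hypothetical):
- ceph device scrape-health-metrics SEAGATE_ST4000NM0023_ZC11VLBP
- ceph device scrape-daemon-health-metrics osd.12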
The stored health metrics for a device can be retrieved (optionally for a specific timestamp) with:
- ceph device get-health-metrics <devid> [sample-timestamp]
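For example, with a hypothetical device id (omitting the timestamp returns the stored samples, each keyed by the time it was scraped):
- ceph device get-health-metrics SEAGATE_ST4000NM0023_ZC11VLBP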
Failure prediction
Ceph can predict life expectancy and device failures based on the health metrics it collects. There are three modes:
none: disable device failure prediction.
local: use a pre-trained prediction model from the ceph-mgr daemon.
cloud: share device health and performance metrics with an external cloud service run by ProphetStor, using either their free service or a paid service with more accurate predictions.
The prediction mode can be configured with:
- ceph config set global device_failure_prediction_mode <mode>
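For example, to use the local, pre-trained model shipped with ceph-mgr:
- ceph config set global device_failure_prediction_mode local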
Prediction normally runs in the background on a periodic basis, so it may take some time before life expectancy values are populated. You can see the life expectancy of all devices in output from:
- ceph device ls
You can also query the metadata for a specific device with:
- ceph device info <devid>
You can explicitly force prediction of a device’s life expectancy with:
- ceph device predict-life-expectancy <devid>
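For example, with the same hypothetical device id used above:
- ceph device predict-life-expectancy SEAGATE_ST4000NM0023_ZC11VLBP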
If you are not using Ceph’s internal device failure prediction but have some external source of information about device failures, you can inform Ceph of a device’s life expectancy with:
- ceph device set-life-expectancy <devid> <from> [<to>]
Life expectancies are expressed as a time interval so that uncertainty can be expressed in the form of a wide interval. The interval end can also be left unspecified.
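For example, to record that a device (hypothetical id below) is expected to fail some time between two dates, assuming dates in YYYY-MM-DD form are accepted (omit the second date to leave the interval open-ended):
- ceph device set-life-expectancy SEAGATE_ST4000NM0023_ZC11VLBP 2026-03-01 2026-09-01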
Health alerts
The mgr/devicehealth/warn_threshold controls how soon an expected device failure must be before we generate a health warning.
The stored life expectancy of all devices can be checked, and any appropriate health alerts generated, with:
- ceph device check-health
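For example, to be warned of failures predicted within roughly one week (assuming the threshold is expressed as a number of seconds; the value here is only illustrative):
- ceph config set mgr mgr/devicehealth/warn_threshold 604800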
Automatic Mitigation
If the mgr/devicehealth/self_heal option is enabled (it is by default), then for devices that are expected to fail soon the module will automatically migrate data away from them by marking the devices “out”.
The mgr/devicehealth/mark_out_threshold controls how soon an expected device failure must be before we automatically mark an osd “out”.
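For example, to mark OSDs “out” when a failure is predicted within roughly four weeks (again assuming the threshold is a number of seconds; the value is only illustrative):
- ceph config set mgr mgr/devicehealth/mark_out_threshold 2419200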