Overview of Fault Detection and Recovery

The Greenplum Database server (postgres) subprocess named ftsprobe handles fault detection. ftsprobe monitors the Greenplum Database array; it connects to and scans all segments and database processes at intervals that you can configure.

If ftsprobe cannot connect to a segment, it marks the segment as “down” in the Greenplum Database system catalog. The segment remains nonoperational until an administrator initiates the recovery process.
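The probe-and-mark behavior described above can be modeled as a simple loop. The following is a minimal, hypothetical Python sketch, not the actual ftsprobe implementation. The `SegmentCatalog` class, the `probe_once` and `probe_loop` functions, and the injected `can_connect` callable are illustrative assumptions, and the "catalog" is an in-memory dictionary rather than the Greenplum system catalog. The single-letter status codes are also chosen only for this sketch.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List

UP, DOWN = "u", "d"  # status codes used only by this sketch

@dataclass
class SegmentCatalog:
    """Hypothetical in-memory stand-in for the system catalog."""
    status: Dict[str, str] = field(default_factory=dict)

    def mark_down(self, segment: str) -> None:
        self.status[segment] = DOWN

def probe_once(catalog: SegmentCatalog, can_connect: Callable[[str], bool]) -> List[str]:
    """Probe every segment that is currently up; mark unreachable ones down.

    Segments already marked down are skipped: in this model they stay
    down until an administrator recovers them.
    """
    newly_down = []
    for segment, state in list(catalog.status.items()):
        if state == UP and not can_connect(segment):
            catalog.mark_down(segment)
            newly_down.append(segment)
    return newly_down

def probe_loop(catalog: SegmentCatalog, can_connect: Callable[[str], bool],
               interval_seconds: float, rounds: int) -> None:
    """Run `rounds` probe passes, sleeping a configurable interval between them."""
    for i in range(rounds):
        probe_once(catalog, can_connect)
        if i < rounds - 1:
            time.sleep(interval_seconds)

# Usage: seg1 is unreachable, so one probe pass marks it down.
catalog = SegmentCatalog({"seg0": UP, "seg1": UP, "seg2": UP})
reachable = {"seg0", "seg2"}
print(probe_once(catalog, lambda s: s in reachable))  # ['seg1']
print(catalog.status)  # {'seg0': 'u', 'seg1': 'd', 'seg2': 'u'}
```

Note that a segment that becomes reachable again is not flipped back to up by the probe in this sketch, which mirrors the statement that the segment remains nonoperational until recovery is initiated.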

With mirroring enabled, Greenplum Database automatically fails over to the mirror copy if a primary copy becomes unavailable. The system remains operational when a segment instance or host fails, provided that all data is still available on the remaining active segments.
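To make the availability rule concrete, here is a hypothetical Python sketch in which each portion of the data has a primary instance and an optional mirror. When the primary is down and the mirror is up, the mirror serves that data; the system counts as operational only if every portion still has a live copy. Without a mirror, a failed primary leaves that data unavailable. The `SegmentPair` model and the `serving_instance` and `system_operational` functions are assumptions for illustration, not Greenplum's internal failover logic.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class SegmentPair:
    """Hypothetical model of one portion of the data: a primary and an optional mirror."""
    primary_up: bool
    mirror_up: Optional[bool] = None  # None means no mirror is configured

def serving_instance(pair: SegmentPair) -> Optional[str]:
    """Return which copy serves this data, or None if no copy is up."""
    if pair.primary_up:
        return "primary"
    if pair.mirror_up:  # primary failed: fail over to the mirror
        return "mirror"
    return None

def system_operational(pairs: Dict[int, SegmentPair]) -> bool:
    """Operational only if every portion of the data has a live copy."""
    return all(serving_instance(p) is not None for p in pairs.values())

# Usage: one primary has failed, but its mirror is up.
pairs = {0: SegmentPair(primary_up=True, mirror_up=True),
         1: SegmentPair(primary_up=False, mirror_up=True)}
print(serving_instance(pairs[1]))  # mirror
print(system_operational(pairs))   # True

# Without a mirror, a single failed primary makes data unavailable.
unmirrored = {0: SegmentPair(primary_up=False)}
print(system_operational(unmirrored))  # False
```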

To recover failed segments, an administrator runs the gprecoverseg recovery utility. This utility locates the failed segments, verifies that they are valid, and compares their transactional state with that of the currently active segment to determine the changes made while the segment was offline. gprecoverseg then synchronizes the changed database files with the active segment and brings the segment back online. Administrators can perform the recovery while Greenplum Database is up and running.
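The comparison step, which finds what changed on the active segment while the failed one was offline and copies only that, can be sketched as follows. This is a hypothetical Python model, not how gprecoverseg is implemented. Each segment's data is represented as a dictionary mapping file names to a version counter (an assumption standing in for the transactional state), the status values reuse a made-up "d"/"u" convention, and `changed_files` and `recover_segment` are illustrative names.

```python
from typing import Dict, List, Tuple

Files = Dict[str, int]  # file name -> version counter (sketch assumption)

def changed_files(failed: Files, active: Files) -> Tuple[List[str], List[str]]:
    """Return (files to copy, files to delete) to bring `failed` in line with `active`."""
    to_copy = sorted(name for name, ver in active.items() if failed.get(name) != ver)
    to_delete = sorted(name for name in failed if name not in active)
    return to_copy, to_delete

def recover_segment(failed: Files, active: Files,
                    status: Dict[str, str], seg: str) -> List[str]:
    """Synchronize only the changed files, then mark the segment up again."""
    # Only segments recorded as down are candidates for recovery.
    if status.get(seg) != "d":
        raise ValueError(f"{seg} is not marked down; nothing to recover")
    to_copy, to_delete = changed_files(failed, active)
    for name in to_copy:
        failed[name] = active[name]
    for name in to_delete:
        del failed[name]
    status[seg] = "u"  # back online
    return to_copy

# Usage: base/2 changed and base/3 was created while seg1 was offline.
active = {"base/1": 3, "base/2": 5, "base/3": 1}
failed = {"base/1": 3, "base/2": 4, "base/4": 2}
status = {"seg1": "d"}
print(recover_segment(failed, active, status, "seg1"))  # ['base/2', 'base/3']
print(failed == active, status)  # True {'seg1': 'u'}
```

Copying only the differences, rather than the whole data directory, is what keeps this style of recovery fast when a segment was offline only briefly.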

With mirroring deactivated, the system automatically shuts down if a segment instance fails. Administrators must manually recover all failed segments before operations can resume.

See Detecting a Failed Segment for a more detailed description of the fault detection and recovery process and configuration options.

Parent topic: Overview of Greenplum Database High Availability