Best Practices for Operating HAWQ
This topic provides best practices for operating HAWQ, including recommendations for stopping, starting and monitoring HAWQ.
Best Practices Using the Command Line to Start/Stop HAWQ Cluster Members
The following best practices are recommended when using `hawq start` and `hawq stop` to manage your HAWQ cluster.
- Always use HAWQ management commands to start and stop HAWQ, instead of Postgres equivalents.
- Use `hawq start cluster` or `hawq restart cluster` to start the entire cluster, rather than starting the master and individual segments.
- If you do decide to use `hawq start standby|master|segment` to start nodes individually, always start the standby before the active master. Otherwise, the standby can become unsynchronized with the active master.
- When stopping a cluster, issue the `CHECKPOINT` command to update and flush all data files to disk and update the log file before stopping the cluster. A checkpoint ensures that, in the event of a crash, files can be restored from the checkpoint snapshot.

Stop the entire HAWQ system by stopping the cluster on the master host:

```shell
$ hawq stop cluster
```

To stop segments and kill any running queries without causing data loss or inconsistency issues, use `fast` or `immediate` mode on the cluster:

```shell
$ hawq stop cluster -M fast
$ hawq stop cluster -M immediate
```

Use `hawq stop master` to stop the master only. If you cannot stop the master due to running transactions, try using fast shutdown. If fast shutdown does not work, use immediate shutdown. Use immediate shutdown with caution, as it will result in a crash-recovery run when the system is restarted.

```shell
$ hawq stop master -M fast
$ hawq stop master -M immediate
```

When stopping a segment or all segments, you can use the default smart mode. Using fast or immediate mode on segments has no effect, since segments are stateless.

```shell
$ hawq stop segment
$ hawq stop allsegments
```
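The checkpoint-then-stop sequence above can be wrapped in a small shell function. This is a sketch only: it assumes `psql` and `hawq` are on the `PATH` on the master host, and the fast-mode fallback mirrors the guidance given for stopping the master.

```shell
# Sketch of the recommended shutdown sequence (assumes psql and hawq
# are on the PATH and the HAWQ master is reachable).
stop_hawq_cluster() {
  # Flush all data files to disk first, so that after a crash, files
  # can be restored from the checkpoint snapshot.
  psql -d postgres -c "CHECKPOINT;" || return 1
  # Try the default smart stop; fall back to fast mode if running
  # transactions prevent a smart shutdown.
  hawq stop cluster || hawq stop cluster -M fast
}
# Usage (not run here): stop_hawq_cluster
```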
Guidelines for Cluster Expansion
This topic provides some guidelines around expanding your HAWQ cluster.
There are several recommendations to keep in mind when modifying the size of your running HAWQ cluster:
- When you add a new node, install both a DataNode and a physical segment on the new node.
- After adding a new node, always rebalance HDFS data to maintain cluster performance.
- Adding or removing a node also necessitates an update to the HDFS metadata cache. This update happens eventually, but can take some time. To speed the update of the metadata cache, execute `SELECT gp_metadata_cache_clear();`.
- Note that for hash distributed tables, expanding the cluster does not immediately improve performance, because hash distributed tables use a fixed number of virtual segments. To obtain better performance with hash distributed tables, you must redistribute each table to the updated cluster with either the `ALTER TABLE` or `CREATE TABLE AS` command.
- If you are using hash tables, consider updating the `default_hash_table_bucket_number` server configuration parameter to a larger value after expanding the cluster but before redistributing the hash tables.
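The redistribution step above can be sketched as a shell helper. The database name `mydb` and table name `sales` are hypothetical, and the `SET WITH (REORGANIZE=true)` form of `ALTER TABLE` is assumed to be available, as in Greenplum-derived systems; verify against your HAWQ version before use.

```shell
# Sketch only: redistribute a hash-distributed table after cluster
# expansion. Database and table names are hypothetical; assumes psql
# can reach the HAWQ master.
redistribute_table() {
  db="$1"
  table="$2"
  # Rewrites the table's data across the current set of segments.
  psql -d "$db" -c "ALTER TABLE $table SET WITH (REORGANIZE=true);"
}
# Usage (not run here): redistribute_table mydb sales
```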
Best Practices for Monitoring a HAWQ System
The sections below provide recommendations for common monitoring tasks.
Note: If your HAWQ cluster is managed through the Ambari interface, use Ambari to perform the suggested monitoring tasks.
Database State Monitoring Activities
Activity | Procedure | Corrective Actions |
---|---|---|
List segments that are currently down. If any rows are returned, this should generate a warning or alert.<br>Recommended frequency: run every 5 to 10 minutes<br>Severity: IMPORTANT | Run the following query in the `postgres` database. | If the query returns any rows, follow these steps to correct the problem. |
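The exact query is not reproduced in this table, but one hedged sketch follows. It assumes the `gp_segment_configuration` catalog table with a `status` column in which `'u'` means up; check your HAWQ catalog before relying on it.

```shell
# Sketch: surface down segments. Assumes the gp_segment_configuration
# catalog table with a status column where 'u' means the segment is up.
list_down_segments() {
  psql -d postgres -At -c \
    "SELECT hostname FROM gp_segment_configuration WHERE status <> 'u';"
}
# Usage (not run here): alert if list_down_segments prints any rows.
```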
Hardware and Operating System Monitoring
Activity | Procedure | Corrective Actions |
---|---|---|
Check the underlying platform for required hardware maintenance or system failures.<br>Recommended frequency: real-time, if possible, or every 15 minutes<br>Severity: CRITICAL | Set up system checks for hardware and OS errors. | If required, remove a machine from the HAWQ cluster to resolve hardware and OS issues, then add it back to the cluster after the issues are resolved. |
Check disk space usage on volumes used for HAWQ data storage and the OS.<br>Recommended frequency: every 5 to 30 minutes<br>Severity: CRITICAL | Set up a disk space check. | Free space on the system by removing some data or files. |
Check for errors or dropped packets on the network interfaces.<br>Recommended frequency: hourly<br>Severity: IMPORTANT | Set up network interface checks. | Work with network and OS teams to resolve errors. |
Check for RAID errors or degraded RAID performance.<br>Recommended frequency: every 5 minutes<br>Severity: CRITICAL | Set up a RAID check. | |
Check for adequate I/O bandwidth and I/O skew.<br>Recommended frequency: when creating a cluster, or when hardware issues are suspected. | Run the `hawq checkperf` utility. | The cluster may be under-specified if data transfer rates fall short of the expected baseline for the hardware. If the machines on the cluster display an uneven performance profile, work with the system administration team to fix faulty machines. |
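The disk-space check in the table above can be scripted with standard tools. This is a minimal sketch: the mount point and the 80% threshold in the usage example are illustrative, not values from the source.

```shell
# Minimal disk-usage check: succeeds (exit 0) when the volume holding
# the given path is below the given percent used. Paths and thresholds
# in the usage example are illustrative.
check_disk() {
  path="$1"
  max_pct="$2"
  # df -P emits one POSIX-format data line; field 5 is "Use%" (e.g. 43%).
  used=$(df -P "$path" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  [ "$used" -lt "$max_pct" ]
}
# Example: check_disk /data/hawq 80 && echo "disk OK"
```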
Data Maintenance
Activity | Procedure | Corrective Actions |
---|---|---|
Check for missing statistics on tables. | Check the `hawq_stats_missing` view in each database. | Run `ANALYZE` on tables that are missing statistics. |
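The check-then-analyze loop above can be sketched in shell. The `smischema` and `smitable` column names are assumptions based on similar catalog views; confirm them against your `hawq_stats_missing` view before use.

```shell
# Sketch: find tables missing statistics and ANALYZE each one.
# Assumes the hawq_stats_missing view exposes smischema and smitable
# columns (an assumption -- verify against your database).
analyze_missing() {
  db="$1"
  psql -d "$db" -At -c \
    "SELECT smischema || '.' || smitable FROM hawq_stats_missing;" |
  while read -r tbl; do
    psql -d "$db" -c "ANALYZE $tbl;"
  done
}
# Usage (not run here): analyze_missing mydb
```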
Database Maintenance
Activity | Procedure | Corrective Actions |
---|---|---|
Mark deleted rows in HAWQ system catalogs (tables in the `pg_catalog` schema) so that the space they occupy can be reused.<br>Recommended frequency: daily<br>Severity: CRITICAL | Vacuum each system catalog. | Vacuum system catalogs regularly to prevent bloating. |
Update table statistics.<br>Recommended frequency: after loading data and before executing queries<br>Severity: CRITICAL | Analyze user tables. | Analyze updated tables regularly so that the optimizer can produce efficient query execution plans. |
Back up the database data.<br>Recommended frequency: daily, or as required by your backup plan<br>Severity: CRITICAL | See Backing up and Restoring HAWQ Databases for a discussion of backup procedures. | Best practice is to have a current backup ready in case the database must be restored. |
Reindex system catalogs (tables in the `pg_catalog` schema) to maintain an efficient catalog.<br>Recommended frequency: weekly, or more often if database objects are created and dropped frequently | Run `REINDEX SYSTEM` in each database. | The optimizer retrieves information from the system tables to create query plans. If system tables and indexes are allowed to become bloated over time, scanning the system tables increases query execution time. |
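The vacuum and reindex tasks above can be combined into a routine like the following. It is a sketch: the database name is supplied by the caller, `pg_class` stands in for the full list of catalog tables, and `psql` connectivity to the master is assumed.

```shell
# Sketch of routine catalog maintenance for one database. Assumes psql
# can reach the HAWQ master; pg_class is shown as one example catalog
# table -- repeat the VACUUM for each catalog table you maintain.
maintain_catalog() {
  db="$1"
  # Reclaim space occupied by deleted rows in the system catalogs.
  psql -d "$db" -c "VACUUM pg_catalog.pg_class;"
  # Rebuild catalog indexes so scans of system tables stay efficient.
  psql -d "$db" -c "REINDEX SYSTEM $db;"
}
# Usage (not run here): maintain_catalog mydb
```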
Patching and Upgrading
Activity | Procedure | Corrective Actions |
---|---|---|
Ensure any bug fixes or enhancements are applied to the kernel. Recommended frequency: at least every 6 months Severity: IMPORTANT | Follow the vendor’s instructions to update the Linux kernel. | Keep the kernel current to include bug fixes and security fixes, and to avoid difficult future upgrades. |
Install HAWQ minor releases. Recommended frequency: quarterly Severity: IMPORTANT | Always upgrade to the latest in the series. | Keep the HAWQ software current to incorporate bug fixes, performance enhancements, and feature enhancements into your HAWQ cluster. |