Best Practices for Operating HAWQ
This topic provides best practices for operating HAWQ, including recommendations for stopping, starting and monitoring HAWQ.
Best Practices Using the Command Line to Start/Stop HAWQ Cluster Members
The following best practices are recommended when using `hawq start` and `hawq stop` to manage your HAWQ cluster.
- Always use HAWQ management commands to start and stop HAWQ, instead of Postgres equivalents.
- Use `hawq start cluster` or `hawq restart cluster` to start the entire cluster, rather than starting the master and individual segments.
- If you do decide to use `hawq start standby|master|segment` to start nodes individually, always start the standby before the active master. Otherwise, the standby can become unsynchronized with the active master.
- When stopping a cluster, issue the `CHECKPOINT` command to update and flush all data files to disk and update the log file before stopping the cluster. A checkpoint ensures that, in the event of a crash, files can be restored from the checkpoint snapshot.

Stop the entire HAWQ system by stopping the cluster on the master host:

```shell
$ hawq stop cluster
```

To stop segments and kill any running queries without causing data loss or inconsistency issues, use `fast` or `immediate` mode on the cluster:

```shell
$ hawq stop cluster -M fast
$ hawq stop cluster -M immediate
```

Use `hawq stop master` to stop the master only. If you cannot stop the master due to running transactions, try using fast shutdown. If fast shutdown does not work, use immediate shutdown. Use immediate shutdown with caution, as it will result in a crash-recovery run when the system is restarted.

```shell
$ hawq stop master -M fast
$ hawq stop master -M immediate
```

When stopping a segment or all segments, you can use the default smart mode. Using fast or immediate mode on segments has no effect, since segments are stateless.

```shell
$ hawq stop segment
$ hawq stop allsegments
```
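The checkpoint-then-stop sequence above can be wrapped in a small shell function. This is a sketch only: it assumes `psql` and `hawq` are on the `PATH` on the master host, and the fast-mode fallback mirrors the guidance given for stopping the master.

```shell
# Sketch of the recommended shutdown sequence (assumes psql and hawq
# are on the PATH and the HAWQ master is reachable).
stop_hawq_cluster() {
  # Flush all data files to disk first, so that after a crash, files
  # can be restored from the checkpoint snapshot.
  psql -d postgres -c "CHECKPOINT;" || return 1
  # Try the default smart stop; fall back to fast mode if running
  # transactions prevent a smart shutdown.
  hawq stop cluster || hawq stop cluster -M fast
}
# Usage (not run here): stop_hawq_cluster
```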
Guidelines for Cluster Expansion
This topic provides some guidelines around expanding your HAWQ cluster.
There are several recommendations to keep in mind when modifying the size of your running HAWQ cluster:
- When you add a new node, install both a DataNode and a physical segment on the new node.
- After adding a new node, always rebalance HDFS data to maintain cluster performance.
- Adding or removing a node also necessitates an update to the HDFS metadata cache. This update happens eventually, but can take some time. To speed the update of the metadata cache, execute `SELECT gp_metadata_cache_clear();`.
- Note that for hash distributed tables, expanding the cluster does not immediately improve performance, because hash distributed tables use a fixed number of virtual segments. To obtain better performance with hash distributed tables, you must redistribute each table to the updated cluster with either the `ALTER TABLE` or `CREATE TABLE AS` command.
- If you are using hash tables, consider updating the `default_hash_table_bucket_number` server configuration parameter to a larger value after expanding the cluster but before redistributing the hash tables.
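The redistribution step above can be sketched as a shell helper. The database name `mydb` and table name `sales` are hypothetical, and the `SET WITH (REORGANIZE=true)` form of `ALTER TABLE` is assumed to be available, as in Greenplum-derived systems; verify against your HAWQ version before use.

```shell
# Sketch only: redistribute a hash-distributed table after cluster
# expansion. Database and table names are hypothetical; assumes psql
# can reach the HAWQ master.
redistribute_table() {
  db="$1"
  table="$2"
  # Rewrites the table's data across the current set of segments.
  psql -d "$db" -c "ALTER TABLE $table SET WITH (REORGANIZE=true);"
}
# Usage (not run here): redistribute_table mydb sales
```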
Best Practices for Monitoring a HAWQ System
The sections below provide recommendations for common monitoring tasks.
Note: If your HAWQ cluster is managed through the Ambari interface, use Ambari to perform the suggested monitoring tasks.
Database State Monitoring Activities
Activity | Procedure | Corrective Actions |
---|---|---|
List segments that are currently down. If any rows are returned, this should generate a warning or alert.<br>Recommended frequency: run every 5 to 10 minutes<br>Severity: IMPORTANT | Run the following query in the `postgres` database. | If the query returns any rows, follow these steps to correct the problem. |
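The exact query is not reproduced in this table, but one hedged sketch follows. It assumes the `gp_segment_configuration` catalog table with a `status` column in which `'u'` means up; check your HAWQ catalog before relying on it.

```shell
# Sketch: surface down segments. Assumes the gp_segment_configuration
# catalog table with a status column where 'u' means the segment is up.
list_down_segments() {
  psql -d postgres -At -c \
    "SELECT hostname FROM gp_segment_configuration WHERE status <> 'u';"
}
# Usage (not run here): alert if list_down_segments prints any rows.
```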
Hardware and Operating System Monitoring
Activity | Procedure | Corrective Actions |
---|---|---|
Check the underlying platform for required hardware maintenance or system failures.<br>Recommended frequency: real-time, if possible, or every 15 minutes<br>Severity: CRITICAL | Set up system checks for hardware and OS errors. | If required, remove a machine from the HAWQ cluster to resolve hardware and OS issues, then add it back to the cluster after the issues are resolved. |
Check disk space usage on volumes used for HAWQ data storage and the OS.<br>Recommended frequency: every 5 to 30 minutes<br>Severity: CRITICAL | Set up a disk space check. | Free space on the system by removing some data or files. |
Check for errors or dropped packets on the network interfaces.<br>Recommended frequency: hourly<br>Severity: IMPORTANT | Set up network interface checks. | Work with network and OS teams to resolve errors. |
Check for RAID errors or degraded RAID performance.<br>Recommended frequency: every 5 minutes<br>Severity: CRITICAL | Set up a RAID check. | |
Check for adequate I/O bandwidth and I/O skew.<br>Recommended frequency: when creating a cluster, or when hardware issues are suspected. | Run the `hawq checkperf` utility. | The cluster may be under-specified if data transfer rates fall short of the expected baseline for the hardware. If the machines on the cluster display an uneven performance profile, work with the system administration team to fix faulty machines. |
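The disk-space check in the table above can be scripted with standard tools. This is a minimal sketch: the mount point and the 80% threshold in the usage example are illustrative, not values from the source.

```shell
# Minimal disk-usage check: succeeds (exit 0) when the volume holding
# the given path is below the given percent used. Paths and thresholds
# in the usage example are illustrative.
check_disk() {
  path="$1"
  max_pct="$2"
  # df -P emits one POSIX-format data line; field 5 is "Use%" (e.g. 43%).
  used=$(df -P "$path" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
  [ "$used" -lt "$max_pct" ]
}
# Example: check_disk /data/hawq 80 && echo "disk OK"
```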
Data Maintenance
Activity | Procedure | Corrective Actions |
---|---|---|
Check for missing statistics on tables. | Check the `hawq_stats_missing` view in each database. | Run `ANALYZE` on tables that are missing statistics. |
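The check-then-analyze loop above can be sketched in shell. The `smischema` and `smitable` column names are assumptions based on similar catalog views; confirm them against your `hawq_stats_missing` view before use.

```shell
# Sketch: find tables missing statistics and ANALYZE each one.
# Assumes the hawq_stats_missing view exposes smischema and smitable
# columns (an assumption -- verify against your database).
analyze_missing() {
  db="$1"
  psql -d "$db" -At -c \
    "SELECT smischema || '.' || smitable FROM hawq_stats_missing;" |
  while read -r tbl; do
    psql -d "$db" -c "ANALYZE $tbl;"
  done
}
# Usage (not run here): analyze_missing mydb
```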
Database Maintenance
Activity | Procedure | Corrective Actions |
---|---|---|
Mark deleted rows in HAWQ system catalogs (tables in the `pg_catalog` schema) so that the space they occupy can be reused.<br>Recommended frequency: daily<br>Severity: CRITICAL | Vacuum each system catalog. | Vacuum system catalogs regularly to prevent bloating. |
Update table statistics.<br>Recommended frequency: after loading data and before executing queries<br>Severity: CRITICAL | Analyze user tables. | Analyze updated tables regularly so that the optimizer can produce efficient query execution plans. |
Back up the database data.<br>Recommended frequency: daily, or as required by your backup plan<br>Severity: CRITICAL | See Backing up and Restoring HAWQ Databases for a discussion of backup procedures. | Best practice is to have a current backup ready in case the database must be restored. |
Reindex system catalogs (tables in the `pg_catalog` schema) to maintain an efficient catalog.<br>Recommended frequency: weekly, or more often if database objects are created and dropped frequently | Run `REINDEX SYSTEM` in each database. | The optimizer retrieves information from the system tables to create query plans. If system tables and indexes are allowed to become bloated over time, scanning the system tables increases query execution time. |
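The vacuum and reindex tasks above can be combined into a routine like the following. It is a sketch: the database name is supplied by the caller, `pg_class` stands in for the full list of catalog tables, and `psql` connectivity to the master is assumed.

```shell
# Sketch of routine catalog maintenance for one database. Assumes psql
# can reach the HAWQ master; pg_class is shown as one example catalog
# table -- repeat the VACUUM for each catalog table you maintain.
maintain_catalog() {
  db="$1"
  # Reclaim space occupied by deleted rows in the system catalogs.
  psql -d "$db" -c "VACUUM pg_catalog.pg_class;"
  # Rebuild catalog indexes so scans of system tables stay efficient.
  psql -d "$db" -c "REINDEX SYSTEM $db;"
}
# Usage (not run here): maintain_catalog mydb
```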
Patching and Upgrading
Activity | Procedure | Corrective Actions |
---|---|---|
Ensure any bug fixes or enhancements are applied to the kernel. Recommended frequency: at least every 6 months Severity: IMPORTANT | Follow the vendor’s instructions to update the Linux kernel. | Keep the kernel current to include bug fixes and security fixes, and to avoid difficult future upgrades. |
Install HAWQ minor releases. Recommended frequency: quarterly Severity: IMPORTANT | Always upgrade to the latest in the series. | Keep the HAWQ software current to incorporate bug fixes, performance enhancements, and feature enhancements into your HAWQ cluster. |