Lesson 5 - HAWQ Tables

HAWQ writes data to, and reads data from, HDFS natively. HAWQ tables are similar to tables in any relational database, except that table rows (data) are distributed across the different segments in the cluster.

In this exercise, you will run scripts that use the SQL CREATE TABLE command to create HAWQ tables. You will load the Retail demo fact data into the HAWQ tables using the SQL COPY command. You will then perform simple and complex queries on the data.

Prerequisites

Ensure that you have:

Exercise: Create, Add Data to, and Query HAWQ Retail Demo Tables

Perform the following steps to create and load HAWQ tables from the sample Retail demo data set.

  1. Navigate to the HAWQ script directory:

    gpadmin@master$ cd $HAWQGSBASE/tutorials/getstart/hawq
  2. Create tables for the Retail demo fact data using the script provided:

    gpadmin@master$ psql -f ./create_hawq_tables.sql
    psql:./create_hawq_tables.sql:2: NOTICE: table "order_lineitems_hawq" does not exist, skipping
    DROP TABLE
    CREATE TABLE
    psql:./create_hawq_tables.sql:41: NOTICE: table "orders_hawq" does not exist, skipping
    DROP TABLE
    CREATE TABLE

    Note: The create_hawq_tables.sql script deletes each table before attempting to create it. If this is your first time performing this exercise, you can safely ignore the psql "table does not exist, skipping" messages.

  3. Take a look at the create_hawq_tables.sql script:

    gpadmin@master$ vi create_hawq_tables.sql

    Notice the use of the retail_demo. schema name prefix to the order_lineitems_hawq table name:

    DROP TABLE IF EXISTS retail_demo.order_lineitems_hawq;
    CREATE TABLE retail_demo.order_lineitems_hawq
    (
        order_id TEXT,
        order_item_id TEXT,
        product_id TEXT,
        product_name TEXT,
        customer_id TEXT,
        store_id TEXT,
        item_shipment_status_code TEXT,
        order_datetime TEXT,
        ship_datetime TEXT,
        item_return_datetime TEXT,
        item_refund_datetime TEXT,
        product_category_id TEXT,
        product_category_name TEXT,
        payment_method_code TEXT,
        tax_amount TEXT,
        item_quantity TEXT,
        item_price TEXT,
        discount_amount TEXT,
        coupon_code TEXT,
        coupon_amount TEXT,
        ship_address_line1 TEXT,
        ship_address_line2 TEXT,
        ship_address_line3 TEXT,
        ship_address_city TEXT,
        ship_address_state TEXT,
        ship_address_postal_code TEXT,
        ship_address_country TEXT,
        ship_phone_number TEXT,
        ship_customer_name TEXT,
        ship_customer_email_address TEXT,
        ordering_session_id TEXT,
        website_url TEXT
    )
    WITH (appendonly=true, compresstype=zlib) DISTRIBUTED RANDOMLY;

    The CREATE TABLE statement above creates a table named order_lineitems_hawq in the retail_demo schema. All of its columns are of type TEXT. The order_id and customer_id columns provide keys into the orders fact and customers dimension tables. The table is append-only; its data is distributed randomly across the segments and is compressed using the zlib compression algorithm.

    The create_hawq_tables.sql script also creates the orders_hawq fact table.
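
    Because every column is declared as TEXT, numeric work on this table needs an explicit cast at query time, as the ::float8 casts later in this lesson do. The following is a minimal sketch, not part of the tutorial scripts: line_total_sql is a hypothetical helper that only prints a query computing a per-line total for one order, and the ::int cast assumes that item_quantity holds whole numbers.

    ```shell
    # Hypothetical helper (not in the tutorial scripts): print a query that
    # casts the TEXT columns item_quantity and item_price before multiplying.
    line_total_sql() {
        printf "SELECT order_item_id, item_quantity::int * item_price::float8 AS line_total\n"
        printf "FROM retail_demo.order_lineitems_hawq\n"
        printf "WHERE order_id='%s';\n" "$1"
    }

    # Print the query for the order id used later in this lesson.
    line_total_sql 8467975147
    ```

    To run the generated query, pipe it to psql, for example: line_total_sql 8467975147 | psql -d hawqgsdb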

  4. Take a look at the load_hawq_tables.sh script:

    gpadmin@master$ vi load_hawq_tables.sh

    Again, notice the use of the retail_demo. schema name prefix to the table names.

    Examine the psql -c COPY commands:

    zcat $DATADIR/order_lineitems.tsv.gz | psql -d hawqgsdb -c "COPY retail_demo.order_lineitems_hawq FROM STDIN DELIMITER E'\t' NULL E'';"
    zcat $DATADIR/orders.tsv.gz | psql -d hawqgsdb -c "COPY retail_demo.orders_hawq FROM STDIN DELIMITER E'\t' NULL E'';"

    The load_hawq_tables.sh shell script uses the zcat command to uncompress the .tsv.gz data files. The SQL COPY command copies STDIN (that is, the output of the zcat command) to the HAWQ table. The COPY command also identifies the DELIMITER used in the file (tab) and the NULL string (the empty string, '').
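
    To see what COPY receives on STDIN, it can help to run the same kind of pipeline on a tiny file. This sketch builds a three-column sample (hypothetical data and file name, not part of the Retail demo set), uncompresses it the way the load script does, and counts the tab-separated fields in each row. The empty trailing field in the first row is the kind of value that NULL E'' tells COPY to load as NULL.

    ```shell
    # Create a small gzip-compressed, tab-separated sample file (hypothetical data).
    sample_dir=$(mktemp -d)
    printf '1001\tA1\t\n1002\tB2\t0.50\n' | gzip > "$sample_dir/sample.tsv.gz"

    # field_counts: uncompress a .tsv.gz file and print the number of
    # tab-separated fields in each row (empty fields are counted too).
    field_counts() {
        zcat < "$1" | awk -F'\t' '{ print NF }'
    }

    field_counts "$sample_dir/sample.tsv.gz"
    ```

    Both rows report the same field count, which is what COPY requires: every row must supply a value, possibly empty, for every column.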

  5. Use the load_hawq_tables.sh script to load the Retail demo fact data into the newly created tables. This process may take some time to complete.

    gpadmin@master$ ./load_hawq_tables.sh
  6. Use the provided script to verify that the Retail demo fact tables were loaded successfully:

    gpadmin@master$ ./verify_load_hawq_tables.sh

    The output of the verify_load_hawq_tables.sh script should match the following:

     Table Name                   | Count
    ------------------------------+------------------------
     order_lineitems_hawq         | 744196
     orders_hawq                  | 512071
    ------------------------------+------------------------
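
    If the counts differ, one way to investigate is to count the rows in the source files and compare. This is a sketch under two assumptions: that the .tsv.gz files contain one row per line with no header line, and that DATADIR points to the same directory that load_hawq_tables.sh reads from. count_rows is a hypothetical helper, not one of the provided scripts.

    ```shell
    # count_rows: print the number of lines in a gzip-compressed text file.
    count_rows() {
        zcat < "$1" | awk 'END { print NR }'
    }

    # Print the source row count only if the data file is present.
    if [ -n "${DATADIR:-}" ] && [ -f "$DATADIR/orders.tsv.gz" ]; then
        count_rows "$DATADIR/orders.tsv.gz"
    fi
    ```
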
  7. Run a query on the order_lineitems_hawq table that returns the product_id, item_quantity, item_price, and coupon_amount for all order line items associated with order id 8467975147:

    gpadmin@master$ psql
    hawqgsdb=# SELECT product_id, item_quantity, item_price, coupon_amount
                 FROM retail_demo.order_lineitems_hawq
                 WHERE order_id='8467975147' ORDER BY item_price;
     product_id | item_quantity | item_price | coupon_amount
    ------------+---------------+------------+---------------
     1611429    | 1             | 11.38      | 0.00000
     1035114    | 1             | 12.95      | 0.15000
     1382850    | 1             | 17.56      | 0.50000
     1562908    | 1             | 18.50      | 0.00000
     1248913    | 1             | 34.99      | 0.50000
     741706     | 1             | 45.99      | 0.00000
    (6 rows)

    The ORDER BY clause identifies the sort column, item_price. If you do not specify an ORDER BY clause, the order in which the rows are returned is not guaranteed.
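
    One detail worth keeping in mind: item_price is a TEXT column, so ORDER BY item_price compares strings, not numbers. The result above looks numerically sorted only because all six prices have the same number of digits. The difference can be illustrated outside the database with the sort command on made-up prices. The function names text_order and numeric_order are hypothetical and used only here, and byte-wise (C locale) comparison is assumed for the string case.

    ```shell
    # Three made-up prices with different numbers of digits.
    prices() { printf '9.99\n11.38\n100.00\n'; }

    # String comparison, analogous to ORDER BY on a TEXT column.
    text_order() { prices | LC_ALL=C sort; }

    # Numeric comparison, analogous to ORDER BY item_price::float8.
    numeric_order() { prices | sort -n; }

    text_order      # prints 100.00, 11.38, 9.99 (one per line)
    numeric_order   # prints 9.99, 11.38, 100.00 (one per line)
    ```

    In a query, the numeric ordering corresponds to casting the sort column, for example ORDER BY item_price::float8.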

  8. Determine the top three postal codes by order revenue by running the following query on the orders_hawq table:

    hawqgsdb=# SELECT billing_address_postal_code,
                 sum(total_paid_amount::float8) AS total,
                 sum(total_tax_amount::float8) AS tax
               FROM retail_demo.orders_hawq
               GROUP BY billing_address_postal_code
               ORDER BY total DESC LIMIT 3;

    Notice the use of the sum() aggregate function to add the order totals (total_paid_amount) and tax totals (total_tax_amount) for all orders. Both columns are cast to float8 because they are stored as TEXT. The rows are grouped by billing_address_postal_code, and the totals are summed within each group.

    Compare your output to the following:

     billing_address_postal_code |   total   |    tax
    -----------------------------+-----------+-----------
     48001                       | 111868.32 | 6712.0992
     15329                       | 107958.24 | 6477.4944
     42714                       | 103244.58 | 6194.6748
    (3 rows)
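
    For readers new to aggregates, the grouping logic of this query can be mimicked on a few rows with standard shell tools. The sketch below uses made-up postal codes and amounts, not Retail demo data, and the function names are hypothetical. awk keeps a running sum per postal code (the GROUP BY and sum() steps), and sort with head stands in for ORDER BY total DESC LIMIT 3.

    ```shell
    # Made-up rows: postal code, then paid amount, separated by a tab.
    sample_orders() {
        printf '48001\t10.00\n15329\t5.00\n48001\t2.50\n42714\t1.25\n90210\t0.75\n'
    }

    # top_totals: sum the amounts per postal code, sort by total in
    # descending order, and keep the three largest totals.
    top_totals() {
        sample_orders |
            awk -F'\t' '{ total[$1] += $2 } END { for (code in total) printf "%s\t%.2f\n", code, total[code] }' |
            sort -k2,2nr |
            head -n 3
    }

    top_totals
    ```

    The two rows for postal code 48001 collapse into a single output row, just as each postal code appears once in the query result.
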
  9. Run the following query, which joins the orders_hawq and order_lineitems_hawq tables on order_id, to display the order_id, product_id, item_quantity, and item_price for all line items with a product_id of 1869831:

    hawqgsdb=# SELECT retail_demo.order_lineitems_hawq.order_id, product_id, item_quantity, item_price
                 FROM retail_demo.order_lineitems_hawq, retail_demo.orders_hawq
                 WHERE retail_demo.order_lineitems_hawq.order_id=retail_demo.orders_hawq.order_id
                   AND retail_demo.order_lineitems_hawq.product_id=1869831
                 ORDER BY retail_demo.order_lineitems_hawq.order_id, product_id;
     order_id   | product_id | item_quantity | item_price
    ------------+------------+---------------+------------
     4831097728 | 1869831    | 1             | 11.87
     6734073469 | 1869831    | 1             | 11.87
    (2 rows)
  10. Exit the psql subsystem:

    hawqgsdb=# \q

Summary

In this lesson, you created and loaded Retail order and order line item data into HAWQ fact tables. You also queried these tables, learning how to filter the data to your needs.

In Lesson 6, you will use PXF external tables to access dimension data stored in HDFS in a similar way.
