Phoenix Storage Handler for Apache Hive

The Apache Phoenix Storage Handler is a plugin that enables Apache Hive to access Phoenix tables from the Hive command line using HiveQL.

Prerequisites

  • Phoenix 4.8.0+
  • Hive 1.2.1+

Hive Setup

Make phoenix-<version>-hive.jar available to Hive:

Step 1: Add to hive-env.sh:

  HIVE_AUX_JARS_PATH=<path to jar>

Step 2: Add a property to hive-site.xml so that Hive MapReduce jobs can use the .jar:

  <property>
    <name>hive.aux.jars.path</name>
    <value>file://<path></value>
  </property>

Table Creation and Deletion

The Phoenix Storage Handler supports both INTERNAL and EXTERNAL Hive tables.

Create INTERNAL Table

For INTERNAL tables, Hive manages the lifecycle of the table and data. When a Hive table is created, a corresponding Phoenix table is also created. Once the Hive table is dropped, the Phoenix table is also deleted.

  create table phoenix_table (
    s1 string,
    i1 int,
    f1 float,
    d1 double
  )
  STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
  TBLPROPERTIES (
    "phoenix.table.name" = "phoenix_table",
    "phoenix.zookeeper.quorum" = "localhost",
    "phoenix.zookeeper.znode.parent" = "/hbase",
    "phoenix.zookeeper.client.port" = "2181",
    "phoenix.rowkeys" = "s1, i1",
    "phoenix.column.mapping" = "s1:s1, i1:i1, f1:f1, d1:d1",
    "phoenix.table.options" = "SALT_BUCKETS=10, DATA_BLOCK_ENCODING='DIFF'"
  );
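Once created, the table can be written to and read from Hive like any other table. A minimal sketch (the literal values below are illustrative, not from the original document):

```sql
-- Hive CLI: write and read the internal table defined above
insert into table phoenix_table values ('key1', 10, 1.0, 2.0);
select s1, i1, f1, d1 from phoenix_table where s1 = 'key1';
```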

Create EXTERNAL Table

For EXTERNAL tables, Hive works with an existing Phoenix table and manages only Hive metadata. Dropping an EXTERNAL table from Hive deletes only Hive metadata but does not delete the Phoenix table.

  create external table ext_table (
    i1 int,
    s1 string,
    f1 float,
    d1 decimal
  )
  STORED BY 'org.apache.phoenix.hive.PhoenixStorageHandler'
  TBLPROPERTIES (
    "phoenix.table.name" = "ext_table",
    "phoenix.zookeeper.quorum" = "localhost",
    "phoenix.zookeeper.znode.parent" = "/hbase",
    "phoenix.zookeeper.client.port" = "2181",
    "phoenix.rowkeys" = "i1",
    "phoenix.column.mapping" = "i1:i1, s1:s1, f1:f1, d1:d1"
  );
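Because an EXTERNAL table attaches to an existing Phoenix table, the Phoenix table must be created first, for example from sqlline.py. A sketch of a matching Phoenix DDL, assuming column names that mirror the Hive definition above:

```sql
-- Phoenix (sqlline.py): create the table before declaring
-- the Hive EXTERNAL table that maps onto it
CREATE TABLE ext_table (
  i1 INTEGER NOT NULL PRIMARY KEY,
  s1 VARCHAR,
  f1 FLOAT,
  d1 DECIMAL
);
```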

Properties

  • phoenix.table.name
    • Specifies the Phoenix table name
    • Default: the same as the Hive table name
  • phoenix.zookeeper.quorum
    • Specifies the ZooKeeper quorum for HBase
    • Default: localhost
  • phoenix.zookeeper.znode.parent
    • Specifies the ZooKeeper parent node for HBase
    • Default: /hbase
  • phoenix.zookeeper.client.port
    • Specifies the ZooKeeper port
    • Default: 2181
  • phoenix.rowkeys
    • The list of columns to be the primary key in a Phoenix table
    • Required
  • phoenix.column.mapping
    • Mappings between column names for Hive and Phoenix. See Limitations for details.

Data Ingestion, Deletions, and Updates

Data can be ingested in any of the ways that Hive and Phoenix support:

Hive:

  insert into table T values (....);
  insert into table T select c1, c2, c3 from source_table;

Phoenix:

  • upsert into table T values (.....);
  • the Phoenix CSV BulkLoad tools

All delete and update operations should be performed on the Phoenix side. See Limitations for more details.
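Since deletes and updates go through Phoenix, they are expressed in Phoenix SQL. A minimal sketch against the internal table from the earlier example (predicate values are illustrative):

```sql
-- Phoenix (sqlline.py): updates are expressed as UPSERTs,
-- deletes as ordinary DELETE statements
UPSERT INTO phoenix_table (s1, i1, f1, d1) VALUES ('key1', 10, 1.5, 2.5);
DELETE FROM phoenix_table WHERE s1 = 'key1' AND i1 = 10;
```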

Additional Configuration Options

These options can be set in the Hive command-line interface (CLI) environment.

Performance Tuning

  • phoenix.upsert.batch.size
    • Batch size for upserts
    • Default: 1000
  • [phoenix-table-name].disable.wal
    • Temporarily sets the table attribute DISABLE_WAL to true; sometimes used to improve write performance
    • Default: false
  • [phoenix-table-name].auto.flush
    • When the WAL is disabled and this value is true, the MemStore is flushed to an HFile
    • Default: false
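These are session-level settings; for example, in the Hive CLI (the table name phoenix_table and the batch size are assumptions for illustration):

```sql
-- Hive CLI session: larger upsert batches and WAL disabled
-- for a bulk write (re-enable the WAL afterwards)
set phoenix.upsert.batch.size=5000;
set phoenix_table.disable.wal=true;
set phoenix_table.auto.flush=true;
```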

Query Data

You can use HiveQL to query data in a Phoenix table. With the property settings hive.fetch.task.conversion=more and hive.exec.parallel=true, a Hive query on a single table can be as fast as running the same query in the Phoenix CLI.

  • hbase.scan.cache
    • Number of rows fetched per scan request
    • Default: 100
  • hbase.scan.cacheblock
    • Whether scanned blocks are cached
    • Default: false
  • split.by.stats
    • If true, mappers use table statistics, with one mapper per guide post
    • Default: false
  • [hive-table-name].reducer.count
    • Number of reducers; in Tez mode, this affects only single-table queries (see Limitations)
    • Default: 1
  • [phoenix-table-name].query.hint
    • Hint for the Phoenix query (for example, NO_INDEX)
    • Default: none
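As with the write-side options, these can be set per session. A sketch combining the fetch/parallel settings mentioned above with the scan parameters (the table name and cache value are illustrative assumptions):

```sql
-- Hive CLI session: tune a single-table query over Phoenix
set hive.fetch.task.conversion=more;
set hive.exec.parallel=true;
set hbase.scan.cache=1000;
set split.by.stats=true;
select count(*) from phoenix_table;
```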

Limitations

  • Hive update and delete operations require transaction manager support on both Hive and Phoenix sides. Related Hive and Phoenix JIRAs are listed in the Resources section.
  • Column mapping does not work correctly for row key columns.
  • MapReduce and Tez jobs always have a single reducer.

Resources

  • PHOENIX-2743 : The implementation accepted by the Apache Phoenix community. The original pull request contains modifications to Hive classes.
  • PHOENIX-331 : An outdated implementation with support for Hive 0.98.

Source: http://phoenix.apache.org/hive_storage_handler.html