HBase SQL Connector

Scan Source: Bounded
Lookup Source: Sync Mode
Sink: Batch
Sink: Streaming Upsert Mode

The HBase connector allows for reading from and writing to an HBase cluster. This document describes how to set up the HBase connector to run SQL queries against HBase.

The HBase connector always works in upsert mode to exchange changelog messages with the external system, using a primary key defined in the DDL. The primary key must be defined on the HBase rowkey field (the rowkey field must be declared). If the PRIMARY KEY clause is not declared, the HBase connector takes the rowkey as the primary key by default.

Dependencies

To set up the HBase connector, the following table provides dependency information both for projects using a build automation tool (such as Maven or SBT) and for the SQL Client with SQL JAR bundles.

HBase Version	Maven dependency	SQL Client JAR
1.4.x	`flink-connector-hbase_2.11`	Unsupported

Note: To use the HBase connector in the SQL Client or a Flink cluster, it's highly recommended to add the HBase dependency JARs to the Hadoop classpath. Flink automatically loads all JARs on the Hadoop classpath. Refer to "HBase, MapReduce, and the CLASSPATH" for how to add the HBase dependency JARs to the Hadoop classpath.

How to use HBase table

All column families in an HBase table must be declared as ROW type: the field name maps to the column family name, and the nested field names map to the column qualifier names. There is no need to declare all families and qualifiers in the schema; users can declare only those used in the query. Apart from the ROW type fields, the single field of atomic type (e.g. STRING, BIGINT) is recognized as the HBase rowkey. The rowkey field can have an arbitrary name, but it should be quoted with backticks if the name is a reserved keyword.

-- register the HBase table 'mytable' in Flink SQL
CREATE TABLE hTable (
 rowkey INT,
 family1 ROW<q1 INT>,
 family2 ROW<q2 STRING, q3 BIGINT>,
 family3 ROW<q4 DOUBLE, q5 BOOLEAN, q6 STRING>,
 PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
 'connector' = 'hbase-1.4',
 'table-name' = 'mytable',
 'zookeeper.quorum' = 'localhost:2181'
);
-- use the ROW(...) constructor to build the column families and write data into the HBase table.
-- assuming the schema of "T" is [rowkey, f1q1, f2q2, f2q3, f3q4, f3q5, f3q6]
INSERT INTO hTable
SELECT rowkey, ROW(f1q1), ROW(f2q2, f2q3), ROW(f3q4, f3q5, f3q6) FROM T;
-- scan data from the HBase table
SELECT rowkey, family1, family3.q4, family3.q6 FROM hTable;
-- temporal join the HBase table as a dimension table
SELECT * FROM myTopic
LEFT JOIN hTable FOR SYSTEM_TIME AS OF myTopic.proctime
ON myTopic.key = hTable.rowkey;
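
The temporal join above relies on `myTopic` exposing a processing-time attribute named `proctime`. A minimal sketch of such a table follows, assuming a single INT `key` column and using the built-in `datagen` connector as a stand-in for a real streaming source; the column list and connector options are illustrative assumptions, not part of the original example.

```sql
-- Hypothetical definition of the probe-side table used in the temporal join.
-- 'datagen' stands in for a real streaming source such as a message queue.
CREATE TABLE myTopic (
  `key` INT,                 -- quoted defensively in case the name is reserved
  proctime AS PROCTIME()     -- processing-time attribute used by FOR SYSTEM_TIME AS OF
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1'
);
```

With this definition, each incoming `myTopic` row looks up the HBase row whose rowkey equals `key` at processing time.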

Connector Options

Option	Required	Default	Type	Description
connector	required	(none)	String	Specifies which connector to use; here it should be `'hbase-1.4'`.
table-name	required	(none)	String	The name of the HBase table to connect to.
zookeeper.quorum	required	(none)	String	The HBase ZooKeeper quorum.
zookeeper.znode.parent	optional	/hbase	String	The root directory in ZooKeeper for the HBase cluster.
null-string-literal	optional	null	String	Representation of null values for string fields. The HBase source and sink encode/decode empty bytes as null values for all types except the string type.
sink.buffer-flush.max-size	optional	2mb	MemorySize	Writing option: the maximum size in memory of buffered rows for each write request. This can improve the performance of writing data to HBase, but may increase latency. Can be set to `'0'` to disable it.
sink.buffer-flush.max-rows	optional	1000	Integer	Writing option: the maximum number of rows to buffer for each write request. This can improve the performance of writing data to HBase, but may increase latency. Can be set to `'0'` to disable it.
sink.buffer-flush.interval	optional	1s	Duration	Writing option: the interval at which buffered rows are flushed. This can improve the performance of writing data to HBase, but may increase latency. Can be set to `'0'` to disable it. Note that both `'sink.buffer-flush.max-size'` and `'sink.buffer-flush.max-rows'` can be set to `'0'` while the flush interval is set, which allows for fully asynchronous processing of buffered actions.
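
As an illustration of the writing options above, the following sketch declares a sink table that buffers more data per write request than the defaults allow. The table name `hSinkTable` and the chosen values are hypothetical; suitable values depend on the workload and on how much extra latency is acceptable.

```sql
-- Hypothetical sink definition: larger buffers trade latency for throughput.
CREATE TABLE hSinkTable (
  rowkey INT,
  family1 ROW<q1 INT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-1.4',
  'table-name' = 'mytable',
  'zookeeper.quorum' = 'localhost:2181',
  'sink.buffer-flush.max-size' = '4mb',   -- default is 2mb
  'sink.buffer-flush.max-rows' = '2000',  -- default is 1000
  'sink.buffer-flush.interval' = '2s'     -- default is 1s
);
```

Setting both `'sink.buffer-flush.max-size'` and `'sink.buffer-flush.max-rows'` to `'0'` while keeping a non-zero interval would instead leave flushing entirely to the timer.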

Data Type Mapping

HBase stores all data as byte arrays. The data needs to be serialized and deserialized during read and write operations.

When serializing and deserializing, the Flink HBase connector uses the utility class org.apache.hadoop.hbase.util.Bytes provided by HBase (Hadoop) to convert Flink data types to and from byte arrays.

The Flink HBase connector encodes null values as empty bytes, and decodes empty bytes as null values, for all data types except the string type. For the string type, the null literal is determined by the `null-string-literal` option.
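
A brief sketch of how the null handling can be configured: the table below sets `null-string-literal` to `'N/A'`, a value chosen purely for illustration, as is the table name `hNullableTable`.

```sql
-- Hypothetical table illustrating null handling per column type.
CREATE TABLE hNullableTable (
  rowkey INT,
  -- q2 (STRING): nulls are represented by the configured literal 'N/A'
  -- q3 (BIGINT): nulls are represented by empty bytes
  family2 ROW<q2 STRING, q3 BIGINT>,
  PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
  'connector' = 'hbase-1.4',
  'table-name' = 'mytable',
  'zookeeper.quorum' = 'localhost:2181',
  'null-string-literal' = 'N/A'
);
```

One consequence worth keeping in mind is that a genuine string value equal to the configured literal would presumably be indistinguishable from null, so the literal should be one that does not occur in real data.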

The data type mappings are as follows:

Flink SQL type	HBase conversion
`CHAR / VARCHAR / STRING`	`byte[] toBytes(String s)` `String toString(byte[] b)`
`BOOLEAN`	`byte[] toBytes(boolean b)` `boolean toBoolean(byte[] b)`
`BINARY / VARBINARY`	Returns `byte[]` as is.
`DECIMAL`	`byte[] toBytes(BigDecimal v)` `BigDecimal toBigDecimal(byte[] b)`
`TINYINT`	`new byte[] { val }` `bytes[0] // returns first and only byte from bytes`
`SMALLINT`	`byte[] toBytes(short val)` `short toShort(byte[] bytes)`
`INT`	`byte[] toBytes(int val)` `int toInt(byte[] bytes)`
`BIGINT`	`byte[] toBytes(long val)` `long toLong(byte[] bytes)`
`FLOAT`	`byte[] toBytes(float val)` `float toFloat(byte[] bytes)`
`DOUBLE`	`byte[] toBytes(double val)` `double toDouble(byte[] bytes)`
`DATE`	Stores the number of days since epoch as int value.
`TIME`	Stores the number of milliseconds of the day as int value.
`TIMESTAMP`	Stores the milliseconds since epoch as long value.
`ARRAY`	Not supported
`MAP / MULTISET`	Not supported
`ROW`	Not supported
