Hive UDF

Hive Bitmap UDF provides UDFs for generating bitmaps and performing bitmap operations in Hive tables. The bitmap format in Hive is exactly the same as the Doris bitmap format, so a bitmap produced in Hive can be imported into Doris through spark bitmap load.

Main purposes:

  1. Reduce the time needed to import data into Doris by removing steps such as dictionary building and bitmap pre-aggregation;
  2. Save Hive storage: bitmaps compress the data and reduce storage cost;
  3. Provide flexible bitmap operations in Hive, such as intersection, union, and difference; the resulting bitmaps can also be imported directly into Doris.

How To Use

Create Bitmap type table in Hive

    -- Example: Create Hive Bitmap Table
    CREATE TABLE IF NOT EXISTS `hive_bitmap_table`(
      `k1`   int    COMMENT '',
      `k2`   String COMMENT '',
      `k3`   String COMMENT '',
      `uuid` binary COMMENT 'bitmap'
    ) comment 'comment';

    -- Example: Create Hive Table
    CREATE TABLE IF NOT EXISTS `hive_table`(
      `k1`   int    COMMENT '',
      `k2`   String COMMENT '',
      `k3`   String COMMENT '',
      `uuid` int    COMMENT ''
    ) comment 'comment';
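
To follow the later examples end to end, the source table needs a few rows. The statement below is only a sketch with hypothetical sample values (they are not from the original documentation); it assumes a Hive version that supports INSERT ... VALUES.

```sql
-- Hypothetical sample rows for hive_table; the values are illustrative only.
-- The first and third rows repeat the same uuid within one (k1, k2, k3) group,
-- so a bitmap built from that group should contain two distinct elements.
INSERT INTO TABLE hive_table VALUES
  (1, 'a', 'b', 1001),
  (1, 'a', 'b', 1002),
  (1, 'a', 'b', 1001),
  (2, 'a', 'c', 1003);
```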

Hive Bitmap UDF Usage:

Hive Bitmap UDF is used in Hive/Spark. First, you need to compile FE to get hive-udf-jar-with-dependencies.jar. Compilation preparation: if you have already compiled the ldb source code, you can compile FE directly. If you have not, you need to install thrift manually. Reference: Setting Up dev env for FE.

    # Clone the Doris code
    git clone https://github.com/apache/doris.git
    cd doris
    git submodule update --init --recursive
    # Install thrift (see "Setting Up dev env for FE")
    # Enter the fe directory
    cd fe
    # Run the Maven packaging command; all FE submodules will be packaged
    mvn package -Dmaven.test.skip=true
    # Alternatively, package only the hive-udf module
    mvn package -pl hive-udf -am -Dmaven.test.skip=true

After packaging and compiling, enter the hive-udf directory. It will contain a target directory, which holds the hive-udf.jar package.

    -- Load the Hive Bitmap UDF jar package (upload the compiled hive-udf jar package to HDFS first)
    add jar hdfs://node:9001/hive-udf-jar-with-dependencies.jar;

    -- Create Hive Bitmap UDAF functions
    create temporary function to_bitmap as 'org.apache.doris.udf.ToBitmapUDAF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
    create temporary function bitmap_union as 'org.apache.doris.udf.BitmapUnionUDAF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';

    -- Create Hive Bitmap UDF functions
    create temporary function bitmap_count as 'org.apache.doris.udf.BitmapCountUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
    create temporary function bitmap_and as 'org.apache.doris.udf.BitmapAndUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
    create temporary function bitmap_or as 'org.apache.doris.udf.BitmapOrUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';
    create temporary function bitmap_xor as 'org.apache.doris.udf.BitmapXorUDF' USING JAR 'hdfs://node:9001/hive-udf-jar-with-dependencies.jar';

    -- Example: Generate a bitmap with the to_bitmap function and write it to the Hive Bitmap table
    insert into hive_bitmap_table
    select
        k1,
        k2,
        k3,
        to_bitmap(uuid) as uuid
    from
        hive_table
    group by
        k1,
        k2,
        k3;

    -- Example: The bitmap_count function calculates the number of elements in the bitmap
    select k1, k2, k3, bitmap_count(uuid) from hive_bitmap_table;

    -- Example: The bitmap_union function calculates the union of the bitmaps in each group
    select k1, bitmap_union(uuid) from hive_bitmap_table group by k1;
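
The examples above do not exercise bitmap_and, bitmap_or, or bitmap_xor. The query below is a minimal sketch of how they might be combined with bitmap_count. It assumes that each of these functions takes two bitmap (binary) arguments and returns a bitmap, and the k1 values 1 and 2 are hypothetical.

```sql
-- Sketch: compare the uuid sets of two k1 groups (k1 = 1 and k1 = 2 are hypothetical values).
-- Assumption: bitmap_and / bitmap_or / bitmap_xor accept two binary bitmaps and return a binary bitmap.
select
    bitmap_count(bitmap_and(a.uuid, b.uuid)) as in_both,
    bitmap_count(bitmap_or(a.uuid, b.uuid))  as in_either,
    bitmap_count(bitmap_xor(a.uuid, b.uuid)) as in_exactly_one
from
    (select bitmap_union(uuid) as uuid from hive_bitmap_table where k1 = 1) a
cross join
    (select bitmap_union(uuid) as uuid from hive_bitmap_table where k1 = 2) b;
```

Each subquery first merges all bitmaps of one group into a single row, so the cross join produces exactly one result row.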

Hive Bitmap UDF Description

Hive Bitmap import into Doris

Method 1: Catalog

When a Hive table is created with the TEXT storage format, Hive stores Binary columns as base64-encoded strings. The binary data can therefore be read through Doris's Hive Catalog and saved directly as a Bitmap by using the bitmap_from_base64 function.

Here is a full example:

  1. Create the Hive table in Hive:

      CREATE TABLE IF NOT EXISTS `test`.`hive_bitmap_table`(
        `k1`   int    COMMENT '',
        `k2`   String COMMENT '',
        `k3`   String COMMENT '',
        `uuid` binary COMMENT 'bitmap'
      ) stored as textfile;

  2. Create a Catalog in Doris:

      CREATE CATALOG hive PROPERTIES (
        'type' = 'hms',
        'hive.metastore.uris' = 'thrift://127.0.0.1:9083'
      );

  3. Create the Doris internal table:

      CREATE TABLE IF NOT EXISTS `test`.`doris_bitmap_table`(
        `k1`   int    COMMENT '',
        `k2`   String COMMENT '',
        `k3`   String COMMENT '',
        `uuid` BITMAP BITMAP_UNION COMMENT 'bitmap'
      )
      AGGREGATE KEY(k1, k2, k3)
      -- The distribution column must be one of the key columns
      DISTRIBUTED BY HASH(`k1`) BUCKETS 1
      PROPERTIES (
        "replication_allocation" = "tag.location.default: 1"
      );

  4. Insert the data from Hive into Doris:

      insert into doris_bitmap_table select k1, k2, k3, bitmap_from_base64(uuid) from hive.test.hive_bitmap_table;
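
After the import, one way to sanity-check the result is to compare bitmap cardinalities on both sides from within Doris. This is only a sketch: it assumes the catalog is named hive as in the example above, that the Doris table lives in database test of the built-in internal catalog, and that the Doris functions bitmap_union_count and bitmap_from_base64 are available in your version.

```sql
-- Cardinality per key as seen through the Hive Catalog (base64 text decoded on the fly).
select k1, k2, k3, bitmap_union_count(bitmap_from_base64(uuid)) as cnt
from hive.test.hive_bitmap_table
group by k1, k2, k3
order by k1, k2, k3;

-- Cardinality per key in the Doris internal table; the two result sets should match.
select k1, k2, k3, bitmap_union_count(uuid) as cnt
from internal.test.doris_bitmap_table
group by k1, k2, k3
order by k1, k2, k3;
```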

Method 2: Spark Load

For details, see Spark Load -> Basic operation -> Create load (Example 3: when the upstream data source is a Hive binary type table).