Hive HLL UDF

The Hive HLL UDF provides a set of UDFs for generating HLL operations in Hive tables, which are identical to Doris HLL. Hive HLL can be imported into Doris through Spark HLL Load. For more information about HLL, please refer to Using HLL for Approximate Deduplication.:Approximate Deduplication Using HLL

Function Introduction:

  1. UDAF

    · to_hll: An aggregate function that returns a Doris HLL column, similar to the to_bitmap function

    · hll_union:An aggregate function that calculates the union of groups, returning a Doris HLL column, similar to the bitmap_union function

  2. UDF

    · hll_cardinality: Returns the number of distinct elements added to the HLL, similar to the bitmap_count function

    Main Purpose:

  3. Reduce data import time to Doris by eliminating the need for dictionary construction and HLL pre-aggregation

  4. Save Hive storage by compressing data using HLL, significantly reducing storage costs compared to Bitmap statistics

  5. Provide flexible HLL operations in Hive, including union and cardinality statistics, and allow the resulting HLL to be directly imported into Doris

    Note: HLL statistics are approximate calculations with an error rate of around 1% to 2%.

Usage

Create a Hive table and insert test data

  1. -- Create a test database, e.g., hive_test
  2. use hive_test;
  3. -- Create a Hive HLL table
  4. CREATE TABLE IF NOT EXISTS `hive_hll_table`(
  5. `k1` int COMMENT '',
  6. `k2` String COMMENT '',
  7. `k3` String COMMENT '',
  8. `uuid` binary COMMENT 'hll'
  9. ) comment 'comment'
  10. -- Create a normal Hive table and insert test data
  11. CREATE TABLE IF NOT EXISTS `hive_table`(
  12. `k1` int COMMENT '',
  13. `k2` String COMMENT '',
  14. `k3` String COMMENT '',
  15. `uuid` int COMMENT ''
  16. ) comment 'comment'
  17. insert into hive_table select 1, 'a', 'b', 12345;
  18. insert into hive_table select 1, 'a', 'c', 12345;
  19. insert into hive_table select 2, 'b', 'c', 23456;
  20. insert into hive_table select 3, 'c', 'd', 34567;

Use Hive HLL UDF:

Hive HLL UDF needs to be used in Hive/Spark. First, compile the FE to obtain the hive-udf.jar file. Compilation preparation: If you have compiled the ldb source code, you can directly compile the FE. If not, you need to manually install thrift, refer to Setting Up Dec Env for FE - IntelliJ IDEA for compilation and installation.

  1. -- Clone the Doris source code
  2. git clone https://github.com/apache/doris.git
  3. cd doris
  4. git submodule update --init --recursive
  5. -- Install thrift (skip if already installed)
  6. -- Enter the FE directory
  7. cd fe
  8. -- Execute the Maven packaging command (all FE submodules will be packaged)
  9. mvn package -Dmaven.test.skip=true
  10. -- Or package only the hive-udf module
  11. mvn package -pl hive-udf -am -Dmaven.test.skip=true
  12. -- The packaged hive-udf.jar file will be generated in the target directory
  13. -- Upload the compiled hive-udf.jar file to HDFS, e.g., to the root directory
  14. hdfs dfs -put hive-udf/target/hive-udf.jar /

Then, enter Hive and execute the following SQL statements:

  1. -- Load the hive hll udf jar package, modify the hostname and port according to your actual situation
  2. add jar hdfs://hostname:port/hive-udf.jar;
  3. -- Create UDAF functions
  4. create temporary function to_hll as 'org.apache.doris.udf.ToHllUDAF' USING JAR 'hdfs://hostname:port/hive-udf.jar';
  5. create temporary function hll_union as 'org.apache.doris.udf.HllUnionUDAF' USING JAR 'hdfs://hostname:port/hive-udf.jar';
  6. -- Create UDF functions
  7. create temporary function hll_cardinality as 'org.apache.doris.udf.HllCardinalityUDF' USING JAR 'hdfs://node:9000/hive-udf.jar';
  8. -- Example: Use the to_hll UDAF to aggregate and generate HLL, and write it to the Hive HLL table
  9. insert into hive_hll_table
  10. select
  11. k1,
  12. k2,
  13. k3,
  14. to_hll(uuid) as uuid
  15. from
  16. hive_table
  17. group by
  18. k1,
  19. k2,
  20. k3
  21. -- Example: Use hll_cardinality to calculate the number of elements in the HLL
  22. select k1, k2, k3, hll_cardinality(uuid) from hive_hll_table;
  23. +-----+-----+-----+------+
  24. | k1 | k2 | k3 | _c3 |
  25. +-----+-----+-----+------+
  26. | 1 | a | b | 1 |
  27. | 1 | a | c | 1 |
  28. | 2 | b | c | 1 |
  29. | 3 | c | d | 1 |
  30. +-----+-----+-----+------+
  31. -- Example: Use hll_union to calculate the union of groups, returning 3 rows
  32. select k1, hll_union(uuid) from hive_hll_table group by k1;
  33. -- Example: Also can merge and then continue to statistics
  34. select k3, hll_cardinality(hll_union(uuid)) from hive_hll_table group by k3;
  35. +-----+------+
  36. | k3 | _c1 |
  37. +-----+------+
  38. | b | 1 |
  39. | c | 2 |
  40. | d | 1 |
  41. +-----+------+

Hive HLL UDF Explanation

Importing Hive HLL to Doris

Create Hive table specified as TEXT format. For Binary type, Hive will save it as a base64 encoded string. At this time, you can use the Hive Catalog to directly import the HLL data into Doris using the hll_from_base64 function.

Here is a complete example:

  1. Create a Hive table
  1. CREATE TABLE IF NOT EXISTS `hive_hll_table`(
  2. `k1` int COMMENT '',
  3. `k2` String COMMENT '',
  4. `k3` String COMMENT '',
  5. `uuid` binary COMMENT 'hll'
  6. ) stored as textfile
  7. -- then reuse the previous steps to insert data from a normal table into it using the to_hll function
  1. Create a Doris catalog
  1. CREATE CATALOG hive PROPERTIES (
  2. 'type'='hms',
  3. 'hive.metastore.uris' = 'thrift://127.0.0.1:9083'
  4. );
  1. Create a Doris internal table
  1. CREATE TABLE IF NOT EXISTS `doris_test`.`doris_hll_table`(
  2. `k1` int COMMENT '',
  3. `k2` varchar(10) COMMENT '',
  4. `k3` varchar(10) COMMENT '',
  5. `uuid` HLL HLL_UNION COMMENT 'hll'
  6. )
  7. AGGREGATE KEY(k1, k2, k3)
  8. DISTRIBUTED BY HASH(`k1`) BUCKETS 1
  9. PROPERTIES (
  10. "replication_allocation" = "tag.location.default: 1"
  11. );
  1. Import data from Hive to Doris
  1. insert into doris_hll_table select k1, k2, k3, hll_from_base64(uuid) from hive.hive_test.hive_hll_table;
  2. -- View the imported data, combining hll_to_base64 for decoding
  3. select *, hll_to_base64(uuid) from doris_hll_table;
  4. +------+------+------+------+---------------------+
  5. | k1 | k2 | k3 | uuid | hll_to_base64(uuid) |
  6. +------+------+------+------+---------------------+
  7. | 1 | a | b | NULL | AQFw+a9MhpKhoQ== |
  8. | 1 | a | c | NULL | AQFw+a9MhpKhoQ== |
  9. | 2 | b | c | NULL | AQGyB7kbWBxh+A== |
  10. | 3 | c | d | NULL | AQFYbJB5VpNBhg== |
  11. +------+------+------+------+---------------------+
  12. -- Also can use Doris's native HLL functions for statistics, and see that the results are consistent with the previous statistics in Hive
  13. select k3, hll_cardinality(hll_union(uuid)) from doris_hll_table group by k3;
  14. +------+----------------------------------+
  15. | k3 | hll_cardinality(hll_union(uuid)) |
  16. +------+----------------------------------+
  17. | b | 1 |
  18. | d | 1 |
  19. | c | 2 |
  20. +------+----------------------------------+
  21. -- At this time, querying the external table data, i.e., the data before import, can also verify the correctness of the data
  22. select k3, hll_cardinality(hll_union(hll_from_base64(uuid))) from hive.hive_test.hive_hll_table group by k3;
  23. +------+---------------------------------------------------+
  24. | k3 | hll_cardinality(hll_union(hll_from_base64(uuid))) |
  25. +------+---------------------------------------------------+
  26. | d | 1 |
  27. | b | 1 |
  28. | c | 2 |
  29. +------+---------------------------------------------------+

Method 2: Spark Load

See details: Spark Load -> Basic operation -> Creating Load (Example 3: when the upstream data source is hive binary type table)