6.21. HyperLogLog Functions
Presto implements the approx_distinct()
function using theHyperLogLog data structure.
Data Structures
Presto implements HyperLogLog data sketches as a set of 32-bit buckets whichstore a maximum hash. They can be stored sparsely (as a map from bucket IDto bucket), or densely (as a contiguous memory block). The HyperLogLog datastructure starts as the sparse representation, switching to dense when it ismore efficient. The P4HyperLogLog structure is initialized densely andremains dense for its lifetime.
HyperLogLog implicitly casts to P4HyperLogLog,while one can explicitly cast HyperLogLog
to P4HyperLogLog
:
- cast(hll AS P4HyperLogLog)
Serialization
Data sketches can be serialized to and deserialized from varbinary
. Thisallows them to be stored for later use. Combined with the ability to mergemultiple sketches, this allows one to calculate approx_distinct()
of theelements of a partition of a query, then for the entirety of a query with verylittle cost.
For example, calculating the HyperLogLog
for daily unique users will allowweekly or monthly unique users to be calculated incrementally by combining thedailies. This is similar to computing weekly revenue by summing daily revenue.Uses of approx_distinct()
with GROUPING SETS
can be converted to useHyperLogLog
. Examples:
- CREATE TABLE visit_summaries (
- visit_date date,
- hll varbinary
- );
- INSERT INTO visit_summaries
- SELECT visit_date, cast(approx_set(user_id) AS varbinary)
- FROM user_visits
- GROUP BY visit_date;
- SELECT cardinality(merge(cast(hll AS HyperLogLog))) AS weekly_unique_users
- FROM visit_summaries
- WHERE visit_date >= current_date - interval '7' day;
Functions
approxset
(_x) → HyperLogLogReturns the
HyperLogLog
sketch of the input data set ofx
.The value of the maximum standard error is defaulted to0.01625
.This data sketch underliesapprox_distinct()
and can be stored andused later by callingcardinality()
.approxset
(_x, e) → HyperLogLogReturns the
HyperLogLog
sketch of the input data set ofx
, witha maximum standard error ofe
. The current implementation of thisfunction requires thate
be in the range of[0.0040625, 0.26000]
.This data sketch underliesapprox_distinct()
and can be stored andused later by callingcardinality()
.cardinality
(hll) → bigintThis will perform
approx_distinct()
on the data summarized by thehll
HyperLogLog data sketch.empty_approx_set
() → HyperLogLogReturns an empty
HyperLogLog
.The value of the maximum standard error is defaulted to0.01625
.emptyapprox_set
(_e) → HyperLogLogReturns an empty
HyperLogLog
with a maximum standard error ofe
.The current implementation of this function requires thate
be inthe range of[0.0040625, 0.26000]
.merge
(HyperLogLog) → HyperLogLogReturns the
HyperLogLog
of the aggregate union of the individualhll
HyperLogLog structures.mergehll
(_array(HyperLogLog)) → HyperLogLog- Returns the
HyperLogLog
of the union of an arrayhll
HyperLogLogstructures.