KHyperLogLog Functions
Presto implements the KHyperLogLog algorithm and data structure. A KHyperLogLog
data structure can be created with khyperloglog_agg().
Data Structures
KHyperLogLog is a data sketch that compactly represents the association of two columns. It is implemented in Presto as a two-level data structure composed of a MinHash structure whose entries map to HyperLogLog sketches.
Serialization
KHyperLogLog sketches can be cast to and from varbinary. This allows them to be stored for later use.
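As a minimal sketch of that round trip, the query below builds a sketch, casts it to varbinary as it would be stored, and casts it back before use. The inline rows and the column names user_id and ip are made up for illustration.

```sql
-- Hypothetical inline data; user_id and ip are illustrative column names.
WITH stored AS (
    -- Serialize the sketch so it could be written to a varbinary column.
    SELECT CAST(khyperloglog_agg(user_id, ip) AS varbinary) AS khll_bytes
    FROM (VALUES
        ('alice', '10.0.0.1'),
        ('alice', '10.0.0.2'),
        ('bob',   '10.0.0.1')
    ) AS t(user_id, ip)
)
-- Cast the stored bytes back to KHyperLogLog before calling sketch functions.
SELECT cardinality(CAST(khll_bytes AS KHyperLogLog)) AS approx_distinct_users
FROM stored;
```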
Functions
khyperloglog_agg(x, y) → KHyperLogLog
Returns the KHyperLogLog sketch that represents the relationship between columns x and y. The MinHash structure summarizes x and the HyperLogLog sketches represent y values linked to x values.
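A small illustration, assuming hypothetical data in which x is a user identifier and y is an IP address. It produces one sketch per day, a common way to keep sketches mergeable later. The day, user_id, and ip names are assumptions, not part of the function.

```sql
-- Hypothetical inline data: one row per (day, user, IP address) observation.
SELECT
    day,
    -- x = user_id (summarized by the MinHash), y = ip (tracked per x by HyperLogLog).
    khyperloglog_agg(user_id, ip) AS khll
FROM (VALUES
    ('2024-01-01', 'alice', '10.0.0.1'),
    ('2024-01-01', 'bob',   '10.0.0.2'),
    ('2024-01-02', 'alice', '10.0.0.3')
) AS t(day, user_id, ip)
GROUP BY day;
```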
cardinality(khll) → bigint
Returns the cardinality of the MinHash sketch, i.e. the cardinality of x.
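For example, with made-up inline rows (column names are illustrative), the following estimates how many distinct x values the sketch has seen. The result is an approximation, so no exact output is claimed here.

```sql
-- Hypothetical inline data with three distinct user_id values.
SELECT cardinality(khyperloglog_agg(user_id, ip)) AS approx_distinct_users
FROM (VALUES
    ('alice', '10.0.0.1'),
    ('alice', '10.0.0.2'),
    ('bob',   '10.0.0.1'),
    ('carol', '10.0.0.4')
) AS t(user_id, ip);
```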
intersection_cardinality(khll1, khll2) → bigint
Returns the set intersection cardinality of the data represented by the MinHash structures of khll1 and khll2.
jaccard_index(khll1, khll2) → double
Returns the Jaccard index of the data represented by the MinHash structures of khll1 and khll2.
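The two comparison functions can be tried side by side. In this sketch the two inputs are built from hypothetical inline rows whose x values are {alice, bob} and {bob, carol}. The exact intersection size for those sets is 1 and the exact Jaccard index (intersection size divided by union size) is 1/3; the functions return estimates of these quantities.

```sql
-- Two sketches over made-up data; column names are illustrative.
WITH a AS (
    SELECT khyperloglog_agg(user_id, ip) AS khll
    FROM (VALUES ('alice', '10.0.0.1'), ('bob', '10.0.0.2')) AS t(user_id, ip)
),
b AS (
    SELECT khyperloglog_agg(user_id, ip) AS khll
    FROM (VALUES ('bob', '10.0.0.2'), ('carol', '10.0.0.3')) AS t(user_id, ip)
)
SELECT
    intersection_cardinality(a.khll, b.khll) AS approx_shared_users,
    jaccard_index(a.khll, b.khll)            AS approx_jaccard
FROM a CROSS JOIN b;
```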
uniqueness_distribution(khll) → map<bigint,double>
For a given value x', uniqueness is the number of y' values associated with it in the source dataset. It is obtained from the cardinality of the HyperLogLog sketch mapped from the MinHash bucket that corresponds to x'. This function returns a histogram that represents the uniqueness distribution, the X-axis being the uniqueness and the Y-axis being the relative frequency of x values.
uniqueness_distribution(khll, histogramSize) → map<bigint,double>
Returns the uniqueness histogram with the given number of buckets. If histogramSize is omitted, it defaults to 256. All uniqueness values greater than histogramSize are accumulated in the last bucket.
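A short illustration of both forms, using hypothetical inline rows in which alice is linked to two IP addresses and bob and carol to one each. The second call caps the histogram at 2 buckets, a value chosen only for this example.

```sql
-- Made-up data; user_id is x and ip is y.
WITH sketch AS (
    SELECT khyperloglog_agg(user_id, ip) AS khll
    FROM (VALUES
        ('alice', '10.0.0.1'),
        ('alice', '10.0.0.2'),
        ('bob',   '10.0.0.3'),
        ('carol', '10.0.0.4')
    ) AS t(user_id, ip)
)
SELECT
    uniqueness_distribution(khll)    AS histogram_default_size,  -- 256 buckets
    uniqueness_distribution(khll, 2) AS histogram_two_buckets    -- larger values fall in the last bucket
FROM sketch;
```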
reidentification_potential(khll, threshold) → double
The reidentification potential is the ratio of x values that have a uniqueness under the given threshold.
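As a hedged example, the query below asks what share of users in some made-up data are linked to few IP addresses. The threshold of 2 and the column names are arbitrary choices for illustration.

```sql
-- Made-up data: alice has two linked ip values, bob and carol have one each.
SELECT reidentification_potential(khyperloglog_agg(user_id, ip), 2) AS potential
FROM (VALUES
    ('alice', '10.0.0.1'),
    ('alice', '10.0.0.2'),
    ('bob',   '10.0.0.3'),
    ('carol', '10.0.0.4')
) AS t(user_id, ip);
```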
merge(khll) → KHyperLogLog
Returns the KHyperLogLog of the aggregate union of the individual KHyperLogLog structures.
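One plausible use is rolling per-day sketches up into a single sketch. The sketch below assumes hypothetical inline rows with day, user_id, and ip columns.

```sql
-- Build one sketch per day from made-up data, then union them with merge().
WITH daily AS (
    SELECT day, khyperloglog_agg(user_id, ip) AS khll
    FROM (VALUES
        ('2024-01-01', 'alice', '10.0.0.1'),
        ('2024-01-01', 'bob',   '10.0.0.2'),
        ('2024-01-02', 'alice', '10.0.0.3'),
        ('2024-01-02', 'carol', '10.0.0.4')
    ) AS t(day, user_id, ip)
    GROUP BY day
)
-- merge() is an aggregate: it combines the khll values of all input rows.
SELECT cardinality(merge(khll)) AS approx_distinct_users_all_days
FROM daily;
```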
merge_khll(array(khll)) → KHyperLogLog
Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.
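Unlike the aggregate form, this variant takes the sketches as a single array value. A minimal sketch, assuming two sketches built from hypothetical inline rows:

```sql
-- Two sketches over made-up data; column names are illustrative.
WITH a AS (
    SELECT khyperloglog_agg(user_id, ip) AS khll
    FROM (VALUES ('alice', '10.0.0.1'), ('bob', '10.0.0.2')) AS t(user_id, ip)
),
b AS (
    SELECT khyperloglog_agg(user_id, ip) AS khll
    FROM (VALUES ('bob', '10.0.0.2'), ('carol', '10.0.0.3')) AS t(user_id, ip)
)
-- Union the two sketches by passing them as one array.
SELECT cardinality(merge_khll(ARRAY[a.khll, b.khll])) AS approx_distinct_users
FROM a CROSS JOIN b;
```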