7.22. KHyperLogLog Functions
Presto implements the KHyperLogLog algorithm and data structure. A KHyperLogLog
data structure can be created through khyperloglog_agg().
Data Structures
KHyperLogLog is a data sketch that compactly represents the association of two columns. It is implemented in Presto as a two-level data structure composed of a MinHash structure whose entries each map to a HyperLogLog sketch.
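The two-level layout can be illustrated with a toy model. The sketch below is not Presto's implementation: it assumes a bottom-k style MinHash (only the k smallest hashes of x are retained) and uses a plain Python set as a stand-in for each HyperLogLog. The names hash64 and build_toy_khll, and the choice of SHA-256 as the hash, are hypothetical.

```python
import hashlib


def hash64(value):
    """Deterministic 64-bit hash of a value's string form (illustrative choice)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_toy_khll(pairs, k):
    """Keep the k smallest x-hashes; map each one to the y values seen with that x.

    A Python set stands in for the per-entry HyperLogLog, so the second
    level is exact here rather than approximate.
    """
    sketch = {}
    for x, y in pairs:
        hx = hash64(x)
        if hx in sketch:
            # x is already retained: record one more associated y value.
            sketch[hx].add(y)
        elif len(sketch) < k:
            # Still room in the first level: start tracking this x.
            sketch[hx] = {y}
        else:
            largest = max(sketch)
            if hx < largest:
                # New hash is among the k smallest: evict the largest entry.
                del sketch[largest]
                sketch[hx] = {y}
    return sketch
```

With k = 3 and five distinct x values, only three first-level entries survive, and each surviving entry still holds every y value that was paired with its x.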
Serialization
KHyperLogLog sketches can be cast to and from varbinary. This allows them to be stored for later use.
Functions
khyperloglog_agg(x, y) → KHyperLogLog
Returns the KHyperLogLog sketch that represents the relationship between columns x and y. The MinHash structure summarizes x, and the HyperLogLog sketches represent the y values linked to the x values.
cardinality(khll) → bigint
Returns the cardinality of the MinHash sketch, i.e. the cardinality of x.
intersection_cardinality(khll1, khll2) → bigint
Returns the set intersection cardinality of the data represented by the MinHash structures of khll1 and khll2.
jaccard_index(khll1, khll2) → double
Returns the Jaccard index of the data represented by the MinHash structures of khll1 and khll2.
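For intuition, the two set measures above can be computed exactly on small inputs. This is a minimal sketch of the quantities that the sketch-based functions estimate, not of how Presto computes them. The function names are hypothetical, and returning 0.0 for two empty inputs is a convention chosen here.

```python
def intersection_cardinality_exact(xs1, xs2):
    """Number of distinct values present in both collections."""
    return len(set(xs1) & set(xs2))


def jaccard_index_exact(xs1, xs2):
    """Size of the intersection divided by the size of the union."""
    a, b = set(xs1), set(xs2)
    union = a | b
    if not union:
        # Convention for this example only: two empty sets give 0.0.
        return 0.0
    return len(a & b) / len(union)
```

For the inputs [1, 2, 3, 4] and [3, 4, 5, 6], the intersection has 2 elements and the union has 6, so the Jaccard index is 2/6.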
uniqueness_distribution(khll) → map<bigint,double>
For a given value x', uniqueness is the number of y' values associated with it in the source dataset. It is obtained from the cardinality of the HyperLogLog that the MinHash bucket corresponding to x' maps to. This function returns a histogram that represents the uniqueness distribution, with the uniqueness on the X-axis and the relative frequency of x values on the Y-axis.
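The definition of uniqueness can be made concrete with an exact computation over (x, y) pairs. The sketch below is an illustration only: it uses exact sets where the real function relies on HyperLogLog cardinality estimates over the x values retained by the MinHash. The function name is hypothetical.

```python
from collections import Counter, defaultdict


def uniqueness_distribution_exact(pairs):
    """Map each uniqueness value to the fraction of x values that have it."""
    ys_by_x = defaultdict(set)
    for x, y in pairs:
        ys_by_x[x].add(y)
    # Uniqueness of an x value is the number of distinct y values seen with it.
    counts = Counter(len(ys) for ys in ys_by_x.values())
    total = len(ys_by_x)
    return {uniqueness: n / total for uniqueness, n in sorted(counts.items())}
```

If four x values have uniqueness 1, 2, 1 and 3, the result maps 1 to 0.5, 2 to 0.25 and 3 to 0.25.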
uniqueness_distribution(khll, histogramSize) → map<bigint,double>
Returns the uniqueness histogram with the given number of buckets. If histogramSize is omitted, it defaults to 256. All uniqueness values greater than histogramSize are accumulated in the last bucket.
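The effect of histogramSize can be sketched as a capping step applied to an uncapped distribution. The helper below is hypothetical and only mirrors the rule stated above. Whether the real function also emits buckets with zero frequency is not specified here, so this example omits them.

```python
def cap_histogram(distribution, histogram_size):
    """Fold every uniqueness value above histogram_size into the last bucket."""
    capped = {}
    for uniqueness, frequency in distribution.items():
        bucket = min(uniqueness, histogram_size)
        capped[bucket] = capped.get(bucket, 0.0) + frequency
    return capped
```

With a histogram size of 3, the frequencies for uniqueness 3 and uniqueness 7 are summed into bucket 3, and buckets 1 and 2 are unchanged.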
reidentification_potential(khll, threshold) → double
The reidentification potential is the fraction of x values whose uniqueness is under the given threshold.
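As a worked illustration, the same ratio can be computed exactly from (x, y) pairs. This is a sketch under stated assumptions rather than Presto's implementation: the function name is hypothetical, exact sets replace the HyperLogLog estimates, and "under" is read as a strict comparison, which may differ from how the real function treats values equal to the threshold.

```python
from collections import defaultdict


def reidentification_potential_exact(pairs, threshold):
    """Fraction of x values associated with fewer than `threshold` distinct y values."""
    ys_by_x = defaultdict(set)
    for x, y in pairs:
        ys_by_x[x].add(y)
    if not ys_by_x:
        # Convention for this example only: no x values gives 0.0.
        return 0.0
    # Assumption: strict comparison; boundary handling in Presto may differ.
    below = sum(1 for ys in ys_by_x.values() if len(ys) < threshold)
    return below / len(ys_by_x)
```

If four x values have uniqueness 1, 2, 1 and 3, a threshold of 2 selects the two values with uniqueness 1, giving 0.5.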
merge(khll) → KHyperLogLog
Returns the KHyperLogLog of the aggregate union of the individual KHyperLogLog structures.
merge_khll(array(khll)) → KHyperLogLog
Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.
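The union performed by both merge functions can be pictured on an exact analogue, where each structure is a dictionary from an x value to the set of y values seen with it. The sketch below is hypothetical and omits the parts specific to real sketches: a real merge would also trim the MinHash level back to its size limit and merge HyperLogLog registers instead of sets.

```python
def merge_exact(mappings):
    """Union several x -> set-of-y mappings without modifying the inputs."""
    merged = {}
    for mapping in mappings:
        for x, ys in mapping.items():
            # Create a fresh set per x so the input sets are never mutated.
            merged.setdefault(x, set()).update(ys)
    return merged
```

Merging {"a": {1}, "b": {2}} with {"a": {3}, "c": {4}} yields three x values, with "a" now associated with both 1 and 3.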