7.22. KHyperLogLog Functions

Presto implements the KHyperLogLog algorithm and data structure. A KHyperLogLog data structure can be created through khyperloglog_agg().

Data Structures

KHyperLogLog is a data sketch that compactly represents the association of two columns. It is implemented in Presto as a two-level data structure composed of a MinHash structure whose entries map to HyperLogLog sketches.
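
As a rough illustration of the two-level layout, the following Python sketch builds a toy version of the structure. It is not Presto's implementation: the names h64 and build_khll and the parameter k are hypothetical, the MinHash level is assumed to keep the k smallest hashes of x, and an exact Python set stands in for each HyperLogLog.

```python
import hashlib


def h64(value):
    """Hash a value to a 64-bit integer (a stand-in for the sketch's hash function)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_khll(pairs, k=4):
    """Toy two-level sketch over (x, y) pairs.

    Level 1 (MinHash, assumed K-minimum-values style): the k smallest x hashes.
    Level 2: for each retained x hash, the set of y values seen with that x.
    A real sketch would use a HyperLogLog here instead of an exact set.
    """
    sketch = {}
    for x, y in pairs:
        hx = h64(x)
        if hx in sketch:
            sketch[hx].add(y)
        elif len(sketch) < k:
            sketch[hx] = {y}
        elif hx < max(sketch):
            # Evict the largest retained hash to make room for a smaller one.
            del sketch[max(sketch)]
            sketch[hx] = {y}
    return sketch


# Hypothetical input: (user, item) pairs.
pairs = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "c"), ("u1", "a")]
sketch = build_khll(pairs, k=4)
print(len(sketch))                 # 3 distinct x values retained
print(sorted(sketch[h64("u1")]))   # ['a', 'b']
```

With k smaller than the number of distinct x values, only a sample of x values is retained, which is what keeps the structure compact.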

Serialization

KHyperLogLog sketches can be cast to and from varbinary. This allows them to be stored for later use.

Functions

  • khyperloglog_agg(x, y) → KHyperLogLog
  • Returns the KHyperLogLog sketch that represents the relationship between columns x and y. The MinHash structure summarizes x and the HyperLogLog sketches represent y values linked to x values.
  • cardinality(khll) → bigint
  • Returns the cardinality of the MinHash sketch, i.e. x’s cardinality.
  • intersection_cardinality(khll1, khll2) → bigint
  • Returns the set intersection cardinality of the data represented by the MinHash structures of khll1 and khll2.
  • jaccard_index(khll1, khll2) → double
  • Returns the Jaccard index of the data represented by the MinHash structures of khll1 and khll2.
  • uniqueness_distribution(khll) → map<bigint,double>
  • For a given value x', uniqueness is the number of y' values associated with it in the source dataset. It is obtained from the cardinality of the HyperLogLog that is mapped from the MinHash bucket corresponding to x'. This function returns a histogram that represents the uniqueness distribution, the X-axis being the uniqueness and the Y-axis being the relative frequency of x values.
  • uniqueness_distribution(khll, histogramSize) → map<bigint,double>
  • Returns the uniqueness histogram with the given number of buckets. If histogramSize is omitted, it defaults to 256. All uniqueness values greater than histogramSize are accumulated in the last bucket.
  • reidentification_potential(khll, threshold) → double
  • The reidentification potential is the ratio of x values that have a uniqueness under the given threshold.
  • merge(khll) → KHyperLogLog
  • Returns the KHyperLogLog of the aggregate union of the individual KHyperLogLog structures.
  • merge_khll(array(khll)) → KHyperLogLog
  • Returns the KHyperLogLog of the union of an array of KHyperLogLog structures.
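
The uniqueness histogram, the reidentification potential and the union can be sketched in Python on a toy model of the structure. This is only an illustration under stated assumptions, not Presto's implementation: a sketch is modelled as a dict from x hash to an exact set of y values (in place of a HyperLogLog), the function and parameter names merely mirror the ones above, "under the threshold" is read as strictly less than, only non-empty histogram buckets are emitted, and the union is assumed to keep the k smallest x hashes.

```python
from collections import Counter


def uniqueness_distribution(sketch, histogram_size=256):
    """Map uniqueness -> share of x values with that uniqueness.

    Uniqueness values above histogram_size are counted in the last bucket.
    """
    counts = Counter(min(len(ys), histogram_size) for ys in sketch.values())
    total = len(sketch)
    return {bucket: n / total for bucket, n in sorted(counts.items())}


def reidentification_potential(sketch, threshold):
    """Share of x values whose uniqueness is under the threshold.

    Assumption: 'under' is taken as strictly less than.
    """
    return sum(1 for ys in sketch.values() if len(ys) < threshold) / len(sketch)


def merge(a, b, k):
    """Union of two toy sketches: union the y sets per x hash,
    then keep the k smallest x hashes (assumed MinHash behaviour)."""
    merged = {hx: set(ys) for hx, ys in a.items()}
    for hx, ys in b.items():
        merged.setdefault(hx, set()).update(ys)
    return {hx: merged[hx] for hx in sorted(merged)[:k]}


# Hypothetical toy sketch: four x hashes with 1, 2, 4 and 1 associated y values.
sketch = {11: {"a"}, 22: {"a", "b"}, 33: {"a", "b", "c", "d"}, 44: {"z"}}

print(uniqueness_distribution(sketch, 3))     # {1: 0.5, 2: 0.25, 3: 0.25}
print(reidentification_potential(sketch, 2))  # 0.5

merged = merge({11: {"a"}, 33: {"c"}}, {11: {"b"}, 22: {"d"}}, k=2)
print(merged == {11: {"a", "b"}, 22: {"d"}})  # True
```

In the histogram call, the x value with four associated y values falls into the last bucket (3) because its uniqueness exceeds the requested histogram size.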