Advanced percentile aggregation
While the simple Percentile_agg()
interface will be sufficient for many users, we do provide more specific APIs for advanced users who want more control of how their percentile approximation is computed and how much space the intermediate representation uses. We currently provide implementations of the following percentile approximation algorithms:
- T-Digest – This algorithm buckets data more aggressively toward the center of the quantile range, giving it greater accuracy near the tails (i.e. 0.001 or 0.995).
- UddSketch – T his algorithm uses exponentially sized buckets to guarantee the approximation falls within a known error range, relative to the true discrete percentile.
The UddSketch
algorithm underlies the percentile_agg()
interface, it offers tunability for the size and maximum error target of the sketch, while percentile_agg
uses preset defaults.
Choosing the right algorithm for your use case
There are different tradeoffs that each algorithm makes, and different use cases where each will shine. The doc pages above each link to the research papers fully detailing the algorithms if you want all the details. However, at a higher level, here are some of the differences to consider when choosing an algorithm:
- First off, it’s interesting to note that the formal definition for a percentile is actually imprecise, and there are different methods for determining what the true percentile actually is. In Postgres, given a target percentile ‘p’,
percentile_disc
will return the smallest element of a set such that ‘p’ percent of the set is less than that element, whilepercentile_cont
will return an interpolated value between the two nearest matches for ‘p’. The difference here isn’t usually that interesting in practice, but if it matters to your use case, then keep in mind that TDigest will approximate the continuous percentile while UddSketch provides an estimate of the discrete value. - It’s also important to consider the types of percentiles you’re most interested in. In particular, TDigest is optimized to trade off more accurate estimates at the extremes with weaker estimates near the median. If your work flow involves estimating 99th percentiles, this is probably a good trade off. However if you’re more concerned about getting highly accurate median estimates, UddSketch is probably a better fit.
- UddSketch has a stable bucketing function, so it will always return the same quantile estimate for the same underlying data, regardless of how it is ordered or re-aggregated. TDigest, on the other hand, builds up incremental buckets based on the average of nearby points, which will result in (usually subtle) differences in estimates based on the same data, unless the order and batching of the aggregation is strictly controlled (which can be difficult to do in Postgres). Therefore, if having stable estimates is important to you, UddSketch will likely be required.
- Trying to calculate precise error bars for TDigest can be difficult, especially when merging multiple sub-digests into a larger one (this can come about either through summary aggregation or just parallelization of the normal point aggregate). If being able to tightly characterize your error is important, UddSketch will likely be the desired algorithm.
- That being said, the fact that UddSketch uses exponential bucketing to provide a guaranteed relative error can cause some wildly varying absolute errors if the data set covers a large range. For instance if the data is evenly distributed over the range [1,100], estimates at the high end of the percentile range would have about 100 times the absolute error of those at the low end of the range. This gets much more extreme if the data range is [0,100]. If having a stable absolute error is important to your use case, consider TDigest.
- While both implementation will likely get smaller and/or faster with future optimizations, in general UddSketch will end up with a smaller memory footprint than TDigest, and a correspondingly smaller disk footprint for any continuous aggregates. This is one of the main reasons that the default
percentile_agg
uses UddSketch, and is a pretty good reason to prefer that algorithm if your use case doesn’t clearly benefit from TDigest. Regardless of the algorithm, the best way to improve the accuracy of your percentile estimates is to increase the number of buckets, and UddSketch gives you more leeway to do so.