Aggregation

A key feature of analytics is reducing many values down to some summary. This act is called “aggregation” and always includes a function — for example, average or sum — that reduces values in the table to a single row.

aggregate transform

The aggregate transform takes a tuple to create one or more new columns that “distill down” data from all the rows.

  1. from invoices
  2. aggregate { sum_of_orders = sum total }

The query above computes the sum of the total column of all rows of the invoices table to produce a single value.

aggregate can produce multiple summaries at once when one or more aggregation expressions are contained in a tuple. aggregate discards all columns that are not present in the tuple.

  1. from invoices
  2. aggregate {
  3. num_orders = count this,
  4. sum_of_orders = sum total,
  5. }

In the example above, the result is a single row with two columns. The count function displays the number of rows in the table that was passed in; the sum function adds up the values of the total column of all rows.

Grouping

Suppose we want to produce summaries of invoices for each city in the table. We could create a query for each city, and aggregate its rows:

  1. from albums
  2. filter billing_city == "Oslo"
  3. aggregate { sum_of_orders = sum total }

But we would need to do it for each city: London, Frankfurt, etc. Of course this is repetitive (and boring) and error prone (because we would need to type each billing_city by hand). Moreover, we would need to create a list of each billing_city before we started.

group transform

The group transform separates the table into groups (say, those having the same city) using information that’s already in the table. It then applies a transform to each group, and combines the results back together:

  1. from invoices
  2. group billing_city (
  3. aggregate {
  4. num_orders = count this,
  5. sum_of_orders = sum total,
  6. }
  7. )

Those familiar with SQL have probably noticed that we just decoupled aggregation from grouping.

Although these operations are connected in SQL, PRQL makes it straightforward to use group and aggregate separate from each other, while combining with other transform functions, such as:

  1. from invoices
  2. group billing_city (
  3. take 2
  4. )

This code collects the first two rows for each city’s group.