Window functions
A window function is an aggregate function that operates over a sliding window of rows as they are processed by a SELECT query. Window functions make it possible to do things like:
- Perform aggregations against subsets of a result-set.
- Calculate a running total.
- Rank results.
- Compare a row value to a value in the preceding (or succeeding!) row(s).
peewee comes with support for SQL window functions, which can be created by calling Function.over() and passing in your partitioning or ordering parameters.
For the following examples, we’ll use the following model and sample data:
```python
class Sample(Model):
    counter = IntegerField()
    value = FloatField()

data = [(1, 10),
        (1, 20),
        (2, 1),
        (2, 3),
        (3, 100)]
Sample.insert_many(data, fields=[Sample.counter, Sample.value]).execute()
```
Our sample table now contains:
| id | counter | value |
|----|---------|-------|
| 1  | 1       | 10.0  |
| 2  | 1       | 20.0  |
| 3  | 2       | 1.0   |
| 4  | 2       | 3.0   |
| 5  | 3       | 100.0 |
Ordered Windows
Let’s calculate a running sum of the value field. For it to be a “running” sum, the window must be ordered, so we’ll order with respect to the Sample’s id field:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.SUM(Sample.value).over(order_by=[Sample.id]).alias('total'))

for sample in query:
    print(sample.counter, sample.value, sample.total)

# 1    10.    10.
# 1    20.    30.
# 2    1.     31.
# 2    3.     34.
# 3    100.   134.
```
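The query above boils down to `SUM(value) OVER (ORDER BY id)` in SQL. As a sketch (not peewee itself), we can reproduce the running total with Python’s stdlib sqlite3 module, which supports window functions on SQLite 3.25+; the `sample` table here mirrors the model above:

```python
import sqlite3

# Recreate the sample table from above in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# A running total: SUM(value) over a window ordered by id.
totals = conn.execute(
    "SELECT counter, value, SUM(value) OVER (ORDER BY id) AS total "
    "FROM sample").fetchall()
print(totals)
# [(1, 10.0, 10.0), (1, 20.0, 30.0), (2, 1.0, 31.0), (2, 3.0, 34.0), (3, 100.0, 134.0)]
```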
For another example, we’ll calculate the difference between the current value and the previous value, when ordered by id:
```python
difference = Sample.value - fn.LAG(Sample.value, 1).over(order_by=[Sample.id])
query = Sample.select(
    Sample.counter,
    Sample.value,
    difference.alias('diff'))

for sample in query:
    print(sample.counter, sample.value, sample.diff)

# 1    10.    NULL
# 1    20.    10.    -- (20 - 10)
# 2    1.     -19.   -- (1 - 20)
# 2    3.     2.     -- (3 - 1)
# 3    100.   97.    -- (100 - 3)
```
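The same LAG computation can be sketched in raw SQL via the stdlib sqlite3 module (assuming the sample table from above); note the first row’s difference is NULL because there is no preceding row:

```python
import sqlite3

# Recreate the sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# Difference from the previous row: value - LAG(value, 1) OVER (ORDER BY id).
diffs = conn.execute(
    "SELECT counter, value, value - LAG(value, 1) OVER (ORDER BY id) AS diff "
    "FROM sample").fetchall()
print(diffs)
# [(1, 10.0, None), (1, 20.0, 10.0), (2, 1.0, -19.0), (2, 3.0, 2.0), (3, 100.0, 97.0)]
```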
Partitioned Windows
Let’s calculate the average value for each distinct “counter” value. Notice that there are three possible values for the counter field (1, 2, and 3). We can do this by calculating the AVG() of the value column over a window that is partitioned by the counter field:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.AVG(Sample.value).over(partition_by=[Sample.counter]).alias('cavg'))

for sample in query:
    print(sample.counter, sample.value, sample.cavg)

# 1    10.    15.
# 1    20.    15.
# 2    1.     2.
# 2    3.     2.
# 3    100.   100.
```
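The partitioned average corresponds to `AVG(value) OVER (PARTITION BY counter)`. Here is a sqlite3-based sketch (the trailing ORDER BY id simply makes the output order deterministic):

```python
import sqlite3

# Recreate the sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# Average value per counter-partition, attached to every row of the partition.
avgs = conn.execute(
    "SELECT counter, value, AVG(value) OVER (PARTITION BY counter) AS cavg "
    "FROM sample ORDER BY id").fetchall()
print(avgs)
# [(1, 10.0, 15.0), (1, 20.0, 15.0), (2, 1.0, 2.0), (2, 3.0, 2.0), (3, 100.0, 100.0)]
```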
We can use ordering within partitions by specifying both the order_by and partition_by parameters. For example, let’s rank the samples by value within each distinct counter group:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.RANK().over(
        order_by=[Sample.value],
        partition_by=[Sample.counter]).alias('rank'))

for sample in query:
    print(sample.counter, sample.value, sample.rank)

# 1    10.    1
# 1    20.    2
# 2    1.     1
# 2    3.     2
# 3    100.   1
```
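The per-partition ranking corresponds to `RANK() OVER (PARTITION BY counter ORDER BY value)`. A sqlite3-based sketch:

```python
import sqlite3

# Recreate the sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# Rank rows by value, restarting the ranking within each counter-partition.
ranks = conn.execute(
    "SELECT counter, value, "
    "RANK() OVER (PARTITION BY counter ORDER BY value) AS rnk "
    "FROM sample ORDER BY id").fetchall()
print(ranks)
# [(1, 10.0, 1), (1, 20.0, 2), (2, 1.0, 1), (2, 3.0, 2), (3, 100.0, 1)]
```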
Bounded windows
By default, window functions are evaluated using an unbounded preceding start for the window, and the current row as the end. We can change the bounds of the window our aggregate functions operate on by specifying a start and/or end in the call to Function.over(). Additionally, peewee comes with helper methods on the Window object for generating the appropriate boundary references:
- Window.CURRENT_ROW - attribute that references the current row.
- Window.preceding() - specify the number of preceding rows, or omit the number to indicate all preceding rows.
- Window.following() - specify the number of following rows, or omit the number to indicate all following rows.
To examine how boundaries work, we’ll calculate a running total of the value column, ordered with respect to id, but we’ll only look at the running total of the current row and its two preceding rows:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.SUM(Sample.value).over(
        order_by=[Sample.id],
        start=Window.preceding(2),
        end=Window.CURRENT_ROW).alias('rsum'))

for sample in query:
    print(sample.counter, sample.value, sample.rsum)

# 1    10.    10.
# 1    20.    30.    -- (20 + 10)
# 2    1.     31.    -- (1 + 20 + 10)
# 2    3.     24.    -- (3 + 1 + 20)
# 3    100.   104.   -- (100 + 3 + 1)
```
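The bounded window corresponds to a `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` frame in SQL (peewee defaults to ROWS when start/end boundaries are given). A sqlite3-based sketch:

```python
import sqlite3

# Recreate the sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# Running total limited to the current row plus its two preceding rows.
bounded = conn.execute(
    "SELECT counter, value, "
    "SUM(value) OVER (ORDER BY id "
    "                 ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rsum "
    "FROM sample").fetchall()
print(bounded)
# [(1, 10.0, 10.0), (1, 20.0, 30.0), (2, 1.0, 31.0), (2, 3.0, 24.0), (3, 100.0, 104.0)]
```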
Note
Technically we did not need to specify end=Window.CURRENT_ROW, since that is the default; it was included in the example for demonstration.
Let’s look at another example. In this example we will calculate the “opposite” of a running total, in which the total sum of all values is decreased by the value of each sample, ordered by id. To accomplish this, we’ll calculate the sum from the current row to the last row:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.SUM(Sample.value).over(
        order_by=[Sample.id],
        start=Window.CURRENT_ROW,
        end=Window.following()).alias('rsum'))

for sample in query:
    print(sample.counter, sample.value, sample.rsum)

# 1    10.    134.   -- (10 + 20 + 1 + 3 + 100)
# 1    20.    124.   -- (20 + 1 + 3 + 100)
# 2    1.     104.   -- (1 + 3 + 100)
# 2    3.     103.   -- (3 + 100)
# 3    100.   100.   -- (100)
```
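This “reverse” running total corresponds to a `ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING` frame. A sqlite3-based sketch:

```python
import sqlite3

# Recreate the sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# Sum from the current row through the last row of the window.
rev = conn.execute(
    "SELECT counter, value, "
    "SUM(value) OVER (ORDER BY id "
    "                 ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS rsum "
    "FROM sample").fetchall()
print(rev)
# [(1, 10.0, 134.0), (1, 20.0, 124.0), (2, 1.0, 104.0), (2, 3.0, 103.0), (3, 100.0, 100.0)]
```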
Filtered Aggregates
Aggregate functions may also support filter expressions (Postgres and SQLite 3.25+), which get translated into a FILTER (WHERE...) clause. Filter expressions are added to an aggregate function with the Function.filter() method.

For an example, we will calculate the running sum of the value field with respect to id, but we will filter out any samples whose counter=2:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.SUM(Sample.value).filter(Sample.counter != 2).over(
        order_by=[Sample.id]).alias('csum'))

for sample in query:
    print(sample.counter, sample.value, sample.csum)

# 1    10.    10.
# 1    20.    30.
# 2    1.     30.
# 2    3.     30.
# 3    100.   130.
```
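The filtered aggregate corresponds to `SUM(value) FILTER (WHERE counter != 2) OVER (ORDER BY id)`. A sqlite3-based sketch; note that in practice SQLite only accepts the FILTER clause on window functions in fairly recent versions (3.30+):

```python
import sqlite3

# Recreate the sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# Running total that skips rows where counter = 2; those rows still appear
# in the output, but do not contribute to the sum.
filtered = conn.execute(
    "SELECT counter, value, "
    "SUM(value) FILTER (WHERE counter != 2) OVER (ORDER BY id) AS csum "
    "FROM sample").fetchall()
print(filtered)
# [(1, 10.0, 10.0), (1, 20.0, 30.0), (2, 1.0, 30.0), (2, 3.0, 30.0), (3, 100.0, 130.0)]
```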
Note
The call to filter() must precede the call to over().
Reusing Window Definitions
If you intend to use the same window definition for multiple aggregates, you can create a Window object. The Window object takes the same parameters as Function.over(), and can be passed to the over() method in place of the individual parameters.
Here we’ll declare a single window, ordered with respect to the sample id, and call several window functions using that window definition:
```python
win = Window(order_by=[Sample.id])
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.LEAD(Sample.value).over(win),
    fn.LAG(Sample.value).over(win),
    fn.SUM(Sample.value).over(win)
).window(win)  # Include our window definition in query.

for row in query.tuples():
    print(row)

# counter  value  lead()  lag()  sum()
# 1        10.    20.     NULL   10.
# 1        20.    1.      10.    30.
# 2        1.     3.      20.    31.
# 2        3.     100.    1.     34.
# 3        100.   NULL    3.     134.
```
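A shared window definition translates to a SQL WINDOW clause, with each function written as `... OVER w`. A sqlite3-based sketch:

```python
import sqlite3

# Recreate the sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100);
""")

# Define the window once in a WINDOW clause, then reference it three times.
rows_w = conn.execute(
    "SELECT counter, value, "
    "       LEAD(value) OVER w, LAG(value) OVER w, SUM(value) OVER w "
    "FROM sample "
    "WINDOW w AS (ORDER BY id) "
    "ORDER BY id").fetchall()
print(rows_w)
# [(1, 10.0, 20.0, None, 10.0), (1, 20.0, 1.0, 10.0, 30.0),
#  (2, 1.0, 3.0, 20.0, 31.0), (2, 3.0, 100.0, 1.0, 34.0),
#  (3, 100.0, None, 3.0, 134.0)]
```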
Multiple window definitions
In the previous example, we saw how to declare a Window definition and re-use it for multiple aggregations. You can include as many window definitions as you need in your queries, but it is necessary to ensure each window has a unique alias:
```python
w1 = Window(order_by=[Sample.id]).alias('w1')
w2 = Window(partition_by=[Sample.counter]).alias('w2')

query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.SUM(Sample.value).over(w1).alias('rsum'),  # Running total.
    fn.AVG(Sample.value).over(w2).alias('cavg')   # Avg per category.
).window(w1, w2)  # Include our window definitions.

for sample in query:
    print(sample.counter, sample.value, sample.rsum, sample.cavg)

# counter  value  rsum   cavg
# 1        10.    10.    15.
# 1        20.    30.    15.
# 2        1.     31.    2.
# 2        3.     34.    2.
# 3        100.   134.   100.
```
Similarly, if you have multiple window definitions that share similar definitions, it is possible to extend a previously-defined window definition. For example, here we will be partitioning the data-set by the counter value, so we’ll be doing our aggregations with respect to the counter. Then we’ll define a second window that extends this partitioning, and adds an ordering clause:
```python
w1 = Window(partition_by=[Sample.counter]).alias('w1')

# By extending w1, this window definition will also be partitioned
# by "counter".
w2 = Window(extends=w1, order_by=[Sample.value.desc()]).alias('w2')

query = (Sample
         .select(Sample.counter, Sample.value,
                 fn.SUM(Sample.value).over(w1).alias('group_sum'),
                 fn.RANK().over(w2).alias('revrank'))
         .window(w1, w2)
         .order_by(Sample.id))

for sample in query:
    print(sample.counter, sample.value, sample.group_sum, sample.revrank)

# counter  value  group_sum  revrank
# 1        10.    30.        2
# 1        20.    30.        1
# 2        1.     4.         2
# 2        3.     4.         1
# 3        100.   100.       1
```
Frame types: RANGE vs ROWS vs GROUPS
Depending on the frame type, the database will process ordered groups differently. Let’s create two additional Sample rows to visualize the difference:
```python
>>> Sample.create(counter=1, value=20.)
<Sample 6>
>>> Sample.create(counter=2, value=1.)
<Sample 7>
```
Our table now contains:
| id | counter | value |
|----|---------|-------|
| 1  | 1       | 10.0  |
| 2  | 1       | 20.0  |
| 3  | 2       | 1.0   |
| 4  | 2       | 3.0   |
| 5  | 3       | 100.0 |
| 6  | 1       | 20.0  |
| 7  | 2       | 1.0   |
Let’s examine the difference by calculating a “running sum” of the samples, ordered with respect to the counter and value fields. To specify the frame type, we can use either:

- Window.RANGE
- Window.ROWS
- Window.GROUPS

The behavior of RANGE, when there are logical duplicates, may lead to unexpected results:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.SUM(Sample.value).over(
        order_by=[Sample.counter, Sample.value],
        frame_type=Window.RANGE).alias('rsum'))

for sample in query.order_by(Sample.counter, Sample.value):
    print(sample.counter, sample.value, sample.rsum)

# counter  value  rsum
# 1        10.    10.
# 1        20.    50.
# 1        20.    50.
# 2        1.     52.
# 2        1.     52.
# 2        3.     55.
# 3        100.   155.
```
With the inclusion of the new rows we now have some rows that have duplicate counter and value values. The RANGE frame type causes these duplicates to be evaluated together rather than separately.
The more expected result can be achieved by using ROWS as the frame type:
```python
query = Sample.select(
    Sample.counter,
    Sample.value,
    fn.SUM(Sample.value).over(
        order_by=[Sample.counter, Sample.value],
        frame_type=Window.ROWS).alias('rsum'))

for sample in query.order_by(Sample.counter, Sample.value):
    print(sample.counter, sample.value, sample.rsum)

# counter  value  rsum
# 1        10.    10.
# 1        20.    30.
# 1        20.    50.
# 2        1.     51.
# 2        1.     52.
# 2        3.     55.
# 3        100.   155.
```
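The RANGE/ROWS distinction can be sketched directly in SQL via the stdlib sqlite3 module (using the seven-row table from above). RANGE treats all peer rows, i.e. rows with equal ordering values, as one unit, so duplicates get identical running sums; ROWS advances one physical row at a time. Since the order of duplicate rows is not guaranteed, the test on the ROWS column checks the multiset of sums rather than a specific row order:

```python
import sqlite3

# Recreate the seven-row sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100), (1, 20), (2, 1);
""")

# Compute both frame types side by side for comparison.
frame_rows = conn.execute(
    "SELECT counter, value, "
    "SUM(value) OVER (ORDER BY counter, value "
    "    RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS range_sum, "
    "SUM(value) OVER (ORDER BY counter, value "
    "    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_sum "
    "FROM sample ORDER BY counter, value").fetchall()

# RANGE gives duplicates the same sum; ROWS gives each row its own.
print([r[2] for r in frame_rows])  # range_sum per row
print([r[3] for r in frame_rows])  # rows_sum per row
```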
Peewee uses these rules for determining what frame-type to use:

- If the user specifies a frame_type, that frame type will be used.
- If start and/or end boundaries are specified, Peewee will default to using ROWS.
- If the user did not specify a frame type or start/end boundaries, Peewee will use the database default, which is RANGE.
The Window.GROUPS frame type looks at the window range specification in terms of groups of rows, based on the ordering term(s). Using GROUPS, we can define the frame so it covers distinct groupings of rows. Let’s look at an example:
```python
query = (Sample
         .select(Sample.counter, Sample.value,
                 fn.SUM(Sample.value).over(
                     order_by=[Sample.counter, Sample.value],
                     frame_type=Window.GROUPS,
                     start=Window.preceding(1)).alias('gsum'))
         .order_by(Sample.counter, Sample.value))

for sample in query:
    print(sample.counter, sample.value, sample.gsum)

# counter  value  gsum
# 1        10     10
# 1        20     50
# 1        20     50     (10) + (20+20)
# 2        1      42
# 2        1      42     (20+20) + (1+1)
# 2        3      5      (1+1) + 3
# 3        100    103    (3) + 100
```
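The GROUPS frame corresponds to `GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW` in SQL. A sqlite3-based sketch against the seven-row table (note GROUPS frames require a reasonably recent SQLite, 3.28+):

```python
import sqlite3

# Recreate the seven-row sample table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample (id INTEGER PRIMARY KEY, counter INTEGER, value REAL);
    INSERT INTO sample (counter, value)
    VALUES (1, 10), (1, 20), (2, 1), (2, 3), (3, 100), (1, 20), (2, 1);
""")

# Sum over the current peer-group plus the one preceding it, where a group
# is a set of rows sharing the same (counter, value) ordering values.
group_rows = conn.execute(
    "SELECT counter, value, "
    "SUM(value) OVER (ORDER BY counter, value "
    "    GROUPS BETWEEN 1 PRECEDING AND CURRENT ROW) AS gsum "
    "FROM sample ORDER BY counter, value").fetchall()
print([r[2] for r in group_rows])
# [10.0, 50.0, 50.0, 42.0, 42.0, 5.0, 103.0]
```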
As you can hopefully infer, the window is grouped by its ordering term, which is (counter, value). We are looking at a window that extends between the previous group and the current group.
Note
For information about the window function APIs, see Function.over() and Window.
For general information on window functions, read the postgres window functions tutorial. Additionally, the postgres docs and the sqlite docs contain a lot of good information.