Top N objects per group
These examples describe several ways to query the top N items per group reasonably efficiently. For a thorough discussion of various techniques, check out my blog post Querying the top N objects per group with Peewee ORM.
In these examples we will use the User and Tweet models to find each user and their three most-recent tweets.
Postgres lateral joins
Lateral joins are a neat Postgres feature that allow reasonably efficient correlated subqueries. They are often described as SQL for each
loops.
The desired SQL is:
- SELECT * FROM
- (SELECT id, username FROM user) AS uq
- LEFT JOIN LATERAL
- (SELECT message, created_date
- FROM tweet
- WHERE (user_id = uq.id)
- ORDER BY created_date DESC LIMIT 3)
- AS pq ON true
To accomplish this with peewee is quite straightforward:
- subq = (Tweet
- .select(Tweet.message, Tweet.created_date)
- .where(Tweet.user == User.id)
- .order_by(Tweet.created_date.desc())
- .limit(3))
- query = (User
- .select(User, subq.c.content, subq.c.created_date)
- .join(subq, JOIN.LEFT_LATERAL)
- .order_by(User.username, subq.c.created_date.desc()))
- # We queried from the "perspective" of user, so the rows are User instances
- # with the addition of a "content" and "created_date" attribute for each of
- # the (up-to) 3 most-recent tweets for each user.
- for row in query:
- print(row.username, row.content, row.created_date)
To implement an equivalent query from the “perspective” of the Tweet model, wecan instead write:
- # subq is the same as the above example.
- subq = (Tweet
- .select(Tweet.message, Tweet.created_date)
- .where(Tweet.user == User.id)
- .order_by(Tweet.created_date.desc())
- .limit(3))
- query = (Tweet
- .select(User.username, subq.c.content, subq.c.created_date)
- .from_(User)
- .join(subq, JOIN.LEFT_LATERAL)
- .order_by(User.username, subq.c.created_date.desc()))
- # Each row is a "tweet" instance with an additional "username" attribute.
- # This will print the (up-to) 3 most-recent tweets from each user.
- for tweet in query:
- print(tweet.username, tweet.content, tweet.created_date)
Window functions
Window functions, which are supported by peewee, provide scalable, efficient performance.
The desired SQL is:
- SELECT subq.message, subq.username
- FROM (
- SELECT
- t2.message,
- t3.username,
- RANK() OVER (
- PARTITION BY t2.user_id
- ORDER BY t2.created_date DESC
- ) AS rnk
- FROM tweet AS t2
- INNER JOIN user AS t3 ON (t2.user_id = t3.id)
- ) AS subq
- WHERE (subq.rnk <= 3)
To accomplish this with peewee, we will wrap the ranked Tweets in an outer query that performs the filtering.
- TweetAlias = Tweet.alias()
- # The subquery will select the relevant data from the Tweet and
- # User table, as well as ranking the tweets by user from newest
- # to oldest.
- subquery = (TweetAlias
- .select(
- TweetAlias.message,
- User.username,
- fn.RANK().over(
- partition_by=[TweetAlias.user],
- order_by=[TweetAlias.created_date.desc()]).alias('rnk'))
- .join(User, on=(TweetAlias.user == User.id))
- .alias('subq'))
- # Since we can't filter on the rank, we are wrapping it in a query
- # and performing the filtering in the outer query.
- query = (Tweet
- .select(subquery.c.message, subquery.c.username)
- .from_(subquery)
- .where(subquery.c.rnk <= 3))
Other methods
If you’re not using Postgres, then unfortunately you’re left with options that exhibit less-than-ideal performance. For a more complete overview of common methods, check out this blog post. Below I will summarize the approaches and the corresponding SQL.
Using COUNT
, we can get all tweets where there exist less than N tweets with more recent timestamps:
- TweetAlias = Tweet.alias()
- # Create a correlated subquery that calculates the number of
- # tweets with a higher (newer) timestamp than the tweet we're
- # looking at in the outer query.
- subquery = (TweetAlias
- .select(fn.COUNT(TweetAlias.id))
- .where(
- (TweetAlias.created_date >= Tweet.created_date) &
- (TweetAlias.user == Tweet.user)))
- # Wrap the subquery and filter on the count.
- query = (Tweet
- .select(Tweet, User)
- .join(User)
- .where(subquery <= 3))
We can achieve similar results by doing a self-join and performing the filtering in the HAVING
clause:
- TweetAlias = Tweet.alias()
- # Use a self-join and join predicates to count the number of
- # newer tweets.
- query = (Tweet
- .select(Tweet.id, Tweet.message, Tweet.user, User.username)
- .join(User)
- .switch(Tweet)
- .join(TweetAlias, on=(
- (TweetAlias.user == Tweet.user) &
- (TweetAlias.created_date >= Tweet.created_date)))
- .group_by(Tweet.id, Tweet.content, Tweet.user, User.username)
- .having(fn.COUNT(Tweet.id) <= 3))
The last example uses a LIMIT
clause in a correlated subquery.
- TweetAlias = Tweet.alias()
- # The subquery here will calculate, for the user who created the
- # tweet in the outer loop, the three newest tweets. The expression
- # will evaluate to `True` if the outer-loop tweet is in the set of
- # tweets represented by the inner query.
- query = (Tweet
- .select(Tweet, User)
- .join(User)
- .where(Tweet.id << (
- TweetAlias
- .select(TweetAlias.id)
- .where(TweetAlias.user == Tweet.user)
- .order_by(TweetAlias.created_date.desc())
- .limit(3))))