Advanced analysis using SQL and TimescaleDB continuous aggregates and hyperfunctions

So far in this tutorial, you have ingested the data and run an aggregate query. Then you improved the performance of your analysis with continuous aggregates.

Now, let’s go over some ideas on analyzing the data using PostgreSQL and TimescaleDB, to help you understand more about player activity during the NFL season.

tip

Some of this analysis includes visualizations to help you see the potential uses of this data. These are created using the Matplotlib Python module, which is one of many great visualization tools.

Average yards run for a player over a game

This query uses a percentile approximation hyperfunction to find the mean yards run per game by a single player.

  1. WITH sum_yards AS (
  2. SELECT a.player_id, display_name, SUM(yards) AS yards, gameid
  3. FROM player_yards_by_game a
  4. LEFT JOIN player p ON a.player_id = p.player_id
  5. GROUP BY a.player_id, display_name, gameid
  6. )
  7. SELECT player_id, display_name, mean(percentile_agg(yards)) as yards
  8. FROM sum_yards
  9. GROUP BY player_id, display_name
  10. ORDER BY yards DESC

Your data should look like this:

player_iddisplay_nameyards
NULLNULL2872.5647430830086
2508061Antonio Brown1125.1706666666641
2556214Tyreek Hill1007.1073333333323
2543495Davante Adams971.6339999999967
2543498Brandin Cooks969.3762499999964

When you run this query you might notice that the player_id and display_name are null for the first row. This row represents the average yard data for the football.

Average and median yards run per game by type of player (not taking avg of individual)

For this query, you will use another one of the TimescaleDB percentile functions called percentile_agg. You can use the percentile_agg function to find the 50th percentile which is the approximate median.

  1. WITH sum_yards AS (
  2. --Add position to the table to allow for grouping by it later
  3. SELECT a.player_id, display_name, SUM(yards) AS yards, p.position, gameid
  4. FROM player_yards_by_game a
  5. LEFT JOIN player p ON a.player_id = p.player_id
  6. GROUP BY a.player_id, display_name, p.position, gameid
  7. )
  8. --Find the mean and median for each position type
  9. SELECT position, mean(percentile_agg(yards)) AS mean_yards, approx_percentile(0.5, percentile_agg(yards)) AS median_yards
  10. FROM sum_yards
  11. GROUP BY position
  12. ORDER BY mean_yards DESC

If you scroll to the bottom of your results you should see this:

positionmean_yardsmedian_yards
HB275.04279069767404250.88667462709043
DE185.7616201117313333.750683636185684
FB100.3791284403669167.0876116670915
DT19.69249999999999217.796475991050432

Notice how the Defensive End (DE) position has a large discrepancy between its mean and median values. The median data implies that most DE players do not run very much during passing plays. However, the mean data implies that some of the DE players must be running a significant amount.

Number of snap plays by player where they were on the offense

In this query, you are counting the number of passing events a player was involved in while playing offense. You might notice how much slower this query runs than the ones above which use continuous aggregates. The speed you see here is comparable to what you would get in the other queries without using continuous aggregates.

  1. WITH snap_events AS (
  2. -- Create a table that filters the play events to show only snap plays
  3. -- and display the players team information
  4. SELECT DISTINCT player_id, t.event, t.gameid, t.playid,
  5. CASE
  6. WHEN t.team = 'away' THEN g.visitor_team
  7. WHEN t.team = 'home' THEN g.home_team
  8. ELSE NULL
  9. END AS team_name
  10. FROM tracking t
  11. LEFT JOIN game g ON t.gameid = g.game_id
  12. WHERE t.event LIKE '%snap%'
  13. )
  14. -- Count these events and filter results to only display data when the player was
  15. -- on the offensive
  16. SELECT a.player_id, pl.display_name, COUNT(a.event) AS play_count, a.team_name
  17. FROM snap_events a
  18. LEFT JOIN play p ON a.gameid = p.gameid AND a.playid = p.playid
  19. LEFT JOIN player pl ON a.player_id = pl.player_id
  20. WHERE a.team_name = p.possessionteam
  21. GROUP BY a.player_id, pl.display_name, a.team_name
  22. ORDER BY play_count DESC

Notice that the two highest passing plays are for Ben Roethlisberger and JuJu Smith-Schuster, a Quarterback and Wide Receiver, respectively, for the Pittsburgh Steelers.

player_iddisplay_nameplay_countteam
2506109Ben Roethlisberger725PIT
2558149JuJu Smith-Schuster691PIT
2533031Andrew Luck683IND

Number of plays vs points scored

Use this query to get data on the number of plays and final score for each game during the 2018 season. This data is separated by team so that we can compare the number of plays with a team’s win or loss.

  1. WITH play_count AS (
  2. -- Count distinct plays, join on the stadium and game tables for team names and game date
  3. SELECT gameid, COUNT(playdescription) AS plays, p.possessionteam as team_name, g.game_date
  4. FROM play p
  5. LEFT JOIN game g ON p.gameid = g.game_id
  6. GROUP BY gameid, team_name, game_date
  7. ), visiting_games AS (
  8. -- Join on scores to grab only the visting team's data
  9. SELECT gameid, plays, s.visitor_team as team_name, s.visitor_score AS team_score FROM play_count p
  10. INNER JOIN scores s ON p.team_name = s.visitor_team_abb
  11. AND p.game_date = s."date"
  12. ), home_games AS (
  13. -- Join on scores to grab only the home team's data
  14. SELECT gameid, plays, s.home_team AS team_name , s.home_score AS team_score FROM play_count p
  15. INNER JOIN scores s ON p.team_name = s.home_team_abb
  16. AND p.game_date = s."date"
  17. )
  18. -- union the two resulting tables together
  19. SELECT * FROM visiting_games
  20. UNION ALL
  21. SELECT * FROM home_games
  22. ORDER BY gameid ASC, team_score DESC

The image below is an example of a visualization that you could create with the results from this query. The scatter plot is grouped, showing the winning team’s plays and scores as gold, and the losing team’s plays and scores as brown.

The y-axis, or the number of plays for one team during a single game, shows that more passing plays do not always imply a guaranteed win. In fact, the top three teams with the highest number of plays for a single game all appeared to have lost. (In football, this makes logical sense, as teams that are behind late in the game tend to pass more.) There are many interesting facts which you could glean from this type of query, this scatter plot is just one possibility.

Average yards per game for top three players of each position

You can use this PostgreSQL query to extract the average yards run by an individual player over one game. This query only includes the top three highest players’ average yard values per position type. The data is ordered by the average yards run across all players for each position. This becomes important later on.

Note: This query excludes some position types from the list due to such low average yard values, the excluded positions are Kicker, Punter, Nose Tackle, Long Snapper, and Defensive Tackle

  1. WITH total_yards AS (
  2. -- This table sums the yards a player runs over each game
  3. SELECT t.player_id, SUM(yards) AS yards, t.gameid
  4. FROM player_yards_by_game t
  5. GROUP BY t.player_id, t.gameid
  6. ), avg_yards AS (
  7. -- This table takes the average of the yards run by each player and calls out thier position
  8. SELECT p.player_id, p.display_name, AVG(yards) AS avg_yards, p."position"
  9. FROM total_yards t
  10. LEFT JOIN player p ON t.player_id = p.player_id
  11. GROUP BY p.player_id, p.display_name, p."position"
  12. ), ranked_vals AS (
  13. -- This table ranks each player by the average yards they run per game
  14. SELECT a.*, RANK() OVER (PARTITION BY a."position" ORDER BY avg_yards DESC)
  15. FROM avg_yards AS a
  16. ), ranked_positions AS (
  17. -- This table takes the average of the average yards run for each player so that we can order
  18. -- the positions by this average of averages
  19. SELECT v."position", AVG(v.avg_yards) AS avg_yards_positions
  20. FROM ranked_vals v
  21. GROUP BY v."position"
  22. )
  23. SELECT v.*, p.avg_yards_positions FROM ranked_vals v
  24. LEFT JOIN ranked_positions p ON v.position = p.position
  25. WHERE v.rank <= 3 AND v.position != 'null' AND v.position NOT IN ('K', 'P', 'NT', 'LS', 'DT')
  26. ORDER BY p.avg_yards_positions DESC, v.rank ASC

This is one possible visualization that you could create with this data:

Notice that the average yards overall for Free Safety players is higher than that of Wide Receivers (this is because of how we ordered the data, noted above). However, individual Wide Receivers run more yards on average per game. Also, notice that Kyle Juszczyk runs far more on average than other Fullback players.

It’s only halftime!

These example queries are just the beginning examples of the analysis you can perform on any time-series data with regular SQL and helpful features like continuous aggregates. Consider joining in stadium data that we provided to see if teams tend to score or run less at Mile High Stadium. Does natural or artificial turf affect any teams consistently?