Ingest data and run your first query

The main dataset is provided by Kaggle as multiple CSV files. Additionally, we have gathered other information about stadiums and the outcome of each game to provide you with additional data to analyze.

Each CSV file corresponds to a table in the database, containing the following information:

  • game

    • Information about each game (home team, visitor team, week of play, and more)
    • game_id is a primary key.
  • player

    • Player information (display_name, college, position, and more)
    • player_id is a primary key.
  • play

    • Play information (game, play, quarter, down, pass result). Lots of good overall play information to analyze.
    • To query a specific play, you need to use gameid and playid together, as some playid values are reused from game to game (see the example query after this list).
  • tracking

    • Player tracking information from each play sampled multiple times a second.
    • Fields include acceleration, X/Y coordinates on the field, and others.
    • x and y indicate the physical positions of the players on the field using the coordinates outlined in the data description on the Kaggle website.
    • This is the largest table (18M+ rows) in the database.
  • scores

    • Final result of each game.
    • This table can be joined with the tracking table using the home_team_abb and visitor_team_abb fields.
  • stadium_info

    • Home stadium of each team and additional information like surface, roof_type, location.
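
Because playid values repeat across games, any query that targets a single play has to filter on both keys. Here is a minimal sketch (run it after the tables below are created; the gameid and playid values are placeholders, not a known play):

  -- Both keys are needed: playid alone is ambiguous across games.
  SELECT p.playDescription, COUNT(*) AS tracking_samples
  FROM play p
  JOIN tracking t
    ON t.gameid = p.gameId
   AND t.playid = p.playId
  WHERE p.gameId = 2018090600
    AND p.playId = 75
  GROUP BY p.playDescription;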

Create the tables with this SQL:

  CREATE TABLE game (
    game_id INT PRIMARY KEY,
    game_date DATE,
    gametime_et TIME,
    home_team TEXT,
    visitor_team TEXT,
    week INT
  );

  CREATE TABLE player (
    player_id INT PRIMARY KEY,
    height TEXT,
    weight INT,
    birthDate DATE,
    collegeName TEXT,
    position TEXT,
    displayName TEXT
  );

  CREATE TABLE play (
    gameId INT,
    playId INT,
    playDescription TEXT,
    quarter INT,
    down INT,
    yardsToGo INT,
    possessionTeam TEXT,
    playType TEXT,
    yardlineSide TEXT,
    yardlineNumber INT,
    offenseFormation TEXT,
    personnelO TEXT,
    defendersInTheBox INT,
    numberOfPassRushers INT,
    personnelD TEXT,
    typeDropback TEXT,
    preSnapVisitorScore INT,
    preSnapHomeScore INT,
    gameClock TIME,
    absoluteYardlineNumber INT,
    penaltyCodes TEXT,
    penaltyJerseyNumber TEXT,
    passResult TEXT,
    offensePlayResult INT,
    playResult INT,
    epa DOUBLE PRECISION,
    isDefensivePI BOOLEAN
  );

  -- The CSV's nflId column is stored as player_id and its playId column as
  -- playid, so the table matches the joins and indexes used later on.
  CREATE TABLE tracking (
    time TIMESTAMP,
    x DOUBLE PRECISION,
    y DOUBLE PRECISION,
    s DOUBLE PRECISION,
    a DOUBLE PRECISION,
    dis DOUBLE PRECISION,
    o DOUBLE PRECISION,
    dir DOUBLE PRECISION,
    event TEXT,
    player_id INT,
    displayName TEXT,
    jerseyNumber INT,
    position TEXT,
    frameId INT,
    team TEXT,
    gameid INT,
    playid INT,
    playDirection TEXT,
    route TEXT
  );

  CREATE TABLE scores (
    scoreid INT PRIMARY KEY,
    date DATE,
    visitor_team TEXT,
    visitor_team_abb TEXT,
    visitor_score INT,
    home_team TEXT,
    home_team_abb TEXT,
    home_score INT
  );

  CREATE TABLE stadium_info (
    stadiumid INT PRIMARY KEY,
    stadium_name TEXT,
    location TEXT,
    surface TEXT,
    roof_type TEXT,
    team_name TEXT,
    team_abbreviation TEXT,
    time_zone TEXT
  );
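
As an optional sanity check (standard PostgreSQL, not TimescaleDB-specific), you can confirm that all six tables exist:

  -- List the tables created above
  SELECT table_name
  FROM information_schema.tables
  WHERE table_schema = 'public'
  ORDER BY table_name;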

Add indexes to the tracking table to improve query performance:

  CREATE INDEX idx_gameid ON tracking (gameid);
  CREATE INDEX idx_playerid ON tracking (player_id);
  CREATE INDEX idx_playid ON tracking (playid);
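
With the indexes in place, per-game and per-player lookups can use an index scan instead of reading the whole table. You can check the plan with EXPLAIN (the gameid value is a placeholder):

  -- Verify that a per-game lookup uses idx_gameid
  EXPLAIN
  SELECT COUNT(*)
  FROM tracking
  WHERE gameid = 2018090600;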

Create a hypertable from the tracking table

  /*
  tracking: name of the table
  time: name of the timestamp column
  */
  SELECT create_hypertable('tracking', 'time');
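
By default, create_hypertable partitions the data into chunks covering 7 days each. If you want to experiment with chunk sizing, the interval can be set explicitly instead; this alternative call is optional and replaces the one above rather than following it:

  -- Optional alternative: create the hypertable with an explicit chunk interval
  SELECT create_hypertable('tracking', 'time',
    chunk_time_interval => INTERVAL '1 day');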

Ingest data from CSV files

There are three separate CSV files for game, player, and play tables. For the tracking table, you need to import data from 17 CSV files (one file for each week of the season).

You can use a Python script that uses psycopg2’s copy_expert function to ingest the data:

  import config
  import psycopg2

  # connect to the database
  conn = psycopg2.connect(database=config.DB_NAME,
                          host=config.HOST,
                          user=config.USER,
                          password=config.PASS,
                          port=config.PORT)

  # insert CSV file into given table
  def insert(csv_file, table_name):
      cur = conn.cursor()
      copy_sql = """
                 COPY {table} FROM stdin WITH CSV HEADER
                 DELIMITER as ','
                 """.format(table=table_name)
      with open(csv_file, 'r') as f:
          cur.copy_expert(sql=copy_sql, file=f)
          conn.commit()
          cur.close()

  print("Inserting games.csv")
  insert("data/games.csv", "game")

  print("Inserting plays.csv")
  insert("data/plays.csv", "play")

  print("Inserting players.csv")
  insert("data/players.csv", "player")

  print("Inserting stadium_info.csv")
  insert("data/stadium_info.csv", "stadium_info")

  print("Inserting scores.csv")
  insert("data/scores.csv", "scores")

  # iterate over each week's CSV file and insert them
  for i in range(1, 18):
      print("Inserting week{i}.csv".format(i=i))
      insert("data/week{i}.csv".format(i=i), "tracking")

  conn.close()
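
After the script finishes, a quick row count confirms that the ingest worked; the tracking table should contain 18M+ rows:

  -- Sanity check: row counts after ingestion
  SELECT
    (SELECT COUNT(*) FROM game)     AS games,
    (SELECT COUNT(*) FROM play)     AS plays,
    (SELECT COUNT(*) FROM tracking) AS tracking_rows;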

Run your first query

Now that you have all the data ingested, you can run the first aggregation query and examine the results. For most of the example queries in this tutorial, you’ll need to aggregate data from the tracking table, which contains multiple rows per player for each play, because the data is sampled multiple times per second during each play.

Number of yards run in game for passing plays, by player and game

This query sums all yards for each player in every game. You can then join that with the player table to get player details:

  SELECT t.player_id, p.displayName, SUM(t.dis) AS yards, t.gameid
  FROM tracking t
  LEFT JOIN player p ON t.player_id = p.player_id
  GROUP BY t.player_id, p.displayName, t.gameid
  ORDER BY t.gameid ASC, yards DESC;

Your data should look like this:

   player_id |  displayname   |       yards        |   gameid
  -----------+----------------+--------------------+------------
     2495454 | Julio Jones    | 1030.6399999999971 | 2018090600
     2507763 | Mike Wallace   |  940.0099999999989 | 2018090600
     2552600 | Nelson Agholor |  908.0299999999983 | 2018090600
     2540158 | Zach Ertz      |  882.0899999999953 | 2018090600
     2539653 | Robert Alford  |  881.7599999999983 | 2018090600
     2555383 | Jalen Mills    |  856.1199999999916 | 2018090600

You might have noticed, however, that this query takes a long time to run, because PostgreSQL has to aggregate every row in the tracking table to get the total yards for each player in each game. That’s a lot of work when it needs to scan more than 18 million rows. On our small test machine this query often takes 25-30 seconds to run.
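
You can measure this on your own machine with EXPLAIN ANALYZE, which runs the query and reports the actual execution time:

  -- Measure the cost of aggregating the raw tracking data
  EXPLAIN ANALYZE
  SELECT t.player_id, SUM(t.dis) AS yards, t.gameid
  FROM tracking t
  GROUP BY t.player_id, t.gameid;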

Faster queries with continuous aggregates

Most of the data we were interested in is based on this aggregation of the tracking data. We wanted to know how far a player traveled on each play or throughout each game. Rather than asking TimescaleDB to query and aggregate that raw data every time, we created a continuous aggregate out of this base query to significantly improve the speed of queries and analysis.

Create continuous aggregate of player yards per game

  CREATE MATERIALIZED VIEW player_yards_by_game
  WITH (timescaledb.continuous) AS
  SELECT t.player_id, t.gameid, t.position, t.team,
    time_bucket(INTERVAL '1 day', t."time") AS bucket,
    SUM(t.dis) AS yards,
    AVG(t.a) AS acc
  FROM tracking t
  GROUP BY t.player_id, t.gameid, t.position, t.team, bucket;
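
Creating the view materializes the existing data. Because this dataset is historical and no new rows arrive, you only need to refresh it manually if you re-ingest data; passing NULL for both window bounds refreshes the entire time range:

  -- Refresh the continuous aggregate over its whole time range
  CALL refresh_continuous_aggregate('player_yards_by_game', NULL, NULL);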

When you have created the continuous aggregate, modify the query to use the materialized data, and notice that the response time is now significantly faster: under one second on our test machine.

  SELECT pyg.player_id, p.displayName, SUM(yards) AS yards, pyg.gameid
  FROM player_yards_by_game pyg
  LEFT JOIN player p ON pyg.player_id = p.player_id
  GROUP BY pyg.player_id, p.displayName, pyg.gameid
  ORDER BY pyg.gameid ASC, yards DESC;

We’ll use this continuous aggregate in most of the queries in the next section. Feel free to play with other variations of this materialized data as you try to answer even more questions with TimescaleDB.
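
For example, because the view also stores average acceleration (acc), one variation ranks player-game combinations by average acceleration. This is only a sketch of the kind of question the materialized data can answer:

  -- Variation: top ten player-game combinations by average acceleration
  SELECT pyg.player_id, p.displayName, AVG(pyg.acc) AS avg_acc, pyg.gameid
  FROM player_yards_by_game pyg
  LEFT JOIN player p ON pyg.player_id = p.player_id
  GROUP BY pyg.player_id, p.displayName, pyg.gameid
  ORDER BY avg_acc DESC
  LIMIT 10;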