完成NYC测试

NYC Taxi(纽约出租车) 数据集收集了纽约市数十亿出租车行程的详细信息,包括接送日期/时间、接送地点、行程距离、详细票价、费率、支付类型和乘客数量。
通过本教程,您将了解如何使用MatrixOne来完成对NYC Taxi数据集的查询。

该数据集的详细信息以及下载教程可参见:
* https://github.com/toddwschneider/nyc-taxi-data

准备工作

确保你已经安装了单机版MatrixOne连接到MatrixOne服务.

1. 下载并导入数据

本节教程参考https://github.com/toddwschneider/nyc-taxi-data,您可以从此处了解关于数据集下载和导入的详细信息。
该数据集共有17亿行数据,占用450 GB的空间,请确保有足够的空间来存放数据。

  • 安装PostgreSQL与PostGIS

在Mac上,两者均可通过Homebrew安装。

  • 下载原始数据
  1. ./download_raw_data.sh && ./remove_bad_rows.sh

remove_bad_rows.sh脚本修复了两个特定的文件,其中数据极少但是列字段很多。

  • 初始化数据库
  1. ./initialize_database.sh
  • 导入数据
  1. ./import_trip_data.sh
  2. ./import_fhv_trip_data.sh
  • 可选操作:下载并导入2014年Uber数据

FiveThirtyEight Uber数据集包含了2014年4月至9月的Uber出行记录。Uber和其他FHV (Lyft、Juno、Via等)的数据自2015年1月以来就被存储在TLC中。

  1. ./download_raw_2014_uber_data.sh
  2. ./import_2014_uber_trip_data.sh

2. 从PostgreSQL导出数据

  1. COPY
  2. (
  3. SELECT trips.id,
  4. trips.vendor_id,
  5. trips.pickup_datetime,
  6. trips.dropoff_datetime,
  7. trips.store_and_fwd_flag,
  8. trips.rate_code_id,
  9. trips.pickup_longitude,
  10. trips.pickup_latitude,
  11. trips.dropoff_longitude,
  12. trips.dropoff_latitude,
  13. trips.passenger_count,
  14. trips.trip_distance,
  15. trips.fare_amount,
  16. trips.extra,
  17. trips.mta_tax,
  18. trips.tip_amount,
  19. trips.tolls_amount,
  20. trips.ehail_fee,
  21. trips.improvement_surcharge,
  22. trips.total_amount,
  23. trips.payment_type,
  24. trips.trip_type,
  25. trips.pickup_location_id,
  26. trips.dropoff_location_id,
  27. cab_types.type cab_type,
  28. weather.precipitation rain,
  29. weather.snow_depth,
  30. weather.snowfall,
  31. weather.max_temperature max_temp,
  32. weather.min_temperature min_temp,
  33. weather.average_wind_speed wind,
  34. pick_up.gid pickup_nyct2010_gid,
  35. pick_up.ctlabel pickup_ctlabel,
  36. pick_up.borocode pickup_borocode,
  37. pick_up.boroname pickup_boroname,
  38. pick_up.ct2010 pickup_ct2010,
  39. pick_up.boroct2010 pickup_boroct2010,
  40. pick_up.cdeligibil pickup_cdeligibil,
  41. pick_up.ntacode pickup_ntacode,
  42. pick_up.ntaname pickup_ntaname,
  43. pick_up.puma pickup_puma,
  44. drop_off.gid dropoff_nyct2010_gid,
  45. drop_off.ctlabel dropoff_ctlabel,
  46. drop_off.borocode dropoff_borocode,
  47. drop_off.boroname dropoff_boroname,
  48. drop_off.ct2010 dropoff_ct2010,
  49. drop_off.boroct2010 dropoff_boroct2010,
  50. drop_off.cdeligibil dropoff_cdeligibil,
  51. drop_off.ntacode dropoff_ntacode,
  52. drop_off.ntaname dropoff_ntaname,
  53. drop_off.puma dropoff_puma
  54. FROM trips
  55. LEFT JOIN cab_types
  56. ON trips.cab_type_id = cab_types.id
  57. LEFT JOIN central_park_weather_observations weather
  58. ON weather.date = trips.pickup_datetime::date
  59. LEFT JOIN nyct2010 pick_up
  60. ON pick_up.gid = trips.pickup_nyct2010_gid
  61. LEFT JOIN nyct2010 drop_off
  62. ON drop_off.gid = trips.dropoff_nyct2010_gid
  63. ) TO '/matrixone/export_data/trips.tsv';

3. 在MatrixOne中建表

  1. CREATE TABLE trips
  2. (
  3. trip_id int unsigned,
  4. vendor_id varchar(64),
  5. pickup_datetime bigint unsigned,
  6. dropoff_datetime bigint unsigned,
  7. store_and_fwd_flag char(1),
  8. rate_code_id smallint unsigned,
  9. pickup_longitude double,
  10. pickup_latitude double,
  11. dropoff_longitude double,
  12. dropoff_latitude double,
  13. passenger_count smallint unsigned,
  14. trip_distance double,
  15. distance bigint,
  16. fare_amount float,
  17. extra float,
  18. mta_tax float,
  19. tip_amount float,
  20. tolls_amount float,
  21. ehail_fee float,
  22. improvement_surcharge float,
  23. total_amount float,
  24. payment_type varchar(64),
  25. trip_type smallint unsigned,
  26. pickup varchar(64),
  27. dropoff varchar(64),
  28. cab_type varchar(64),
  29. precipitation float,
  30. snow_depth float,
  31. snowfall float,
  32. max_temperature smallint,
  33. min_temperature smallint,
  34. average_wind_speed float,
  35. pickup_nyct2010_gid smallint unsigned,
  36. pickup_ctlabel varchar(64),
  37. pickup_borocode smallint unsigned,
  38. pickup_boroname varchar(64),
  39. pickup_ct2010 varchar(64),
  40. pickup_boroct2010 varchar(64),
  41. pickup_cdeligibil char(1),
  42. pickup_ntacode varchar(64),
  43. pickup_ntaname varchar(64),
  44. pickup_puma varchar(64),
  45. dropoff_nyct2010_gid smallint unsigned,
  46. dropoff_ctlabel varchar(64),
  47. dropoff_borocode smallint unsigned,
  48. dropoff_boroname varchar(64),
  49. dropoff_ct2010 varchar(64),
  50. dropoff_boroct2010 varchar(64),
  51. dropoff_cdeligibil varchar(64),
  52. dropoff_ntacode varchar(64),
  53. dropoff_ntaname varchar(64),
  54. dropoff_puma varchar(64)
  55. ) ;

4. 向表中插入数据

  1. load data infile '/matrixone/export_data/trips.tsv ' into table trips
  2. FIELDS TERMINATED BY ','
  3. LINES TERMINATED BY '\n';

然后便可以使用MatrixOne中的查询语句来查询表中数据。

5. Run Queries

  1. # Q1
  2. SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
  3. # Q2
  4. SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
  5. # Q3
  6. SELECT passenger_count, year(pickup_datetime) as year, count(*) FROM trips GROUP BY passenger_count, year;
  7. # Q4
  8. SELECT passenger_count, year(pickup_datetime) as year, round(trip_distance) AS distance, count(*) as count
  9. FROM trips
  10. GROUP BY passenger_count, year, distance
  11. ORDER BY year,count DESC;