Complete a NYC Test with MatrixOne
New York City (NYC) Taxi data set captures detailed information on billions of individual taxi trips in New York City, including pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts(Most of the raw data comes from the NYC Taxi & Limousine Commission).
By going through this tutorial, you’ll learn how to complete some queries on NYC Taxi data with MatrixOne.
For detail description and instructions for downloading about NYC Taxi Data, you can see:
Before you begin
Make sure you have already installed MatrixOne and connected to MatrixOne Server.
1. Download and Import data
This section of the tutorial references here, and you can get original information about how to download and import raw data there.
The data set has 1.7 billion rows data and takes up 450 GB of space, so make sure there are enough space to hold the data.
- Install PostgreSQL and PostGIS
Both are available via Homebrew on Mac.
- Download raw data
./download_raw_data.sh && ./remove_bad_rows.sh
The remove_bad_rows.sh script fixes two particular files that have a few rows with too many columns.
For more detailed information about this, you can see the original references.
- Initialize database and set up schema
./initialize_database.sh
- Import taxi and FHV data
./import_trip_data.sh
./import_fhv_trip_data.sh
- Optional: download and import 2014 Uber data
The FiveThirtyEight Uber dataset contains Uber trip records from Apr–Sep 2014. Uber and other FHV (Lyft, Juno, Via, etc.) data is available since Jan 2015 in the TLC’s data.
./download_raw_2014_uber_data.sh
./import_2014_uber_trip_data.sh
2. Exporting the data from PostgreSQL
COPY
(
SELECT trips.id,
trips.vendor_id,
trips.pickup_datetime,
trips.dropoff_datetime,
trips.store_and_fwd_flag,
trips.rate_code_id,
trips.pickup_longitude,
trips.pickup_latitude,
trips.dropoff_longitude,
trips.dropoff_latitude,
trips.passenger_count,
trips.trip_distance,
trips.fare_amount,
trips.extra,
trips.mta_tax,
trips.tip_amount,
trips.tolls_amount,
trips.ehail_fee,
trips.improvement_surcharge,
trips.total_amount,
trips.payment_type,
trips.trip_type,
trips.pickup_location_id,
trips.dropoff_location_id,
cab_types.type cab_type,
weather.precipitation rain,
weather.snow_depth,
weather.snowfall,
weather.max_temperature max_temp,
weather.min_temperature min_temp,
weather.average_wind_speed wind,
pick_up.gid pickup_nyct2010_gid,
pick_up.ctlabel pickup_ctlabel,
pick_up.borocode pickup_borocode,
pick_up.boroname pickup_boroname,
pick_up.ct2010 pickup_ct2010,
pick_up.boroct2010 pickup_boroct2010,
pick_up.cdeligibil pickup_cdeligibil,
pick_up.ntacode pickup_ntacode,
pick_up.ntaname pickup_ntaname,
pick_up.puma pickup_puma,
drop_off.gid dropoff_nyct2010_gid,
drop_off.ctlabel dropoff_ctlabel,
drop_off.borocode dropoff_borocode,
drop_off.boroname dropoff_boroname,
drop_off.ct2010 dropoff_ct2010,
drop_off.boroct2010 dropoff_boroct2010,
drop_off.cdeligibil dropoff_cdeligibil,
drop_off.ntacode dropoff_ntacode,
drop_off.ntaname dropoff_ntaname,
drop_off.puma dropoff_puma
FROM trips
LEFT JOIN cab_types
ON trips.cab_type_id = cab_types.id
LEFT JOIN central_park_weather_observations weather
ON weather.date = trips.pickup_datetime::date
LEFT JOIN nyct2010 pick_up
ON pick_up.gid = trips.pickup_nyct2010_gid
LEFT JOIN nyct2010 drop_off
ON drop_off.gid = trips.dropoff_nyct2010_gid
) TO '/matrixone/export_data/trips.tsv';
3. Create tables in MatrixOne
CREATE TABLE trips
(
trip_id int unsigned,
vendor_id varchar(64),
pickup_datetime bigint unsigned,
dropoff_datetime bigint unsigned,
store_and_fwd_flag char(1),
rate_code_id smallint unsigned,
pickup_longitude double,
pickup_latitude double,
dropoff_longitude double,
dropoff_latitude double,
passenger_count smallint unsigned,
trip_distance double,
distance bigint,
fare_amount float,
extra float,
mta_tax float,
tip_amount float,
tolls_amount float,
ehail_fee float,
improvement_surcharge float,
total_amount float,
payment_type varchar(64),
trip_type smallint unsigned,
pickup varchar(64),
dropoff varchar(64),
cab_type varchar(64),
precipitation float,
snow_depth float,
snowfall float,
max_temperature smallint,
min_temperature smallint,
average_wind_speed float,
pickup_nyct2010_gid smallint unsigned,
pickup_ctlabel varchar(64),
pickup_borocode smallint unsigned,
pickup_boroname varchar(64),
pickup_ct2010 varchar(64),
pickup_boroct2010 varchar(64),
pickup_cdeligibil char(1),
pickup_ntacode varchar(64),
pickup_ntaname varchar(64),
pickup_puma varchar(64),
dropoff_nyct2010_gid smallint unsigned,
dropoff_ctlabel varchar(64),
dropoff_borocode smallint unsigned,
dropoff_boroname varchar(64),
dropoff_ct2010 varchar(64),
dropoff_boroct2010 varchar(64),
dropoff_cdeligibil varchar(64),
dropoff_ntacode varchar(64),
dropoff_ntaname varchar(64),
dropoff_puma varchar(64)
) ;
4. Insert data into the created tables
load data infile '/matrixone/export_data/trips.tsv ' into table trips FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Then you can query data in MatrixOne with the created table.
5. Run Queries
# Q1
SELECT cab_type, count(*) FROM trips GROUP BY cab_type;
# Q2
SELECT passenger_count, avg(total_amount) FROM trips GROUP BY passenger_count;
# Q3
SELECT passenger_count, year(pickup_datetime) as year, count(*) FROM trips GROUP BY passenger_count, year;
# Q4
SELECT passenger_count, year(pickup_datetime) as year, round(trip_distance) AS distance, count(*) as count
FROM trips
GROUP BY passenger_count, year, distance
ORDER BY year,count DESC;