第二章
原文:Chapter 2
译者:飞龙
# 通常的开头
import pandas as pd
# 使图表更大更漂亮
pd.set_option('display.mpl_style', 'default')
pd.set_option('display.line_width', 5000)
pd.set_option('display.max_columns', 60)
figsize(15, 5)
我们将在这里使用一个新的数据集,来演示如何处理更大的数据集。 这是来自 NYC Open Data 的 311 个服务请求的子集。
complaints = pd.read_csv('../data/311-service-requests.csv')
2.1 里面究竟有什么?(总结)
当你查看一个大型数据框架,而不是显示数据框架的内容,它会显示一个摘要。 这包括所有列,以及每列中有多少非空值。
complaints
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 52 columns):
Unique Key 111069 non-null values
Created Date 111069 non-null values
Closed Date 60270 non-null values
Agency 111069 non-null values
Agency Name 111069 non-null values
Complaint Type 111069 non-null values
Descriptor 111068 non-null values
Location Type 79048 non-null values
Incident Zip 98813 non-null values
Incident Address 84441 non-null values
Street Name 84438 non-null values
Cross Street 1 84728 non-null values
Cross Street 2 84005 non-null values
Intersection Street 1 19364 non-null values
Intersection Street 2 19366 non-null values
Address Type 102247 non-null values
City 98860 non-null values
Landmark 95 non-null values
Facility Type 110938 non-null values
Status 111069 non-null values
Due Date 39239 non-null values
Resolution Action Updated Date 96507 non-null values
Community Board 111069 non-null values
Borough 111069 non-null values
X Coordinate (State Plane) 98143 non-null values
Y Coordinate (State Plane) 98143 non-null values
Park Facility Name 111069 non-null values
Park Borough 111069 non-null values
School Name 111069 non-null values
School Number 111052 non-null values
School Region 110524 non-null values
School Code 110524 non-null values
School Phone Number 111069 non-null values
School Address 111069 non-null values
School City 111069 non-null values
School State 111069 non-null values
School Zip 111069 non-null values
School Not Found 38984 non-null values
School or Citywide Complaint 0 non-null values
Vehicle Type 99 non-null values
Taxi Company Borough 117 non-null values
Taxi Pick Up Location 1059 non-null values
Bridge Highway Name 185 non-null values
Bridge Highway Direction 185 non-null values
Road Ramp 184 non-null values
Bridge Highway Segment 223 non-null values
Garage Lot Name 49 non-null values
Ferry Direction 37 non-null values
Ferry Terminal Name 336 non-null values
Latitude 98143 non-null values
Longitude 98143 non-null values
Location 98143 non-null values
dtypes: float64(5), int64(1), object(46)
2.2 选择列和行
为了选择一列,使用列名称作为索引,像这样:
complaints['Complaint Type']
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
5 Noise - Commercial
6 Blocked Driveway
7 Noise - Commercial
8 Noise - Commercial
9 Noise - Commercial
10 Noise - House of Worship
11 Noise - Commercial
12 Illegal Parking
13 Noise - Vehicle
14 Rodent
...
111054 Noise - Street/Sidewalk
111055 Noise - Commercial
111056 Street Sign - Missing
111057 Noise
111058 Noise - Commercial
111059 Noise - Street/Sidewalk
111060 Noise
111061 Noise - Commercial
111062 Water System
111063 Water System
111064 Maintenance or Facility
111065 Illegal Parking
111066 Noise - Street/Sidewalk
111067 Noise - Commercial
111068 Blocked Driveway
Name: Complaint Type, Length: 111069, dtype: object
要获得DataFrame
的前 5 行,我们可以使用切片:df [:5]
。
这是一个了解数据框架中存在什么信息的很好方式 - 花一点时间来查看内容并获得此数据集的感觉。
complaints[:5]
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26589651 | 10/31/2013 02:08:41 AM | NaN | NYPD | New York City Police Department | Noise - Street/Sidewalk | Loud Talking | Street/Sidewalk | 11432 | 90-03 169 STREET | 169 STREET | 90 AVENUE | 91 AVENUE | NaN | NaN | ADDRESS | JAMAICA | NaN | Precinct | Assigned | 10/31/2013 10:08:41 AM | 10/31/2013 02:35:17 AM | 12 QUEENS | QUEENS | 1042027 | 197389 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.708275 | -73.791604 | (40.70827532593202, -73.79160395779721) |
1 | 26593698 | 10/31/2013 02:01:04 AM | NaN | NYPD | New York City Police Department | Illegal Parking | Commercial Overnight Parking | Street/Sidewalk | 11378 | 58 AVENUE | 58 AVENUE | 58 PLACE | 59 STREET | NaN | NaN | BLOCKFACE | MASPETH | NaN | Precinct | Open | 10/31/2013 10:01:04 AM | NaN | 05 QUEENS | QUEENS | 1009349 | 201984 | Unspecified | QUEENS | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.721041 | -73.909453 | (40.721040535628305, -73.90945306791765) |
2 | 26594139 | 10/31/2013 02:00:24 AM | 10/31/2013 02:40:32 AM | NYPD | New York City Police Department | Noise - Commercial | Loud Music/Party | Club/Bar/Restaurant | 10032 | 4060 BROADWAY | BROADWAY | WEST 171 STREET | WEST 172 STREET | NaN | NaN | ADDRESS | NEW YORK | NaN | Precinct | Closed | 10/31/2013 10:00:24 AM | 10/31/2013 02:39:42 AM | 12 MANHATTAN | MANHATTAN | 1001088 | 246531 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.843330 | -73.939144 | (40.84332975466513, -73.93914371913482) |
3 | 26595721 | 10/31/2013 01:56:23 AM | 10/31/2013 02:21:48 AM | NYPD | New York City Police Department | Noise - Vehicle | Car/Truck Horn | Street/Sidewalk | 10023 | WEST 72 STREET | WEST 72 STREET | COLUMBUS AVENUE | AMSTERDAM AVENUE | NaN | NaN | BLOCKFACE | NEW YORK | NaN | Precinct | Closed | 10/31/2013 09:56:23 AM | 10/31/2013 02:21:10 AM | 07 MANHATTAN | MANHATTAN | 989730 | 222727 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.778009 | -73.980213 | (40.7780087446372, -73.98021349023975) |
4 | 26590930 | 10/31/2013 01:53:44 AM | NaN | DOHMH | Department of Health and Mental Hygiene | Rodent | Condition Attracting Rodents | Vacant Lot | 10027 | WEST 124 STREET | WEST 124 STREET | LENOX AVENUE | ADAM CLAYTON POWELL JR BOULEVARD | NaN | NaN | BLOCKFACE | NEW YORK | NaN | N/A | Pending | 11/30/2013 01:53:44 AM | 10/31/2013 01:59:54 AM | 10 MANHATTAN | MANHATTAN | 998815 | 233545 | Unspecified | MANHATTAN | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | Unspecified | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 40.807691 | -73.947387 | (40.80769092704951, |
我们可以组合它们来获得一列的前五行。
complaints['Complaint Type'][:5]
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
Name: Complaint Type, dtype: object
并且无论我们以什么方向:
complaints[:5]['Complaint Type']
0 Noise - Street/Sidewalk
1 Illegal Parking
2 Noise - Commercial
3 Noise - Vehicle
4 Rodent
Name: Complaint Type, dtype: object
2.3 选择多列
如果我们只关心投诉类型和区,但不关心其余的信息怎么办? Pandas 使它很容易选择列的一个子集:只需将所需列的列表用作索引。
complaints[['Complaint Type', 'Borough']]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 2 columns):
Complaint Type 111069 non-null values
Borough 111069 non-null values
dtypes: object(2)
这会向我们展示总结,我们可以获取前 10 列:
complaints[['Complaint Type', 'Borough']][:10]
Complaint Type | Borough | |
---|---|---|
0 | Noise - Street/Sidewalk | QUEENS |
1 | Illegal Parking | QUEENS |
2 | Noise - Commercial | MANHATTAN |
3 | Noise - Vehicle | MANHATTAN |
4 | Rodent | MANHATTAN |
5 | Noise - Commercial | QUEENS |
6 | Blocked Driveway | QUEENS |
7 | Noise - Commercial | QUEENS |
8 | Noise - Commercial | MANHATTAN |
9 | Noise - Commercial | BROOKLYN |
2.4 什么是最常见的投诉类型?
这是个易于回答的问题,我们可以调用.value_counts()
方法:
complaints['Complaint Type'].value_counts()
HEATING 14200
GENERAL CONSTRUCTION 7471
Street Light Condition 7117
DOF Literature Request 5797
PLUMBING 5373
PAINT - PLASTER 5149
Blocked Driveway 4590
NONCONST 3998
Street Condition 3473
Illegal Parking 3343
Noise 3321
Traffic Signal Condition 3145
Dirty Conditions 2653
Water System 2636
Noise - Commercial 2578
...
Opinion for the Mayor 2
Window Guard 2
DFTA Literature Request 2
Legal Services Provider Complaint 2
Open Flame Permit 1
Snow 1
Municipal Parking Facility 1
X-Ray Machine/Equipment 1
Stalled Sites 1
DHS Income Savings Requirement 1
Tunnel Condition 1
Highway Sign - Damaged 1
Ferry Permit 1
Trans Fat 1
DWD 1
Length: 165, dtype: int64
如果我们想要最常见的 10 个投诉类型,我们可以这样:
complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
HEATING 14200
GENERAL CONSTRUCTION 7471
Street Light Condition 7117
DOF Literature Request 5797
PLUMBING 5373
PAINT - PLASTER 5149
Blocked Driveway 4590
NONCONST 3998
Street Condition 3473
Illegal Parking 3343
dtype: int64
但是还可以更好,我们可以绘制出来!
complaint_counts[:10].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x7ba2290>