7. 翻译SQL的WHERE语句
# 读取employee数据集
In[48]: employee = pd.read_csv('data/employee.csv')
# 对各项做下了解
In[49]: employee.DEPARTMENT.value_counts().head()
Out[49]: Houston Police Department-HPD 638
Houston Fire Department (HFD) 384
Public Works & Engineering-PWE 343
Health & Human Services 110
Houston Airport System (HAS) 106
Name: DEPARTMENT, dtype: int64
In[50]: employee.GENDER.value_counts()
Out[50]: Male 1397
Female 603
Name: GENDER, dtype: int64
In[51]: employee.BASE_SALARY.describe().astype(int)
Out[51]: count 1886
mean 55767
std 21693
min 24960
25% 40170
50% 54461
75% 66614
max 275000
Name: BASE_SALARY, dtype: int64
# 创建布尔条件,并从'UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY'四列选取
In[52]: depts = ['Houston Police Department-HPD',
'Houston Fire Department (HFD)']
criteria_dept = employee.DEPARTMENT.isin(depts)
criteria_gender = employee.GENDER == 'Female'
criteria_sal = (employee.BASE_SALARY >= 80000) & \
(employee.BASE_SALARY <= 120000)
In[53]: criteria_final = criteria_dept & criteria_gender & criteria_sal
In[54]: select_columns = ['UNIQUE_ID', 'DEPARTMENT', 'GENDER', 'BASE_SALARY']
employee.loc[criteria_final, select_columns].head()
Out[54]:
更多
# 使用between选取80000到120000之间的薪水
In[55]: criteria_sal = employee.BASE_SALARY.between(80000, 120000)
# 排除最常出现的5家单位
In[56]: top_5_depts = employee.DEPARTMENT.value_counts().index[:5]
criteria = ~employee.DEPARTMENT.isin(top_5_depts)
employee[criteria].head()
Out[56]:
功能一样的SQL语句是:
SELECT
*
FROM
EMPLOYEE
WHERE
DEPARTMENT not in
(
SELECT
DEPARTMENT
FROM (
SELECT
DEPARTMENT,
COUNT(1) as CT
FROM
EMPLOYEE
GROUP BY
DEPARTMENT
ORDER BY
CT DESC
LIMIT 5
)
);