3. Index explosion
# Read the employee dataset and use RACE as the row index
In[22]: employee = pd.read_csv('data/employee.csv', index_col='RACE')
employee.head()
Out[22]:
# Select BASE_SALARY twice, as two Series, and check whether they are the same object
In[23]: salary1 = employee['BASE_SALARY']
salary2 = employee['BASE_SALARY']
salary1 is salary2
Out[23]: True
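# The result is True: both names refer to the same object, so modifying one also changes the other. To get a brand-new, independent copy of the data, use the copy method: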
In[24]: salary1 = employee['BASE_SALARY'].copy()
salary2 = employee['BASE_SALARY'].copy()
salary1 is salary2
Out[24]: False
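The difference between a second name for the same object and an independent copy can be sketched on toy data. The Series below and the names alias and independent are hypothetical stand-ins, not part of the employee dataset. The sketch uses plain assignment rather than repeated column selection, so the identity check does not depend on how pandas caches columns.

```python
import pandas as pd

# Hypothetical toy data standing in for employee['BASE_SALARY']
salary = pd.Series([100.0, 200.0, 300.0], index=['x', 'y', 'z'])

alias = salary               # a second name bound to the same object
independent = salary.copy()  # a new object with its own data

alias_is_same = alias is salary        # identity check, like In[23]
copy_is_same = independent is salary   # identity check, like In[24]

# Editing the copy leaves the original untouched
independent.loc['x'] = -1.0
original_after_copy_edit = salary.loc['x']

# Editing through the alias edits the original, because it is the original
alias.loc['y'] = -2.0
original_after_alias_edit = salary.loc['y']

print(alias_is_same, copy_is_same)
print(original_after_copy_edit, original_after_alias_edit)
```

Only the edit made through alias shows up in salary; the edit made on the copy does not.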
# Sort one of them by its index, then compare the two to confirm that they now differ
In[25]: salary1 = salary1.sort_index()
salary1.head()
Out[25]: RACE
American Indian or Alaskan Native 78355.0
American Indian or Alaskan Native 26125.0
American Indian or Alaskan Native 98536.0
American Indian or Alaskan Native NaN
American Indian or Alaskan Native 55461.0
Name: BASE_SALARY, dtype: float64
In[26]: salary2.head()
Out[26]: RACE
Hispanic/Latino 121862.0
Hispanic/Latino 26125.0
White 45279.0
White 63166.0
White 56347.0
Name: BASE_SALARY, dtype: float64
# Add the two Series together
In[27]: salary_add = salary1 + salary2
In[28]: salary_add.head()
Out[28]: RACE
American Indian or Alaskan Native 138702.0
American Indian or Alaskan Native 156710.0
American Indian or Alaskan Native 176891.0
American Indian or Alaskan Native 159594.0
American Indian or Alaskan Native 127734.0
Name: BASE_SALARY, dtype: float64
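# Now add salary1 to itself and compare the lengths of the results: salary_add has grown from 2000 values to more than 1.17 million, while salary1 + salary1 still has 2000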
In[29]: salary_add1 = salary1 + salary1
len(salary1), len(salary2), len(salary_add), len(salary_add1)
Out[29]: (2000, 2000, 1175424, 2000)
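The same effect can be reproduced at a scale small enough to check by hand. This is a minimal sketch on made-up data: s1 and s2 are hypothetical Series that share the same labels in a different order, as salary1 and salary2 do after sorting.

```python
import pandas as pd

# Same labels, different order, with 'a' duplicated in both
s1 = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

# The indexes differ, so pandas aligns by label: every 'a' in s1
# is paired with every 'a' in s2 (2 x 2 rows), plus 1 x 1 for 'b'
mixed = s1 + s2

# The indexes are identical, so the values are added position by position
same = s1 + s1

lengths = (len(s1), len(s2), len(mixed), len(same))
print(lengths)
print(mixed)
```

Here the label 'a' contributes 2 × 2 = 4 rows and 'b' contributes 1 × 1 = 1 row, so mixed has 5 values while same keeps 3.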
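There's more
# Verify the number of values in salary_add. The Cartesian product is formed only between elements that share the same index label, so each label contributes the square of its count; summing those squares gives the total length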
In[30]: index_vc = salary1.index.value_counts(dropna=False)
index_vc
Out[30]: Black or African American 700
White 665
Hispanic/Latino 480
Asian/Pacific Islander 107
NaN 35
American Indian or Alaskan Native 11
Others 2
Name: RACE, dtype: int64
In[31]: index_vc.pow(2).sum()
Out[31]: 1175424
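The counting argument can be written out as a small helper. The function expected_aligned_length below is a hypothetical name introduced for illustration, not a pandas API. It assumes that the two indexes are not identical (identical indexes are added position by position, with no Cartesian product) and that no label is missing (NaN). Labels found on only one side contribute one row each.

```python
from collections import Counter

import pandas as pd


def expected_aligned_length(left, right):
    """Predict len(a + b) for two Series whose indexes differ.

    Hypothetical helper: a label present on both sides contributes
    (count on the left) * (count on the right) rows; a label present
    on one side only contributes its own count.
    """
    left_counts, right_counts = Counter(left), Counter(right)
    total = 0
    for label in left_counts.keys() | right_counts.keys():
        if label in left_counts and label in right_counts:
            total += left_counts[label] * right_counts[label]
        else:
            total += left_counts[label] + right_counts[label]
    return total


# Toy check against pandas itself
a = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
b = pd.Series([10, 20, 30, 40], index=['a', 'b', 'a', 'c'])
predicted = expected_aligned_length(a.index, b.index)
actual = len(a + b)

# The label counts reported in Out[30], squared and summed by hand
race_counts = [700, 665, 480, 107, 35, 11, 2]
sum_of_squares = sum(count ** 2 for count in race_counts)

print(predicted, actual, sum_of_squares)
```

When both Series carry the same labels with the same counts, as salary1 and salary2 do, the product for each label reduces to the square of its count, which is what index_vc.pow(2).sum() computes.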