9. 当多个变量被存储为列的值时进行清理

  1. # 读取restaurant_inspections数据集,将Date列的数据类型变为datetime64
  2. In[67]: inspections = pd.read_csv('data/restaurant_inspections.csv', parse_dates=['Date'])
  3. inspections.head(10)
  4. out[67]:

9. 当多个变量被存储为列的值时进行清理 - 图1

  1. # 用info列的所有值造一个新列。但是,Pandas不支持这种功能
  2. In[68]: inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
  3. ---------------------------------------------------------------------------
  4. ValueError Traceback (most recent call last)
  5. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
  6. 297 try:
  7. --> 298 codes, categories = factorize(values, sort=True)
  8. 299 except TypeError:
  9. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
  10. 559 check_nulls = not is_integer_dtype(original)
  11. --> 560 labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
  12. 561
  13. pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_labels (pandas/_libs/hashtable.c:21922)()
  14. ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
  15. During handling of the above exception, another exception occurred:
  16. NotImplementedError Traceback (most recent call last)
  17. <ipython-input-68-754f69d68d6c> in <module>()
  18. ----> 1 inspections.pivot(index=['Name', 'Date'], columns='Info', values='Value')
  19. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in pivot(self, index, columns, values)
  20. 3851 """
  21. 3852 from pandas.core.reshape.reshape import pivot
  22. -> 3853 return pivot(self, index=index, columns=columns, values=values)
  23. 3854
  24. 3855 def stack(self, level=-1, dropna=True):
  25. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/reshape/reshape.py in pivot(self, index, columns, values)
  26. 375 index = self[index]
  27. 376 indexed = Series(self[values].values,
  28. --> 377 index=MultiIndex.from_arrays([index, self[columns]]))
  29. 378 return indexed.unstack(columns)
  30. 379
  31. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/multi.py in from_arrays(cls, arrays, sortorder, names)
  32. 1098 from pandas.core.categorical import _factorize_from_iterables
  33. 1099
  34. -> 1100 labels, levels = _factorize_from_iterables(arrays)
  35. 1101 if names is None:
  36. 1102 names = [getattr(arr, "name", None) for arr in arrays]
  37. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterables(iterables)
  38. 2191 # For consistency, it should return a list of 2 lists.
  39. 2192 return [[], []]
  40. -> 2193 return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  41. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in <listcomp>(.0)
  42. 2191 # For consistency, it should return a list of 2 lists.
  43. 2192 return [[], []]
  44. -> 2193 return map(list, lzip(*[_factorize_from_iterable(it) for it in iterables]))
  45. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in _factorize_from_iterable(values)
  46. 2163 codes = values.codes
  47. 2164 else:
  48. -> 2165 cat = Categorical(values, ordered=True)
  49. 2166 categories = cat.categories
  50. 2167 codes = cat.codes
  51. /Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/categorical.py in __init__(self, values, categories, ordered, fastpath)
  52. 308
  53. 309 # FIXME
  54. --> 310 raise NotImplementedError("> 1 ndim Categorical are not "
  55. 311 "supported at this time")
  56. 312
  57. NotImplementedError: > 1 ndim Categorical are not supported at this time
  1. # 将'Name','Date', 'Info'作为所索引
  2. In[69]: inspections.set_index(['Name','Date', 'Info']).head(10)
  3. out[69]:

9. 当多个变量被存储为列的值时进行清理 - 图2

  1. # 用pivot,将info列中的值变为新的列
  2. In[70]: inspections.set_index(['Name','Date', 'Info']).unstack('Info').head()
  3. out[70]:

9. 当多个变量被存储为列的值时进行清理 - 图3

  1. # 用reset_index方法,使行索引层级与列索引相同
  2. In[71]: insp_tidy = inspections.set_index(['Name','Date', 'Info']) \
  3. .unstack('Info') \
  4. .reset_index(col_level=-1)
  5. insp_tidy.head()
  6. out[71]:

9. 当多个变量被存储为列的值时进行清理 - 图4

  1. # 除掉列索引的最外层,重命名行索引的层为None
  2. In[72]: insp_tidy.columns = insp_tidy.columns.droplevel(0).rename(None)
  3. insp_tidy.head()
  4. out[72]:

9. 当多个变量被存储为列的值时进行清理 - 图5

  1. # 使用squeeze方法,可以避免前面的多级索引
  2. In[73]: inspections.set_index(['Name','Date', 'Info']) \
  3. .squeeze() \
  4. .unstack('Info') \
  5. .reset_index() \
  6. .rename_axis(None, axis='columns')
  7. out[73]:

9. 当多个变量被存储为列的值时进行清理 - 图6

更多

  1. # pivot_table需要传入聚合函数,才能产生一个单一值
  2. In[74]: inspections.pivot_table(index=['Name', 'Date'],
  3. columns='Info',
  4. values='Value',
  5. aggfunc='first') \
  6. .reset_index()\
  7. .rename_axis(None, axis='columns')
  8. out[74]:

9. 当多个变量被存储为列的值时进行清理 - 图7