Chapter 9: Combining Pandas Objects - 4. Differences between concat, join, and merge - Pandas Cookbook (annotated source code)

4. Differences between concat, join, and merge

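concat:

A pandas function
Combines two or more pandas objects vertically or horizontally
Aligns only on index labels (row or column, depending on the axis), never on column values
Raises an error when it has to align on an index that contains duplicate values
Defaults to an outer join (an inner join is also available)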
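The concat behavior listed above can be sketched on toy data. The frames `a` and `b` and their labels are hypothetical and are not the stock tables used later in this section.

```python
import pandas as pd

# Hypothetical toy data, not the stock tables used in this chapter.
a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [3, 4]}, index=['r2', 'r3'])

# Vertical: rows are stacked; keys adds an outer index level naming each piece.
stacked = pd.concat([a, b], keys=['a', 'b'])

# Horizontal: rows are aligned on the index; the default is an outer join,
# so labels missing from one frame are filled with NaN.
outer = pd.concat([a, b], axis='columns', keys=['a', 'b'])

# join='inner' keeps only the labels present in every object.
inner = pd.concat([a, b], axis='columns', keys=['a', 'b'], join='inner')

print(stacked)
print(outer)
print(inner)
```

Only `r2` appears in both frames, so it is the only row that survives the inner join.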

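join:

A DataFrame method
Combines two or more pandas objects horizontally only
Aligns the calling DataFrame's column(s) or index with the other objects' index (never with their columns)
Handles duplicate values in the joining columns or index with a Cartesian product
Defaults to a left join (inner, outer, and right joins are also available)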
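A minimal sketch of the join rules above, assuming two hypothetical frames `left` and `right`. The calling frame contributes a column, the other frame contributes its index, and the duplicated label `'a'` in `right` is matched once per occurrence.

```python
import pandas as pd

# Hypothetical toy data.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'rval': [10, 20, 30]}, index=['a', 'a', 'd'])

# Align the calling frame's 'key' column with the other frame's index.
# Default how='left': every row of `left` is kept, and 'a' matches two
# rows of `right`, so it appears twice in the result.
joined = left.join(right, on='key')

# how='inner' drops the rows of `left` that have no match in `right`.
joined_inner = left.join(right, on='key', how='inner')

print(joined)
print(joined_inner)
```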

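merge:

A DataFrame method
Combines exactly two DataFrames horizontally
Aligns the calling DataFrame's column(s) or index with the other DataFrame's column(s) or index
Handles duplicate values in the joining columns or index with a Cartesian product
Defaults to an inner join (left, outer, and right joins are also available)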
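The merge rules above can be illustrated the same way. The frames below are hypothetical; the sketch shows the default inner join, an outer join, and a column-to-index alignment using `left_on` together with `right_index`.

```python
import pandas as pd

# Hypothetical toy data.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [10, 20, 30]})

# Default how='inner': only keys found in both frames are kept.
inner = left.merge(right, on='key')

# how='outer': keys from either frame are kept, with NaN where unmatched.
outer = left.merge(right, on='key', how='outer')

# A column of the calling frame aligned against the index of the other frame.
right_indexed = right.set_index('key')
mixed = left.merge(right_indexed, left_on='key', right_index=True)

print(inner)
print(outer)
print(mixed)
```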

#  The user-defined display_frames function takes a list of DataFrames and displays them side by side in a single row:
 In[91]: from IPython.display import display_html
         years = 2016, 2017, 2018
         stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year), index_col='Symbol') 
                         for year in years]
         def display_frames(frames, num_spaces=0):
             t_style = '<table style="display: inline;"'
             tables_html = [df.to_html().replace('<table', t_style) for df in frames]
             space = '&nbsp;' * num_spaces
             display_html(space.join(tables_html), raw=True)
         display_frames(stock_tables, 30)
         stocks_2016, stocks_2017, stocks_2018 = stock_tables

#  Of the three, concat is the only one that can combine DataFrames vertically
 In[92]: pd.concat(stock_tables, keys=[2016, 2017, 2018])
Out[92]:

#  concat can also combine DataFrames horizontally
 In[93]: pd.concat(dict(zip(years,stock_tables)), axis='columns')
Out[93]:

#  Combine the DataFrames with join; when column names overlap, set lsuffix or rsuffix to tell them apart
 In[94]: stocks_2016.join(stocks_2017, lsuffix='_2016', rsuffix='_2017', how='outer')
Out[94]:

 In[95]: stocks_2016
Out[95]:

#  To reproduce the earlier concat result, pass a list of DataFrames to join
 In[96]: other = [stocks_2017.add_suffix('_2017'), stocks_2018.add_suffix('_2018')]
         stocks_2016.add_suffix('_2016').join(other, how='outer')
Out[96]:

#  Check whether these approaches produce identical results
 In[97]: stock_join = stocks_2016.add_suffix('_2016').join(other, how='outer')
         stock_concat = pd.concat(dict(zip(years,stock_tables)), axis='columns')
 In[98]: stock_concat.columns = stock_concat.columns.get_level_values(1) + '_' + \
                                     stock_concat.columns.get_level_values(0).astype(str)
 In[99]: stock_concat
Out[99]:

#  merge combines only two DataFrames at a time, so reproducing the result takes two chained calls
 In[100]: step1 = stocks_2016.merge(stocks_2017, left_index=True, right_index=True,
                                    how='outer', suffixes=('_2016', '_2017'))
          stock_merge = step1.merge(stocks_2018.add_suffix('_2018'), 
                                    left_index=True, right_index=True, how='outer')
          stock_concat.equals(stock_merge)
Out[100]: True
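The equivalence checked above can be reproduced without the CSV files. This is a minimal sketch on two hypothetical yearly tables (`y16`, `y17`, with made-up share counts). Each result is sorted by index before comparison, on the assumption that concat may keep labels in order of appearance while outer joins sort them.

```python
import pandas as pd

# Hypothetical yearly tables indexed by ticker symbol; the numbers are made up.
y16 = pd.DataFrame({'Shares': [80, 50]}, index=['AAPL', 'TSLA'])
y17 = pd.DataFrame({'Shares': [40, 100]}, index=['AAPL', 'GE'])

# concat: the dict keys become the outer column level, which is then flattened
# into names such as 'Shares_2016'.
via_concat = pd.concat({2016: y16, 2017: y17}, axis='columns')
via_concat.columns = (via_concat.columns.get_level_values(1) + '_'
                      + via_concat.columns.get_level_values(0).astype(str))

# join and merge: overlapping column names are separated with suffixes.
via_join = y16.join(y17, lsuffix='_2016', rsuffix='_2017', how='outer')
via_merge = y16.merge(y17, left_index=True, right_index=True,
                      how='outer', suffixes=('_2016', '_2017'))

# Sort the rows so that label order does not affect the comparison.
via_concat, via_join, via_merge = (df.sort_index()
                                   for df in (via_concat, via_join, via_merge))

print(via_concat.equals(via_join), via_concat.equals(via_merge))
```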

#  Look at two small datasets, food_prices and food_transactions
 In[101]: names = ['prices', 'transactions']
          food_tables = [pd.read_csv('data/food_{}.csv'.format(name)) for name in names]
          food_prices, food_transactions = food_tables
          display_frames(food_tables, 30)

#  Merge food_transactions with food_prices on the keys item and store
 In[102]: food_transactions.merge(food_prices, on=['item', 'store'])
Out[102]:

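#  steak appears twice in each table, so the merge forms a Cartesian product and the result contains four steak rows; coconut has no matching price, so it is missing from the result
#  Next, merge only the 2017 prices, using a left join so that every transaction is kept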
 In[103]: food_transactions.merge(food_prices.query('Date == 2017'), how='left')
Out[103]:
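The effect of duplicate keys described above can be reproduced on miniature tables. The `prices` and `transactions` frames below are hypothetical stand-ins with made-up values, not the contents of the CSV files.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; the values are made up.
prices = pd.DataFrame({'item': ['steak', 'steak', 'pear'],
                       'store': ['B', 'B', 'A'],
                       'Date': [2017, 2018, 2017],
                       'price': [6.5, 7.0, 1.0]})
transactions = pd.DataFrame({'item': ['steak', 'steak', 'pear', 'coconut'],
                             'store': ['B', 'B', 'A', 'B'],
                             'quantity': [3, 1, 5, 2]})

# Inner merge on both keys: 2 steak transactions x 2 steak prices = 4 rows,
# and coconut is dropped because it has no price.
both_years = transactions.merge(prices, on=['item', 'store'])

# Restrict prices to 2017 and use a left join: one price per key, and
# coconut is kept with NaN in the price columns.
only_2017 = transactions.merge(prices.query('Date == 2017'), how='left')

print(both_years)
print(only_2017)
```

When `on` is omitted, merge uses the column names the two frames have in common, which here are `item` and `store`.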

#  To reproduce this with join, the food_prices columns used for joining must first be moved into the index
 In[104]: food_prices_join = food_prices.query('Date == 2017').set_index(['item', 'store'])
          food_prices_join
Out[104]:

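#  join aligns only on the index of the DataFrame passed in, but it can align on either the index or the columns of the calling DataFrame;
#  to align on columns of the calling DataFrame, pass them to the on parameter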
 In[105]: food_transactions.join(food_prices_join, on=['item', 'store'])
Out[105]:
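As a sketch of why the join call above reproduces the left merge, the snippet below builds both results from hypothetical miniature tables (made-up values) and compares them.

```python
import pandas as pd

# Hypothetical miniature tables; the values are made up.
prices = pd.DataFrame({'item': ['steak', 'steak', 'pear'],
                       'store': ['B', 'B', 'A'],
                       'Date': [2017, 2018, 2017],
                       'price': [6.5, 7.0, 1.0]})
transactions = pd.DataFrame({'item': ['steak', 'pear', 'coconut'],
                             'store': ['B', 'A', 'B'],
                             'quantity': [3, 5, 2]})

prices_2017 = prices.query('Date == 2017')

# merge: both frames contribute their 'item' and 'store' columns.
via_merge = transactions.merge(prices_2017, how='left')

# join: the other frame must carry the keys in its index, while the
# calling frame names its key columns through `on`.
via_join = transactions.join(prices_2017.set_index(['item', 'store']),
                             on=['item', 'store'])

print(via_join.equals(via_merge))
```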

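#  To use concat, the item and store columns must be moved into the index of both DataFrames. However, because the resulting index contains duplicate values, concat raises an error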
 In[106]: pd.concat([food_transactions.set_index(['item', 'store']), 
                     food_prices.set_index(['item', 'store'])], axis='columns')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-106-8aa3223bf3d1> in <module>()
      1 pd.concat([food_transactions.set_index(['item', 'store']), 
----> 2            food_prices.set_index(['item', 'store'])], axis='columns')
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, copy)
    205                        verify_integrity=verify_integrity,
    206                        copy=copy)
--> 207     return op.get_result()
    208 
    209 
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/reshape/concat.py in get_result(self)
    399                     obj_labels = mgr.axes[ax]
    400                     if not new_labels.equals(obj_labels):
--> 401                         indexers[ax] = obj_labels.reindex(new_labels)[1]
    402 
    403                 mgrs_indexers.append((obj._data, indexers))
/Users/Ted/anaconda/lib/python3.6/site-packages/pandas/core/indexes/multi.py in reindex(self, target, method, level, limit, tolerance)
   1861                                                tolerance=tolerance)
   1862                 else:
-> 1863                     raise Exception("cannot handle a non-unique multi-index!")
   1864 
   1865         if not isinstance(target, MultiIndex):
Exception: cannot handle a non-unique multi-index!
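The exact exception type and message depend on the pandas version. One way to anticipate the failure, sketched here on hypothetical miniature tables, is to check each index for uniqueness before calling concat horizontally.

```python
import pandas as pd

# Hypothetical miniature tables with made-up values, indexed by (item, store).
prices = pd.DataFrame({'item': ['steak', 'steak', 'pear'],
                       'store': ['B', 'B', 'A'],
                       'price': [6.5, 7.0, 1.0]}).set_index(['item', 'store'])
transactions = pd.DataFrame({'item': ['steak', 'pear'],
                             'store': ['B', 'A'],
                             'quantity': [3, 5]}).set_index(['item', 'store'])

frames = {'transactions': transactions, 'prices': prices}

# Names of the frames whose index contains repeated labels.
non_unique = [name for name, df in frames.items() if not df.index.is_unique]

# Rows whose (item, store) label occurs more than once.
repeated = prices[prices.index.duplicated(keep=False)]

print(non_unique)
print(repeated)
```

If any frame is reported, merge or join is the safer choice, since both handle duplicate keys with a Cartesian product instead of raising.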

#  The glob function in the glob module returns a list of the file paths (as strings) that match a pattern; each path can be passed directly to read_csv
 In[107]: import glob
          df_list = []
          for filename in glob.glob('data/gas prices/*.csv'):
              df_list.append(pd.read_csv(filename, index_col='Week', parse_dates=['Week']))
          gas = pd.concat(df_list, axis='columns')
          gas.head()
Out[107]:
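The same pattern can be tried without the gas price files. This sketch writes two tiny hypothetical CSV files to a temporary directory (the file names and values are made up) and sorts the paths, since glob.glob does not guarantee any particular order.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write two tiny hypothetical CSV files that share a 'Week' column.
    for name, values in [('All Grades', [2.0, 2.1]), ('Diesel', [2.5, 2.6])]:
        pd.DataFrame({'Week': ['2017-09-18', '2017-09-25'],
                      name: values}).to_csv(os.path.join(tmp, name + '.csv'),
                                            index=False)

    # Sorting the paths makes the resulting column order reproducible.
    paths = sorted(glob.glob(os.path.join(tmp, '*.csv')))
    frames = [pd.read_csv(path, index_col='Week', parse_dates=['Week'])
              for path in paths]

# All frames share the same DatetimeIndex, so concat lines them up by week.
gas = pd.concat(frames, axis='columns')
print(gas)
```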