Joining Two Tables with DataFrame

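Updated: 2023-08-15 10:39:45 Author: 我只是個過路人

This article shows how to join two tables with DataFrame. It is offered as a reference in the hope that it helps; corrections for any errors or overlooked cases are welcome.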

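Joining two tables with DataFrame

Join query: a query that contains a join operation is called a join query.

Join queries include equi-joins, natural joins, outer joins, inner joins, left joins, self-joins, and so on.

I'm leaving a gap here and will fill it in as I learn more.

Joining pandas DataFrames is not a join query in the strict sense; the operations on two DataFrames simply produce the same effect as a join query.

Create the data with DataFrame from the pandas library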

import pandas as pd

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                       'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

Suppose other is one table and caller is another. The equi-join can then be written in SQL as

select other.*, caller.*
from other, caller
where other.key = caller.key

The same result with pandas.DataFrame.join

caller.join(other.set_index('key'),on='key',how='left').dropna()

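With pd.merge

# Join on the 'key' column only; left_on/right_on should not be combined
# with left_index/right_index.
# Left join, then drop the caller rows that found no match in other
pd.merge(left=caller, right=other, how='left', left_on='key', right_on='key').dropna()

# An inner join on 'key' gives the equi-join directly
pd.merge(left=caller, right=other, how='inner', left_on='key', right_on='key')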
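
As a quick check that both routes give the equi-join, the sketch below rebuilds the caller and other tables from above and runs them through DataFrame.join and pd.merge. The variable names via_join and via_merge are only illustrative.

```python
import pandas as pd

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                      'B': ['B0', 'B1', 'B2']})
caller = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                       'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})

# Route 1: left join against other's index, then drop the unmatched rows.
via_join = caller.join(other.set_index('key'), on='key', how='left').dropna()

# Route 2: inner merge on the shared 'key' column.
via_merge = pd.merge(caller, other, how='inner', on='key')

print(via_join['key'].tolist())   # ['K0', 'K1', 'K2']
print(via_merge['key'].tolist())  # ['K0', 'K1', 'K2']
```

Only K0, K1 and K2 appear in both tables, so both results keep exactly those three rows.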

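DataFrame join operations in pandas, and the difference between merge and join

For ease of maintenance, companies generally store their data in the database split across tables: for example, one table holding every user's basic information and another holding each user's purchases.

As a result, everyday data processing often requires joining two tables. In SQL this operation is join; in Pandas it is done with merge.

As the introduction above says, merge joins two tables, so the join naturally has to match up each user's information row by row.

The two tables being joined therefore need a shared key that identifies the user, which is the column named by the on parameter.

In short, merge is a process of matching up corresponding records. The four merge types described below are 'inner', 'left', 'right', and 'outer'.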
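
To make the four types concrete, here is a minimal sketch. The users and spend tables and their values are made up for illustration and do not come from the article's data.

```python
import pandas as pd

# Hypothetical tables: basic user info and user spending.
users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Ann', 'Bob', 'Cid']})
spend = pd.DataFrame({'user_id': [2, 3, 4], 'amount': [20.0, 30.0, 40.0]})

kept_ids = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(users, spend, on='user_id', how=how)
    kept_ids[how] = sorted(merged['user_id'].tolist())
    print(how, kept_ids[how])

# inner [2, 3]          keys present in both tables
# left  [1, 2, 3]       every key from the left table
# right [2, 3, 4]       every key from the right table
# outer [1, 2, 3, 4]    every key from either table
```

Rows without a partner in the other table are kept by 'left', 'right' and 'outer', with NaN filling the columns that have no match.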

merge parameters explained

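merge(
    left,            # left table
    right,           # right table
    how="inner",     # join type: inner, left, right or outer; defaults to inner
    on=None,
    # on: name(s) of the column(s) to join on, whatever the join type
    #     (outer, inner, left or right).
    #     Defaults to None, in which case merge() joins on the columns whose
    #     names appear in both DataFrames.
    #     Columns given in on must exist in both DataFrames, otherwise an error is raised.
    #     on may also list several columns; rows then match only when the values
    #     in all of those columns are equal.
    left_on,         # column name(s) in the left table to join on
    right_on,        # column name(s) in the right table to join on
    # With on, the named columns must exist in both DataFrames.
    # merge() also lets each DataFrame name its own join columns, in which case
    # a column does not have to exist in both.
    # When left_on and right_on name the same column, the result matches using on.
    left_index,      # whether to use the left table's row index as the join key, default False
    right_index,     # whether to use the right table's row index as the join key, default False
    sort,            # default False; when True, sorts the result by the join keys
    copy,            # default True (always copy the data); False can improve performance
    suffixes,        # suffixes appended to overlapping column names, default ('_x', '_y')
    indicator,       # when True, adds a _merge column showing which table each row came from
    validate=None,
    # validate: checks that the join keys of the two DataFrames have the stated relationship:
    #     one_to_one, one_to_many, many_to_one or many_to_many.
    #     Defaults to None, in which case no check is performed.
)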
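
The sketch below exercises a few of these parameters together: left_on/right_on, suffixes, indicator and validate. The two small tables and their column names (id, ref_id, value) are assumptions made for the example.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical tables whose key columns have different names
# and which share a non-key column called 'value'.
left = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
right = pd.DataFrame({'ref_id': [2, 3, 3], 'value': ['x', 'y', 'z']})

merged = pd.merge(left, right, how='left',
                  left_on='id', right_on='ref_id',
                  suffixes=('_left', '_right'),  # rename the clashing 'value' columns
                  indicator=True)                # add the _merge column

print(merged.columns.tolist())
print(merged['_merge'].astype(str).tolist())
# ['left_only', 'both', 'both', 'both']

# validate raises when the keys do not have the stated relationship:
# ref_id contains 3 twice, so this is not a one-to-one merge.
validate_failed = False
try:
    pd.merge(left, right, left_on='id', right_on='ref_id', validate='one_to_one')
except MergeError as exc:
    validate_failed = True
    print('validate rejected the merge:', exc)
```

Here id 1 has no partner and is marked left_only, while id 3 matches two rows on the right and so appears twice.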

Create two DataFrames

import pandas as pd

dishes_info = pd.read_csv("./dishes_info.csv")
order_sample = pd.read_csv("./order_sample.csv")
print(dishes_info)
print(order_sample)

dishes_info:

order_sample:

We use the merge method to join the two tables on the dishes_id column, as a left join

data = pd.merge(dishes_info, order_sample, on="dishes_id", how="left")

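The new table data looks like this

The above is the usual approach: join on a column that both tables have (same column name, same type).

Sometimes the merge key is not a DataFrame column but the index. Setting left_index and right_index to True makes the indexes the basis of the merge.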

data = pd.merge(dishes_info, order_sample, how="left", left_index=True, right_index=True)

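The resulting data looks like this:

Joining these two tables on the index, as done here, is meaningless; they are not meant to be joined that way (a meaningful join would be on dishes_id). It is only here to illustrate what the left_index and right_index options do.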
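
For a case where an index join does make sense, here is a small sketch in which the index itself carries the key. The prices, stock and orders tables and the dish codes are invented for illustration.

```python
import pandas as pd

# Hypothetical tables indexed by a dish code.
prices = pd.DataFrame({'price': [12.0, 8.5, 30.0]},
                      index=pd.Index(['d01', 'd02', 'd03'], name='dish'))
stock = pd.DataFrame({'in_stock': [True, False]},
                     index=pd.Index(['d01', 'd03'], name='dish'))

# Both sides use their index as the join key.
by_index = pd.merge(prices, stock, how='left',
                    left_index=True, right_index=True)
print(by_index)

# The two styles can be mixed: a column on the left against the index on the right.
orders = pd.DataFrame({'order_id': [1, 2], 'dish': ['d03', 'd01']})
mixed = pd.merge(orders, prices, how='left',
                 left_on='dish', right_index=True)
print(mixed)
```

In by_index, d02 has no row in stock, so its in_stock value is missing; in mixed, each order picks up the price of its dish.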

How to use the join function in pandas

join(
    other,            # DataFrame, Series, or list of DataFrame: the other object(s) to join with
    on=None,          # column(s) of the calling DataFrame to join on, matched against other's index; similar to ON in SQL
    how="left",       # {'left', 'right', 'outer', 'inner'}, default 'left'; similar to the SQL join types
    lsuffix="",       # suffix for overlapping columns from the left DataFrame
    rsuffix="",       # suffix for overlapping columns from the right DataFrame
    sort=False        # default False; when True, sorts the result by the join key
)

A DataFrame's join() method behaves like merge() with left_index and right_index set to True.
join() is better suited to combining on the index; it can combine several DataFrame objects that share an index but have different columns.
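
A minimal sketch of both points, using three made-up tables (names, prices, ratings) that share the same index:

```python
import pandas as pd

# Hypothetical tables with the same index and different columns.
idx = pd.Index(['d01', 'd02', 'd03'], name='dish')
names = pd.DataFrame({'name': ['soup', 'rice', 'fish']}, index=idx)
prices = pd.DataFrame({'price': [12.0, 8.5, 30.0]}, index=idx)
ratings = pd.DataFrame({'rating': [4.5, 3.9, 4.8]}, index=idx)

# join() accepts a list, so several index-aligned tables combine in one call.
combined = names.join([prices, ratings])
print(combined.columns.tolist())  # ['name', 'price', 'rating']

# For two tables, join() matches merge() on both indexes with how='left'.
joined = names.join(prices)
merged = pd.merge(names, prices, how='left',
                  left_index=True, right_index=True)
print(joined.equals(merged))
```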

Below I join on the row index. The error says the columns overlap: both tables have a dishes_id column, and I had not renamed it.

data = dishes_info.join(order_sample)         # raises an error because of the duplicate column name
# The error message follows
"""
dishes_id
ValueError                                Traceback (most recent call last)
<ipython-input-18-8bc025c8fee6> in <module>()
      1 # A DataFrame's join() method behaves like merge() with left_index & right_index set to True
      2 # join() is better suited to combining on the index; it can combine several DataFrame objects that share an index but have different columns.
----> 3 data = dishes_info.join(order_sample)         # raises an error because of the duplicate column name dishes_id
      4 # join joins on the row index by default, so we set both tables' row index to the dishes_id column before joining
      5 # dishes_info.set_index("dishes_id", inplace=True)     # set_index does not modify the original data by default; pass inplace=True to keep the change
D:\Destination\lib\site-packages\pandas\core\frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   6334         # For SparseDataFrame's benefit
   6335         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336                                  rsuffix=rsuffix, sort=sort)
   6337 
   6338     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',
D:\Destination\lib\site-packages\pandas\core\frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   6349             return merge(self, other, left_on=on, how=how,
   6350                          left_index=on is None, right_index=True,
-> 6351                          suffixes=(lsuffix, rsuffix), sort=sort)
   6352         else:
   6353             if on is not None:
D:\Destination\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     60                          copy=copy, indicator=indicator,
     61                          validate=validate)
---> 62     return op.get_result()
     63 
     64 
D:\Destination\lib\site-packages\pandas\core\reshape\merge.py in get_result(self)
    572 
    573         llabels, rlabels = items_overlap_with_suffix(ldata.items, lsuf,
--> 574                                                      rdata.items, rsuf)
    575 
    576         lindexers = {1: left_indexer} if left_indexer is not None else {}
D:\Destination\lib\site-packages\pandas\core\internals.py in items_overlap_with_suffix(left, lsuffix, right, rsuffix)
   5242         if not lsuffix and not rsuffix:
   5243             raise ValueError('columns overlap but no suffix specified: '
-> 5244                              '{rename}'.format(rename=to_rename))
   5245 
   5246         def lrenamer(x):
ValueError: columns overlap but no suffix specified: Index(['dishes_id'], dtype='object')
"""

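The error can be reproduced on a small scale, and the message itself points at one way out: supplying lsuffix/rsuffix. The sketch below uses two invented tables; the dishes_name and counts columns are assumptions and not the article's real CSV columns.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables; both contain dishes_id.
dishes = pd.DataFrame({'dishes_id': [1, 2], 'dishes_name': ['soup', 'rice']})
orders = pd.DataFrame({'dishes_id': [2, 1], 'counts': [3, 1]})

# Without suffixes, the overlapping column name makes join() raise ValueError.
overlap_error = None
try:
    dishes.join(orders)
except ValueError as exc:
    overlap_error = str(exc)
    print('join failed:', overlap_error)

# Suffixes remove the name clash, but the rows are still paired by row index.
suffixed = dishes.join(orders, lsuffix='_info', rsuffix='_order')
print(suffixed.columns.tolist())
# ['dishes_id_info', 'dishes_name', 'dishes_id_order', 'counts']
```

Note that suffixes only silence the error: row 0 of dishes (dishes_id 1) is still paired with row 0 of orders (dishes_id 2), which is usually not the match that is wanted here.
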
The correct way to use join here:

join joins on the row index by default, so we set both tables' row index to the dishes_id column before joining

dishes_info.set_index("dishes_id", inplace=True)     # set_index does not modify the original data by default; pass inplace=True to keep the change
order_sample.set_index("dishes_id", inplace=True)

The dishes_info data: