Python pandas Module Basics: A Detailed Guide

Updated: 2019-07-03 09:31:48  Author: To_2020_1_4

This article introduces the basics of the Python pandas module. It works through the material with detailed example code and should be a useful reference for study or work.

Pandas provides the DataFrame, a structure similar to the data frame in R. Pandas is built on NumPy, but it makes working with tabular, data-frame-style data much easier than NumPy alone.

1. Basic pandas data structures and usage

Pandas has two main data structures: Series and DataFrame. A Series is similar to a one-dimensional NumPy array with an index attached, while a DataFrame is the more commonly used two-dimensional, table-like structure.

Creating a Series

>>>import numpy as np
>>>import pandas as pd
>>>s=pd.Series([1,2,3,np.nan,44,1]) # np.nan creates a missing value
>>>s # if no index is given, Series builds one automatically; here it is 0-5
0   1.0
1   2.0
2   3.0
3   NaN
4  44.0
5   1.0
dtype: float64
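
The automatic 0-5 index above can be replaced by labels of your own. The following is a minimal sketch, assuming illustrative labels 'x', 'y', 'z' that do not come from the original article; it shows label-based access, position-based access, and counting missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical labels chosen only for illustration
s = pd.Series([1, 2, np.nan], index=['x', 'y', 'z'])

by_label = s['y']               # access by explicit label
by_position = s.iloc[0]         # access by integer position
n_missing = s.isnull().sum()    # number of NaN entries

print(by_label, by_position, n_missing)
```

Because the Series contains NaN, its dtype is float64, just as in the example above.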

Creating a DataFrame

>>>dates=pd.date_range('20170101',periods=6)
>>>dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
        '2017-01-05', '2017-01-06'],
       dtype='datetime64[ns]', freq='D')
>>>df=pd.DataFrame(np.random.randn(6,4),index=dates,columns=['a','b','c','d'])
>>>df
           a     b     c     d
2017-01-01 -1.993447 1.272175 -1.578337 -1.972526
2017-01-02 0.092701 -0.503654 -0.540655 -0.126386
2017-01-03 0.191769 -0.578872 -1.693449 0.457891
2017-01-04 2.121120 0.521884 -0.419368 -1.916585
2017-01-05 1.642063 0.222134 0.108531 -1.858906
2017-01-06 0.636639 0.487491 0.617841 -1.597920

Like a NumPy array, a DataFrame lets you pull out data by index, but it offers more ways to do so: you can index by the default integer row and column positions, or by row and column labels.

You can also create a DataFrame from a dictionary:

>>>df2=pd.DataFrame({'a':1,'b':'hello kitty','c':np.arange(2),'d':['o','k']})
>>>df2
   a      b c d
0 1 hello kitty 0 o
1 1 hello kitty 1 k

Several attributes and methods let you inspect a DataFrame:

dtypes # data type of each column
index # row labels (the index)
columns # column labels
values # the data in the frame, without the index and header
describe # summary statistics for each column (extremes, mean, median and so on); by default only numeric columns are summarized
transpose # transpose; the T attribute does the same
sort_index # sort by row or column labels
sort_values # sort by data values

Some examples

>>>df2.dtypes
a   int64
b  object
c   int64
d  object
dtype: object
>>>df2.index
RangeIndex(start=0, stop=2, step=1)
>>>df2.columns
Index(['a', 'b', 'c', 'd'], dtype='object')
>>>df2.values
array([[1, 'hello kitty', 0, 'o'],
    [1, 'hello kitty', 1, 'k']], dtype=object)
>>>df2.describe() # by default only numeric columns are summarized
     a     c
count 2.0 2.000000
mean  1.0 0.500000
std  0.0 0.707107
min  1.0 0.000000
25%  1.0 0.250000
50%  1.0 0.500000
75%  1.0 0.750000
max  1.0 1.000000
>>>df2.T
       0      1
a      1      1
b hello kitty hello kitty
c      0      1
d      o      k
>>>df2.sort_index(axis=1,ascending=False) # axis=1: sort column labels in descending order
   d c      b a
0 o 0 hello kitty 1
1 k 1 hello kitty 1
>>>df2.sort_index(axis=0,ascending=False) # sort row labels in descending order
   a      b c d
1 1 hello kitty 1 k
0 1 hello kitty 0 o
>>>df2.sort_values(by="c",ascending=False) # sort by the values in column c, descending
   a      b c d
1 1 hello kitty 1 k
0 1 hello kitty 0 o

2. Selecting data from a DataFrame

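There are several ways to pull the data you want out of a DataFrame. The common ones are:

 - indexing directly
 - selecting by row and column labels: loc
 - selecting by integer row and column positions: iloc
 - mixing labels and positions to pick specific cells: ix (deprecated in pandas 0.20 and removed in pandas 1.0)
 - filtering with a boolean condition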

Simple selection

>>>import numpy as np
>>>import pandas as pd
>>>dates=pd.date_range('20170101',periods=6)
>>>df=pd.DataFrame(np.arange(24).reshape((6,4)),index=dates,columns=['a','b','c','d'])
>>>df
        a  b  c  d
2017-01-01  0  1  2  3
2017-01-02  4  5  6  7
2017-01-03  8  9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
>>>df['a']     # select column a directly by label; df.a gives the same result
2017-01-01   0
2017-01-02   4
2017-01-03   8
2017-01-04  12
2017-01-05  16
2017-01-06  20
Freq: D, Name: a, dtype: int64
>>>df[0:3]  # select the first 3 rows; the label slice df['2017-01-01':'2017-01-03'] gives the same result, but this syntax cannot slice out multiple columns
       a b  c  d
2017-01-01 0 1  2  3
2017-01-02 4 5  6  7
2017-01-03 8 9 10 11

loc: selecting data by explicit labels

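Rows in a DataFrame can be addressed in two ways: by their explicit row labels or by their implicit integer positions. loc selects rows by label and can be combined with column labels to pick out specific cells. Label slices with loc include both endpoints.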

>>>df.loc['2017-01-01':'2017-01-03']
       a b  c  d
2017-01-01 0 1  2  3
2017-01-02 4 5  6  7
2017-01-03 8 9 10 11
>>>df.loc['2017-01-01',['a','b']]  # select columns a and b of one particular row
a  0
b  1
Name: 2017-01-01 00:00:00, dtype: int64

iloc: selecting data by implicit integer positions

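iloc takes integer row positions and, optionally, integer column positions, which makes it easy to pick out specific cells. As with Python slices, the end position is excluded.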

>>>df.iloc[3,1]
13
>>>df.iloc[1:3,2:4]
        c  d
2017-01-02  6  7
2017-01-03 10 11

ix: mixing explicit labels and implicit positions

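loc accepts only explicit labels and iloc accepts only integer positions, whereas ix allowed the two to be mixed. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the example below only runs on older versions.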

>>> df.ix[3:5,['a','b']]
       a  b
2017-01-04 12 13
2017-01-05 16 17
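
Because ix is not available in pandas 1.0 and later, the same selection has to be written with iloc and loc. The following is a minimal sketch of two possible rewrites, not the article's own method; the names res_a and res_b are illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# Option 1: pick rows by position first, then columns by label
res_a = df.iloc[3:5][['a', 'b']]

# Option 2: turn the row positions into labels and use loc only
res_b = df.loc[df.index[3:5], ['a', 'b']]

print(res_a)
```

Both options return the rows for 2017-01-04 and 2017-01-05 with columns a and b. The second form is usually preferable when the result will be assigned to, since it is a single indexing operation.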

Selecting data with a boolean condition

>>>df
        a  b  c  d
2017-01-01  0  1  2  3
2017-01-02  4  5  6  7
2017-01-03  8  9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
>>>df[df['a']>5] # equivalent to df[df.a>5]
        a  b  c  d
2017-01-03  8  9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
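
Conditions can also be combined. The following is a minimal sketch, assuming the same df as above; the thresholds and the names both, either and picked are chosen only for illustration. Each condition needs its own parentheses, and the element-wise operators & and | are used rather than Python's and / or.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# Rows where a > 5 AND d < 20
both = df[(df['a'] > 5) & (df['d'] < 20)]

# Rows where a < 5 OR a > 15
either = df[(df['a'] < 5) | (df['a'] > 15)]

# Rows whose b value is one of a given set
picked = df[df['b'].isin([1, 9])]

print(both)
```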

3. Setting values at specific positions

>>>import numpy as np
>>>import pandas as pd
>>>dates=pd.date_range('20170101',periods=6)
>>>datas=np.arange(24).reshape((6,4))
>>>columns=['a','b','c','d']
>>>df=pd.DataFrame(data=datas,index=dates,columns=columns)
>>>df.iloc[2,2:4]=111 # set the values at row position 2, column positions 2 and 3, to 111
>>>df
        a  b  c  d
2017-01-01  0  1  2  3
2017-01-02  4  5  6  7
2017-01-03  8  9 111 111
2017-01-04 12 13  14  15
2017-01-05 16 17  18  19
2017-01-06 20 21  22  23
>>>df.loc[df['a']>10,'b']=0 # wherever column a is greater than 10, set column b in the same row to 0
>>>df
        a b  c  d
2017-01-01  0 1  2  3
2017-01-02  4 5  6  7
2017-01-03  8 9 111 111
2017-01-04 12 0  14  15
2017-01-05 16 0  18  19
2017-01-06 20 0  22  23
>>>df['f']=np.nan  # create a new column f filled with np.nan
>>>df
       a b  c  d  f
2017-01-01  0 1  2  3 NaN
2017-01-02  4 5  6  7 NaN
2017-01-03  8 9 111 111 NaN
2017-01-04 12 0  14  15 NaN
2017-01-05 16 0  18  19 NaN
2017-01-06 20 0  22  23 NaN
# A Series can be added as a new column in the same way; it is aligned with the DataFrame on the index
>>>df['e']=pd.Series(np.arange(6),index=dates)
>>>df
       a b  c  d  f e
2017-01-01  0 1  2  3 NaN 0
2017-01-02  4 5  6  7 NaN 1
2017-01-03  8 9 111 111 NaN 2
2017-01-04 12 0  14  15 NaN 3
2017-01-05 16 0  18  19 NaN 4
2017-01-06 20 0  22  23 NaN 5
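
When a Series is assigned as a column, pandas matches it to the DataFrame by index label rather than by position. The following is a minimal sketch, assuming a hypothetical column name g and a Series that covers only the first three dates; rows with no matching label end up as NaN.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20170101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates,
                  columns=['a', 'b', 'c', 'd'])

# Hypothetical Series covering only the first three dates
partial = pd.Series([10, 20, 30], index=dates[:3])

# Assignment aligns on the index; unmatched rows become NaN
df['g'] = partial

print(df['g'])
```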

4. Handling missing data

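Data sets sometimes contain empty or missing (NaN) values. dropna selectively removes the rows or columns that contain NaN, and fillna replaces NaN with another value. Separately, drop removes chosen rows or columns and drop_duplicates removes duplicate rows. None of these modify the original object; to keep the change, assign the result back to a variable.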

>>>import numpy as np
>>>import pandas as pd
>>>df=pd.DataFrame(np.arange(24).reshape(6,4),index=pd.date_range('20170101',periods=6),columns=['a','b','c','d'])
>>>df
       a  b  c  d
2017-01-01  0  1  2  3
2017-01-02  4  5  6  7
2017-01-03  8  9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
>>>df.iloc[1,3]=np.nan
>>>df.iloc[3,2]=np.nan
>>>df
       a  b   c   d
2017-01-01  0  1  2.0  3.0
2017-01-02  4  5  6.0  NaN
2017-01-03  8  9 10.0 11.0
2017-01-04 12 13  NaN 15.0
2017-01-05 16 17 18.0 19.0
2017-01-06 20 21 22.0 23.0
>>>df.dropna(axis=0,how='any') # axis=0 drops rows containing NaN, axis=1 drops columns
   # how='any' drops a row (or column, depending on axis) if it contains any NaN
   # how='all' drops a row (or column) only when all of its values are NaN
       a  b   c   d
2017-01-01  0  1  2.0  3.0
2017-01-03  8  9 10.0 11.0
2017-01-05 16 17 18.0 19.0
2017-01-06 20 21 22.0 23.0
>>>df.fillna(value=55)
       a  b   c   d
2017-01-01  0  1  2.0  3.0
2017-01-02  4  5  6.0 55.0
2017-01-03  8  9 10.0 11.0
2017-01-04 12 13 55.0 15.0
2017-01-05 16 17 18.0 19.0
2017-01-06 20 21 22.0 23.0
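
The text above also mentions drop and drop_duplicates without showing them. The following is a minimal sketch on a small made-up frame (the data and variable names are illustrative, not from the article). As with dropna and fillna, each call returns a new object and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Small made-up frame: rows 0 and 1 are identical
df = pd.DataFrame({'a': [1, 1, 2],
                   'b': ['x', 'x', 'y'],
                   'c': [0.5, 0.5, np.nan]})

deduped = df.drop_duplicates()   # keeps the first of each set of duplicate rows
no_c = df.drop(columns=['c'])    # remove a column by label
no_row0 = df.drop(index=[0])     # remove a row by label

print(deduped)
```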

You can also check whether the data contains any NaN, or consists entirely of NaN:

>>>np.any(df.isnull())==True
True
>>>np.all(df.isnull())==True
False
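
It is often more useful to know where the missing values are than only whether any exist. The following is a minimal sketch, assuming the same df with two NaN cells as above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=pd.date_range('20170101', periods=6),
                  columns=['a', 'b', 'c', 'd']).astype(float)
df.iloc[1, 3] = np.nan
df.iloc[3, 2] = np.nan

per_column = df.isnull().sum()        # NaN count for each column
has_any = df.isnull().values.any()    # True if at least one NaN exists
all_nan = df.isnull().values.all()    # True only if every cell is NaN

print(per_column)
```

The frame is cast to float up front in this sketch so that assigning NaN does not depend on how a given pandas version handles integer columns.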

5. Importing and exporting data

Spreadsheet data exported from Excel is usually read in as CSV with pd.read_csv(file), and a DataFrame named data is saved with data.to_csv(file).
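
The following is a minimal sketch of a CSV round trip. To stay self-contained it writes to an in-memory buffer instead of a file on disk; with a real file you would pass a path instead. The column names and values are made up for illustration.

```python
import io
import pandas as pd

data = pd.DataFrame({'name': ['x', 'y'], 'score': [1.5, 2.5]})

buffer = io.StringIO()
data.to_csv(buffer, index=False)   # index=False avoids writing the row labels
buffer.seek(0)                     # rewind before reading back

loaded = pd.read_csv(buffer)
print(loaded)
```

Without index=False, the row labels are written as an extra unnamed column, which then shows up as a data column when the file is read back.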

6. Appending and concatenating data

This section covers two simple ways to combine data in pandas: concat and append.

concat works much like NumPy's concatenate and can combine data either vertically or horizontally.

>>>import numpy as np
>>>import pandas as pd
>>> df1=pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
>>> df2=pd.DataFrame(np.ones((3,4))*1,columns=['a','b','c','d'])
>>> df3=pd.DataFrame(np.ones((3,4))*2,columns=['a','b','c','d'])
>>>res=pd.concat([df1,df2,df3],axis=0) 
# axis=0 stacks the frames vertically (row-wise); axis=1 joins them side by side (column-wise)
>>>res
    a  b  c  d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
# ignore_index=True resets the row labels
>>>res=pd.concat([df1,df2,df3],axis=0,ignore_index=True)
>>>res
    a  b  c  d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0

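The join parameter offers more control over how frames are combined. join='outer' is the default: all the data is kept, columns with the same label are merged and stacked vertically, columns that appear in only one frame get their own column, and positions with no value are filled with NaN. join='inner' stacks only the columns whose labels appear in every frame and discards the rest. In short, outer is the union and inner is the intersection.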

>>>import numpy as np
>>>import pandas as pd
>>>df1=pd.DataFrame(np.ones((3,4)),index=[1,2,3],columns=['a','b','c','d'])
>>>df2=pd.DataFrame(np.ones((3,4))*2,index=[1,2,3],columns=['b','c','d','e'])
>>>res=pd.concat([df1,df2],axis=0,join='outer')
>>>res
    a  b  c  d  e
1 1.0 1.0 1.0 1.0 NaN
2 1.0 1.0 1.0 1.0 NaN
3 1.0 1.0 1.0 1.0 NaN
1 NaN 2.0 2.0 2.0 2.0
2 NaN 2.0 2.0 2.0 2.0
3 NaN 2.0 2.0 2.0 2.0
>>>res1=pd.concat([df1,df2],axis=1,join='outer') 
 # axis=1 joins side by side: rows with the same label are aligned, other rows get their own row, and gaps are filled with NaN
>>>res1
    a  b  c  d  b  c  d  e
1 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
2 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
>>>res2=pd.concat([df1,df2],axis=0,join='inner',ignore_index=True) 
# stack the columns that share a label vertically
>>>res2
   b  c  d
0 1.0 1.0 1.0
1 1.0 1.0 1.0
2 1.0 1.0 1.0
3 2.0 2.0 2.0
4 2.0 2.0 2.0
5 2.0 2.0 2.0

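The join_axes parameter sets a reference axis: the frames are combined against that reference, and anything not in it is dropped. Note that join_axes was deprecated in pandas 0.25 and removed in pandas 1.0, so the examples below only run on older versions.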

>>>import numpy as np
>>>import pandas as pd
>>>df1=pd.DataFrame(np.ones((3,4)),index=[1,2,3],columns=['a','b','c','d'])
>>> df2=pd.DataFrame(np.ones((3,4))*2,index=[2,3,4],columns=['b','c','d','e'])
>>>res3=pd.concat([df1,df2],axis=0,join_axes=[df1.columns])
# use df1's column labels as the reference and stack matching columns vertically
>>>res3
    a  b  c  d
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
3 1.0 1.0 1.0 1.0
2 NaN 2.0 2.0 2.0
3 NaN 2.0 2.0 2.0
4 NaN 2.0 2.0 2.0
>>>res4=pd.concat([df1,df2],axis=1,join_axes=[df1.index])
# use df1's row labels as the reference and join the columns of matching rows side by side
>>>res4
    a  b  c  d  b  c  d  e
1 1.0 1.0 1.0 1.0 NaN NaN NaN NaN
2 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
3 1.0 1.0 1.0 1.0 2.0 2.0 2.0 2.0
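
On pandas 1.0 and later, where join_axes is no longer accepted, one way to get the same results is to concatenate first and then reindex against the reference labels. This is a sketch of that substitute technique under the same df1 and df2 as above, not the article's original method.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4],
                   columns=['b', 'c', 'd', 'e'])

# Stack vertically, then keep only df1's columns
res3 = pd.concat([df1, df2], axis=0).reindex(columns=df1.columns)

# Join side by side, then keep only df1's row labels
res4 = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(res3)
print(res4)
```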

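append only stacks vertically; it cannot join side by side. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the example below only runs on older versions.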

>>>df1=pd.DataFrame(np.ones((3,4)),index=[1,2,3],columns=['a','b','c','d'])
>>> df2=pd.DataFrame(np.ones((3,4))*2,index=[2,3,4],columns=['b','c','d','e'])
>>>res5=df1.append(df2,ignore_index=True)
>>>res5
    a  b  c  d  e
0 1.0 1.0 1.0 1.0 NaN
1 1.0 1.0 1.0 1.0 NaN
2 1.0 1.0 1.0 1.0 NaN
3 NaN 2.0 2.0 2.0 2.0
4 NaN 2.0 2.0 2.0 2.0
5 NaN 2.0 2.0 2.0 2.0
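
On pandas 2.0 and later, where DataFrame.append is no longer available, the same vertical stacking can be written with pd.concat. A minimal sketch, assuming the same df1 and df2 as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)), index=[1, 2, 3],
                   columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 2, index=[2, 3, 4],
                   columns=['b', 'c', 'd', 'e'])

# Equivalent of df1.append(df2, ignore_index=True)
res5 = pd.concat([df1, df2], ignore_index=True)

print(res5)
```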

7. Advanced combining in pandas: merge

merge is similar to concat, except that merge joins the rows of two data sets on one or more keys.

merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, 
sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

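Parameters:

left, right: the two DataFrames to merge
how: the type of join: inner, left (left outer join), right (right outer join) or outer (full outer join); the default is inner
on: the column name(s) to join on. They must exist in both the left and right DataFrames. If on is not given and no other key parameter is given either, the intersection of the two DataFrames' column names is used as the join key
left_on: column name(s) in the left DataFrame to use as the join key; together with right_on, this is very useful when the key columns have different names in the two frames but mean the same thing
right_on: column name(s) in the right DataFrame to use as the join key
left_index: use the row index of the left DataFrame as the join key
right_index: use the row index of the right DataFrame as the join key
sort: defaults to False, as in the signature above; when True, the merged data is sorted by the join keys. Leaving it False performs better in most cases
suffixes: a tuple of strings appended to column names that appear in both DataFrames; the default is ('_x','_y')
copy: defaults to True, which always copies the data into the result; setting it to False can improve performance in some cases
indicator: adds a _merge column showing where each row came from: the left frame only (left_only), the right frame only (right_only), or both (both)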

>>>import pandas as pd
>>>df1=pd.DataFrame({'key':['k0','k1','k2','k3'],'A':['a0','a1','a2','a3'],'B':['b0','b1','b2','b3']})
>>>df2=pd.DataFrame({'key':['k0','k1','k2','k3'],'C':['c0','c1','c2','c3'],'D':['d0','d1','d2','d3']})
>>> res=pd.merge(df1,df2,on='key',indicator=True)
>>>res
  A  B key  C  D _merge
0 a0 b0 k0 c0 d0  both
1 a1 b1 k1 c1 d1  both
2 a2 b2 k2 c2 d2  both
3 a3 b3 k3 c3 d3  both

Merging on the row index works much like merging on a key column

>>>res2=pd.merge(df1,df2,left_index=True,right_index=True,indicator=True)
>>>res2
  A  B key_x  C  D key_y _merge
0 a0 b0  k0 c0 d0  k0  both
1 a1 b1  k1 c1 d1  k1  both
2 a2 b2  k2 c2 d2  k2  both
3 a3 b3  k3 c3 d3  k3  both
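
In both examples above every key matches, so _merge is always both. The following is a minimal sketch that exercises how='outer', left_on and right_on with keys that only partly overlap; the frames and the column name id are made up for illustration.

```python
import pandas as pd

# Made-up frames whose key columns have different names
left = pd.DataFrame({'key': ['k0', 'k1', 'k2'], 'A': ['a0', 'a1', 'a2']})
right = pd.DataFrame({'id': ['k1', 'k2', 'k3'], 'C': ['c1', 'c2', 'c3']})

# Full outer join on key (left) and id (right)
res = pd.merge(left, right, how='outer',
               left_on='key', right_on='id', indicator=True)

print(res)
```

Here k0 exists only on the left and k3 only on the right, so _merge contains left_only, both and right_only; columns from the side with no match are filled with NaN.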

That is all for this article. We hope it helps with your studies, and we hope you will continue to support 腳本之家.
