Using grouping and aggregation in pandas

Updated: 2024-03-17 16:30:15  Author: 金灰

When working with a database, we use queries to group the data in rows or columns and can then perform a different operation on each group. This article introduces grouping and aggregation in pandas for readers who want to learn more.

1. Grouping operations

Grouping and aggregation are common terms in relational databases.

When working with a database, we use queries to group the data in rows or columns, and can then apply a different operation to each group.

1.1 Grouping steps

A common situation in data analysis is to divide the data into groups according to the labels in one or more columns and then analyze each group.

For example, a website might group its registered users by gender or age in order to build a profile of its user base.

In pandas, grouping is done with the groupby() function, which is very similar to the GROUP BY operation in SQL.

Statistical functions are then applied to the resulting groups to carry out the analysis, for example aggregating, transforming, or filtering the grouped data. The process consists of three main steps:

Splitting: divide the data into groups;
Applying: apply an aggregation function to each group and perform the corresponding computation;
Combining: merge the per-group results into the final output.
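The three steps can be made concrete by carrying them out by hand and comparing the result with groupby(). This is only a sketch: the DataFrame df and its values are hypothetical and chosen so the numbers are easy to check, and the manual version is for illustration, not how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical sample data, small enough to verify by hand
df = pd.DataFrame({
    "company": ["A", "B", "A", "C", "B", "A"],
    "salary": [10, 20, 30, 40, 50, 60],
})

# 1. Split: one sub-DataFrame per distinct company
pieces = {name: part for name, part in df.groupby("company")}

# 2. Apply: compute the mean salary of each piece
means = {name: part["salary"].mean() for name, part in pieces.items()}

# 3. Combine: collect the per-group results into a single Series
manual = pd.Series(means, name="salary").rename_axis("company")

# groupby performs the same three steps in one expression
auto = df.groupby("company")["salary"].mean()

print(manual)
print(auto)
```

Both printed Series should show the same mean salary for each company.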

1.2 Basic usage

Demo code:

import pandas as pd
import numpy as np
company = ["A","B","C"]
df_data= pd.DataFrame({
    "company":[company[x] for x in np.random.randint(0,len(company),10)],
    "salary":np.random.randint(5,50,10),
    "age":np.random.randint(15,50,10)
})
print(df_data)
--------------------------
  company  salary  age
0       A      29   16
1       C      21   23
2       A      15   24
3       A      45   47
4       A      45   41
5       C      46   39
6       B      24   24
7       A      21   18
8       B      33   37
9       C      30   18

In pandas, grouping takes just one line of code. Here the dataset above is split by the company field:

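group = df_data.groupby("company")
group

# This produces a DataFrameGroupBy object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000250E4329EA0>

----------
# Converting it to a list makes the result easier to see
# (the data is random, so this sample output comes from a different run than the table above)
list(group)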
[('A',
    company  salary  age
  1       A      48   47
  3       A      30   29),
 ('B',
    company  salary  age
  0       B      30   21
  2       B      25   44
  4       B      49   18
  5       B      24   19
  7       B      11   37),
 ('C',
    company  salary  age
  6       C      35   42
  8       C       6   26
  9       C      24   45)]

The groupby process splits the original DataFrame into several sub-DataFrames, one for each distinct value of the groupby field (company here).
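Besides converting to a list, the sub-DataFrames can be visited one at a time or fetched by key. The sketch below uses a small fixed DataFrame (hypothetical values, since the article's data is random) so the result is reproducible.

```python
import pandas as pd

# Hypothetical fixed data so the result does not change between runs
df = pd.DataFrame({
    "company": ["A", "C", "A", "B", "C"],
    "salary": [29, 21, 15, 24, 30],
})

grouped = df.groupby("company")

# Iterating yields (group name, sub-DataFrame) pairs in sorted key order
for name, part in grouped:
    print(name, part["salary"].tolist())

# get_group pulls out a single group by its key
group_a = grouped.get_group("A")
print(group_a)
```

Each sub-DataFrame keeps the row labels of the original frame, which is why the rows of group A are still labelled 0 and 2.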

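1.3 Grouped aggregation

Aggregation is one of the most common operations after groupby. It can compute the sum, mean, maximum, minimum, and so on.

Operations available after grouping:

Function	Purpose
max	Maximum
min	Minimum
sum	Sum
mean	Mean
median	Median
std	Standard deviation
var	Variance
count	Count of non-null values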

# Group by company, then compute the mean of each remaining column
df_data.groupby("company").agg("mean")
---------
            salary        age
company                      
A        42.000000  40.000000
B        35.333333  36.666667
C        16.666667  25.666667
...
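Section 1.1 mentions aggregating, transforming, and filtering grouped data, but only aggregation is shown above. The following sketch contrasts the three on a small hypothetical DataFrame; the column name company_mean and the threshold of two rows are illustrative choices, not part of the original example.

```python
import pandas as pd

# Hypothetical data: companies A and B have two rows each, C has one
df = pd.DataFrame({
    "company": ["A", "A", "B", "B", "C"],
    "salary": [10, 30, 20, 40, 50],
})
grouped = df.groupby("company")["salary"]

# Aggregate: one value per group (same result as grouped.mean())
mean_per_company = grouped.agg("mean")

# Transform: one value per original row, aligned with df's index
df["company_mean"] = grouped.transform("mean")

# Filter: keep only the rows belonging to groups that pass a test
multi_person = df.groupby("company").filter(lambda g: len(g) >= 2)

print(mean_per_company)
print(df)
print(multi_person)
```

Aggregation shrinks the data to one row per group, transform keeps the original shape, and filter drops whole groups (here company C, which has a single row).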

1.4 Hands-on practice

import pandas as pd
import numpy as np

country = ["China","USA","UK"]
data = {
    "country": [country[x] for x in np.random.randint(0,len(country),10)],
    "year": np.random.randint(1990, 2000, size=10),
    "GDP": np.random.randint(25000,30000,size=10)
}
df_data = pd.DataFrame(data)
print(df_data)
-------------------
  country  year    GDP
0      UK  1996  28767
1   China  1991  25541
2      UK  1996  28251
3     USA  1992  29543
4   China  1998  28031
5      UK  1993  28510
6     USA  1996  27576
7     USA  1993  27087
8     USA  1998  28345
9      UK  1999  27247

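1 - Create a groupby object

groupby() can group along either axis, although grouping rows by the values of a column is by far the most common case. The key specified when grouping serves as the name of each group.

df_data.groupby('year')
# Returns a DataFrameGroupBy object; its repr shows the object's memory address.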

2 - Inspect the grouping result

Use the groups attribute to inspect the grouping result.

print(df_data.groupby('year').groups)
#{1990: [2, 4], 1992: [0, 7], 1994: [8], 1995: [5], 1997: [3], 1998: [1, 6, 9]}
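A reproducible sketch of the same idea, using hypothetical fixed data instead of random values, shows what groups contains alongside two related members, ngroups and size().

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({
    "country": ["UK", "China", "UK", "USA", "China"],
    "year": [1996, 1991, 1996, 1992, 1998],
})
grouped = df.groupby("year")

# .groups maps each key to the row labels that belong to it
row_labels = {key: list(labels) for key, labels in grouped.groups.items()}
print(row_labels)

# .ngroups is the number of groups; .size() is the row count per group
print(grouped.ngroups)
print(grouped.size())
```

Here rows 0 and 2 share the year 1996, so that key maps to two row labels while every other key maps to one.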

3 - Practice exercises

# Compute the mean of GDP and year for each year
df_data.groupby("year")[["GDP", "year"]].mean()
----------------------
          GDP    year
year                 
1990  29592.0  1990.0
1992  27173.5  1992.0
1993  28316.0  1993.0
1994  29401.0  1994.0
1997  26791.0  1997.0
1998  29947.0  1998.0
1999  28290.0  1999.0

# Compute the mean GDP and the median year for each country
df_data.groupby("country").agg({"GDP":"mean","year":"median"})
------------------------
             GDP    year
country                 
China    27256.0  1998.0
UK       26834.0  1993.0
USA      27220.0  1993.0

# Compute the mean and standard deviation of GDP and year for each country
df_data.groupby("country")[["GDP", "year"]].agg(["mean","std"])
-------------------------------------------------------------
                  GDP                      year          
                 mean          std         mean       std
country                                                  
China    27922.285714   727.771419  1994.571429  3.309438
UK       26188.000000          NaN  1994.000000       NaN
USA      27824.000000  2332.038164  1996.000000  0.000000
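The NaN entries in the std column appear when a group contains a single row: the sample standard deviation (pandas uses ddof=1 by default) is undefined for one observation. The sketch below reproduces this with hypothetical numbers and also shows named aggregation, which yields flat column names instead of two header levels; the names gdp_mean and gdp_std are illustrative.

```python
import pandas as pd

# Hypothetical data: the UK group has only one row
df = pd.DataFrame({
    "country": ["China", "China", "USA", "USA", "UK"],
    "GDP": [100, 200, 300, 500, 400],
})

# Several functions at once: one output column per function
stats = df.groupby("country")["GDP"].agg(["mean", "std"])
print(stats)

# A single-row group has no sample standard deviation, hence NaN
print(stats.loc["UK", "std"])

# Named aggregation: flat, self-describing column names
flat = df.groupby("country").agg(
    gdp_mean=("GDP", "mean"),
    gdp_std=("GDP", "std"),
)
print(flat)
```

If a single-row group should report 0 rather than NaN, the result can be post-processed with fillna(0), though that changes the statistical meaning.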

# Compute the mean of GDP and year for each (year, country) combination
df_data.groupby(["year", "country"])[["GDP", "year"]].mean()
-----------------------------------------------------
                  GDP    year
year country                 
1990 UK       28920.0  1990.0
     USA      27325.5  1990.0
1991 UK       26217.0  1991.0
     USA      28691.0  1991.0
1992 China    26445.0  1992.0
1995 USA      28058.0  1995.0
1996 UK       25210.0  1996.0
1999 UK       27193.0  1999.0
     USA      26850.0  1999.0
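Grouping by two keys produces a result with a two-level index (year, country). A short sketch with hypothetical fixed data shows three common ways to work with such a result: looking up one combination, flattening the index back into columns, and pivoting one level into columns.

```python
import pandas as pd

# Hypothetical fixed data
df = pd.DataFrame({
    "year": [1990, 1990, 1991, 1991, 1991],
    "country": ["USA", "UK", "USA", "UK", "UK"],
    "GDP": [100, 200, 300, 400, 600],
})

by_year_country = df.groupby(["year", "country"])["GDP"].mean()

# Look up one (year, country) combination with a tuple key
print(by_year_country.loc[(1991, "UK")])

# reset_index turns the index levels back into ordinary columns
flat = by_year_country.reset_index()
print(flat)

# unstack pivots the inner level (country) into columns
wide = by_year_country.unstack("country")
print(wide)
```

The wide form is convenient for comparing countries side by side within a year.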

# Count how many country entries appear in each year (duplicates included)
df_data.groupby("year")["country"].count()

# Count the distinct countries in each year
df_data.groupby("year")[["country"]].nunique()

# Count the distinct countries in the whole dataset
df_data["country"].nunique()

# List the distinct countries
df_data["country"].unique()
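The difference between count() and nunique() is easiest to see on data with repeated values. The following sketch uses hypothetical fixed data in which the UK appears twice in 1996.

```python
import pandas as pd

# Hypothetical fixed data with a repeated (year, country) pair
df = pd.DataFrame({
    "year": [1996, 1996, 1996, 1993, 1993],
    "country": ["UK", "UK", "USA", "USA", "China"],
})

# count: number of non-null country entries per year, duplicates included
entries_per_year = df.groupby("year")["country"].count()

# nunique: number of distinct countries per year
distinct_per_year = df.groupby("year")["country"].nunique()

# On the whole column: how many distinct values, and which ones
n_countries = df["country"].nunique()
countries = df["country"].unique()

print(entries_per_year)
print(distinct_per_year)
print(n_countries, list(countries))
```

For 1996 the two results differ: three entries but only two distinct countries.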

2. Review of basic operations

Demo code:

df_data = pd.DataFrame(
    np.random.randint(60,95,size=(6,6)),
    index=["張三","李四","王五","趙六","坤哥","凡哥"],
    columns=["Chinese","Math","English","Politics","History","Geography"]
)
print(df_data)
-----------------
    Chinese  Math  English  Politics  History  Geography
張三       62    79       94        81       68         63
李四       66    63       88        87       69         83
王五       94    62       89        60       84         71
趙六       84    85       86        76       93         74
坤哥       92    82       81        62       62         69
凡哥       70    68       71        70       62         93

Show basic information about df_data:

df_data.info()
#--
<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, 張三 to 凡哥
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Chinese    6 non-null      int32
 1   Math       6 non-null      int32
 2   English    6 non-null      int32
 3   Politics   6 non-null      int32
 4   History    6 non-null      int32
 5   Geography  6 non-null      int32
dtypes: int32(6)
memory usage: 192.0+ bytes
    
    
    
df_data.describe()
#-- (sample output from a different random run than the table above)
         Chinese       Math    English   Politics    History  Geography
count   6.000000   6.000000   6.000000   6.000000   6.000000   6.000000
mean   69.833333  75.833333  77.166667  82.000000  79.833333  71.166667
std     7.704977   6.337718  12.221566   8.694826  10.888832   8.518607
min    61.000000  66.000000  62.000000  70.000000  66.000000  62.000000
25%    66.000000  72.000000  66.500000  78.250000  71.500000  66.250000
50%    67.000000  78.500000  80.500000  80.000000  81.000000  69.000000
75%    74.000000  79.750000  87.000000  87.750000  86.750000  74.000000
max    82.000000  82.000000  89.000000  94.000000  94.000000  86.000000

2.1 Indexing and slicing

.loc[ ] is handy: it can slice both rows and columns by label.

1 - Show the first 3 rows of df_data with .iloc[ ]

df_data.iloc[:3]

2 - Select specific columns of df_data

df_data.loc[:,["Chinese","English"]]
df_data[["Chinese","English"]]

3 - Select specific rows and columns with .loc[ ]

df_data.loc[df_data.index[[0,2,4]],["Chinese","Math","English"]]

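4 - Select the rows where the Chinese score is greater than 70

# Chinese score above 70
df_data[df_data["Chinese"] > 70]
# Chinese score above 70 and Math score below 70
df_data[(df_data["Chinese"] > 70) & (df_data["Math"]< 70)]

5 - Count how many times each score occurs in the Chinese column

df_data["Chinese"].value_counts()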
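The review operations above can be tried on a small fixed table where the expected results are easy to verify. This is a sketch only: the frame scores, its row labels s1 to s4, and its values are hypothetical.

```python
import pandas as pd

scores = pd.DataFrame(
    {"Chinese": [62, 94, 84, 94], "Math": [79, 62, 85, 68]},
    index=["s1", "s2", "s3", "s4"],  # hypothetical row labels
)

# .iloc selects by position, .loc selects by label
first_two = scores.iloc[:2]
by_label = scores.loc[["s1", "s3"], ["Math"]]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
high_chinese_low_math = scores[(scores["Chinese"] > 70) & (scores["Math"] < 70)]

# value_counts reports how often each score occurs, most frequent first
counts = scores["Chinese"].value_counts()

print(first_two)
print(by_label)
print(high_chinese_low_math)
print(counts)
```

The parentheses around each condition matter because & binds more tightly than the comparison operators in Python.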

