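Pandas Advanced Tutorial: GroupBy Operations in Pandas
Introduction
A pandas DataFrame can be grouped much like a database table. A groupby operation generally consists of three steps: splitting the data into groups, applying a function to each group, and combining the results.
This article walks through the groupby operations available in pandas.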
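To make the three steps concrete, the following minimal sketch performs split, apply, and combine by hand and then with a single groupby call. The column names `key` and `value` and the data are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from the article.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: collect the values that belong to each key.
pieces = {key: df.loc[df["key"] == key, "value"] for key in df["key"].unique()}

# Apply: reduce each piece to a single number.
sums = {key: int(piece.sum()) for key, piece in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name="value").sort_index()

# groupby performs the same three steps in one call.
via_groupby = df.groupby("key")["value"].sum()

print(manual)
print(via_groupby)
```

Both results hold one sum per key; the groupby version is shorter and avoids the explicit Python loop.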
Splitting the data
The goal of splitting is to divide a DataFrame into groups. To group the data, the DataFrame needs columns (labels) whose values can serve as group keys:
# Imports assumed by every example in this article
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

In [61]: df
Out[61]:
     A      B         C         D
0  foo    one -0.490565 -0.233106
1  bar    one  0.430089  1.040789
2  foo    two  0.653449 -1.155530
3  bar  three -0.610380 -0.447735
4  foo    two -0.934961  0.256358
5  bar    two -0.256263 -0.661954
6  foo    one -1.132186 -0.304330
7  foo  three  2.129757  0.445744
By default, groupby splits along axis 0, that is, it groups rows. You can group by a single column or by several columns:
In [8]: grouped = df.groupby("A")

In [9]: grouped = df.groupby(["A", "B"])
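As a small sketch of the difference between the two calls, the `ngroups` attribute reports how many groups each one produces. The data below is a made-up, deterministic stand-in for the random `df` used in the article.

```python
import pandas as pd

# Hypothetical fixed data so that the result is reproducible.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1.0, 2.0, 3.0, 4.0],
    }
)

# One key column: one group per distinct value of A.
n_by_a = df.groupby("A").ngroups

# Two key columns: one group per distinct (A, B) pair that occurs in the data.
n_by_ab = df.groupby(["A", "B"]).ngroups

print(n_by_a, n_by_ab)
```

Here A has two distinct values, while three distinct (A, B) pairs occur, because ("bar", "one") appears twice.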
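MultiIndex
As of version 0.24, if a DataFrame has a MultiIndex, you can pick specific index levels to group by. The numbers below come from a different random draw than the df printed above:
In [10]: df2 = df.set_index(["A", "B"])

In [11]: grouped = df2.groupby(level=df2.index.names.difference(["B"]))

In [12]: grouped.sum()
Out[12]:
            C         D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938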
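The sketch below repeats the idea with fixed numbers, so the result can be checked by hand. It also compares grouping by an index level with grouping by the same key while it is still a column. The data is hypothetical.

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "two"],
        "C": [1, 2, 3, 4],
    }
)

# Move A and B into a two-level index, then group by the level named "A".
df2 = df.set_index(["A", "B"])
by_level = df2.groupby(level="A").sum()

# Grouping the original frame by column A gives the same sums for C.
by_column = df.groupby("A")[["C"]].sum()

print(by_level)
print(by_column)
```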
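get_group
get_group returns the rows that belong to a single group. The examples group by the column name "X" rather than the one-element list ["X"], because recent pandas versions expect a tuple key when the grouper is a list:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})

In [25]: df3.groupby("X").get_group("A")
Out[25]:
   X  Y
0  A  1
2  A  3

In [26]: df3.groupby("X").get_group("B")
Out[26]:
   X  Y
1  B  4
3  B  2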
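When the grouping uses several columns, the key passed to get_group is a tuple with one entry per column. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "two", "one"],
        "C": [1, 2, 3, 4],
    }
)

# The key is a tuple in the same order as the grouping columns.
bar_one = df.groupby(["A", "B"]).get_group(("bar", "one"))

print(bar_one)
```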
dropna
By default, rows whose group key is NaN are left out of the result. Setting dropna=False keeps them as a group of their own:
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]

In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])

In [29]: df_dropna
Out[29]:
   a    b  c
0  1  2.0  3
1  1  NaN  4
2  2  1.0  3
3  1  2.0  2

# Default ``dropna`` is set to True, which will exclude NaNs in keys
In [30]: df_dropna.groupby(by=["b"], dropna=True).sum()
Out[30]:
     a  c
b
1.0  2  3
2.0  2  5

# In order to allow NaN in keys, set ``dropna`` to False
In [31]: df_dropna.groupby(by=["b"], dropna=False).sum()
Out[31]:
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4
The groups attribute
A GroupBy object has a groups attribute. It is a dictionary whose keys are the group keys and whose values are the index labels of the rows in each group. Calling len() on the GroupBy object returns the number of groups.
In [34]: grouped = df.groupby(["A", "B"])

In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}

In [36]: len(grouped)
Out[36]: 6
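As a small illustration, the sketch below turns the groups mapping into plain Python lists so that it is easy to inspect. The data is hypothetical.

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [10, 20, 30]})
grouped = df.groupby("A")

# groups maps each group key to the index labels of its rows;
# converting each value to a list makes the mapping easy to print.
labels = {key: list(index) for key, index in grouped.groups.items()}
n_groups = len(grouped)

print(labels, n_groups)
```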
Index levels
For an object with a multi-level index, groupby can be told which index level to group on:
In [40]: arrays = [
   ....:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....:

In [41]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [42]: s = pd.Series(np.random.randn(8), index=index)

In [43]: s
Out[43]:
first  second
bar    one      -0.919854
       two      -0.042379
baz    one       1.247642
       two      -0.009920
foo    one       0.290213
       two       0.495767
qux    one       0.362949
       two       1.548106
dtype: float64
Group by the first level:
In [44]: grouped = s.groupby(level=0)

In [45]: grouped.sum()
Out[45]:
first
bar   -0.962232
baz    1.237723
foo    0.785980
qux    1.911055
dtype: float64
Group by the second level:
In [46]: s.groupby(level="second").sum()
Out[46]:
second
one    0.980950
two    1.991575
dtype: float64
Iterating over groups
Once you have a GroupBy object, you can iterate over its groups with a for loop. Each iteration yields the group name and the rows of that group:
In [62]: grouped = df.groupby('A')

In [63]: for name, group in grouped:
   ....:     print(name)
   ....:     print(group)
   ....:
bar
     A      B         C         D
1  bar    one  0.254161  1.511763
3  bar  three  0.215897 -0.990582
5  bar    two -0.077118  1.211526
foo
     A      B         C         D
0  foo    one -0.575247  1.346061
2  foo    two -1.143704  1.627081
4  foo    two  1.193555 -0.441652
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580
When grouping by several columns, the group name is a tuple:
In [64]: for name, group in df.groupby(['A', 'B']):
   ....:     print(name)
   ....:     print(group)
   ....:
('bar', 'one')
     A    B         C         D
1  bar  one  0.254161  1.511763
('bar', 'three')
     A      B         C         D
3  bar  three  0.215897 -0.990582
('bar', 'two')
     A    B         C         D
5  bar  two -0.077118  1.211526
('foo', 'one')
     A    B         C         D
0  foo  one -0.575247  1.346061
6  foo  one -0.408530  0.268520
('foo', 'three')
     A      B         C        D
7  foo  three -0.862495  0.02458
('foo', 'two')
     A    B         C         D
2  foo  two -1.143704  1.627081
4  foo  two  1.193555 -0.441652
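One common use of iteration is splitting a DataFrame into a dictionary of sub-frames keyed by group name. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "C": [10, 20, 30]})

# Each iteration yields (group name, sub-DataFrame); group keys are sorted by default.
frames = {name: group for name, group in df.groupby("A")}

for name, frame in frames.items():
    print(name, len(frame))
```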
Aggregation
After grouping, you can aggregate each group:
In [67]: grouped = df.groupby("A")

In [68]: grouped.aggregate(np.sum)
Out[68]:
            C         D
A
bar  0.392940  1.732707
foo -1.796421  2.824590

In [69]: grouped = df.groupby(["A", "B"])

In [70]: grouped.aggregate(np.sum)
Out[70]:
                  C         D
A   B
bar one    0.254161  1.511763
    three  0.215897 -0.990582
    two   -0.077118  1.211526
foo one   -0.983776  1.614581
    three -0.862495  0.024580
    two    0.049851  1.185429
When you group by several keys, the result is indexed by a MultiIndex built from those keys by default. To keep the keys as ordinary columns and get a fresh integer index instead, pass as_index=False:
In [71]: grouped = df.groupby(["A", "B"], as_index=False)

In [72]: grouped.aggregate(np.sum)
Out[72]:
     A      B         C         D
0  bar    one  0.254161  1.511763
1  bar  three  0.215897 -0.990582
2  bar    two -0.077118  1.211526
3  foo    one -0.983776  1.614581
4  foo  three -0.862495  0.024580
5  foo    two  0.049851  1.185429

In [73]: df.groupby("A", as_index=False).sum()
Out[73]:
     A         C         D
0  bar  0.392940  1.732707
1  foo -1.796421  2.824590
The result is the same as calling reset_index on the default output:
In [74]: df.groupby(["A", "B"]).sum().reset_index()
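The equivalence can be checked on a small frame. This is a sketch with hypothetical data; it compares the two results row by row.

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar"],
        "B": ["one", "one", "one", "two"],
        "C": [1, 2, 3, 4],
    }
)

# Keys kept as regular columns directly by groupby.
flat = df.groupby(["A", "B"], as_index=False).sum()

# Keys first placed in a MultiIndex, then moved back into columns.
reset = df.groupby(["A", "B"]).sum().reset_index()

# Compare the two results as lists of row dictionaries.
same_rows = flat.to_dict("records") == reset.to_dict("records")
print(same_rows)
```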
grouped.size() returns the number of rows in each group:
In [75]: grouped.size()
Out[75]:
     A      B  size
0  bar    one     1
1  bar  three     1
2  bar    two     1
3  foo    one     2
4  foo  three     1
5  foo    two     2
grouped.describe() returns summary statistics for each group:
In [76]: grouped.describe()
Out[76]:
      C                                                   ...         D
  count      mean       std       min       25%       50% ...       std       min       25%       50%       75%       max
0   1.0  0.254161       NaN  0.254161  0.254161  0.254161 ...       NaN  1.511763  1.511763  1.511763  1.511763  1.511763
1   1.0  0.215897       NaN  0.215897  0.215897  0.215897 ...       NaN -0.990582 -0.990582 -0.990582 -0.990582 -0.990582
2   1.0 -0.077118       NaN -0.077118 -0.077118 -0.077118 ...       NaN  1.211526  1.211526  1.211526  1.211526  1.211526
3   2.0 -0.491888  0.117887 -0.575247 -0.533567 -0.491888 ...  0.761937  0.268520  0.537905  0.807291  1.076676  1.346061
4   1.0 -0.862495       NaN -0.862495 -0.862495 -0.862495 ...       NaN  0.024580  0.024580  0.024580  0.024580  0.024580
5   2.0  0.024925  1.652692 -1.143704 -0.559389  0.024925 ...  1.462816 -0.441652  0.075531  0.592714  1.109898  1.627081

[6 rows x 16 columns]
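Common aggregation methods
The following aggregation methods are available on GroupBy objects:
Function | Description |
---|---|
mean() | Mean of each group |
sum() | Sum of each group |
size() | Size of each group |
count() | Count of values in each group |
std() | Standard deviation of each group |
var() | Variance of each group |
sem() | Standard error of the mean of each group |
describe() | Descriptive statistics |
first() | First value in each group |
last() | Last value in each group |
nth() | nth value in each group |
min() | Minimum of each group |
max() | Maximum of each group |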
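A short sketch applying several of the methods from the table to one grouped column. The data is made up and includes one missing value, which shows that size() counts every row while count() skips NaN.

```python
import pandas as pd

# Hypothetical data with one missing value in group "b".
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b", "b"],
        "val": [1.0, 3.0, 2.0, None, 6.0],
    }
)
g = df.groupby("key")["val"]

# Collect several aggregations side by side, one row per group.
summary = pd.DataFrame(
    {
        "size": g.size(),    # number of rows, NaN included
        "count": g.count(),  # number of non-null values
        "mean": g.mean(),
        "min": g.min(),
        "max": g.max(),
        "first": g.first(),  # first non-null value in the group
    }
)

print(summary)
```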
You can apply several aggregation functions at once by passing a list:
In [81]: grouped = df.groupby("A")

In [82]: grouped["C"].agg([np.sum, np.mean, np.std])
Out[82]:
          sum      mean       std
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
The resulting columns can be renamed:
In [84]: (
   ....:     grouped["C"]
   ....:     .agg([np.sum, np.mean, np.std])
   ....:     .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
   ....: )
   ....:
Out[84]:
          foo       bar       baz
A
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265
NamedAgg
NamedAgg gives finer control over an aggregation. It has two fields, column and aggfunc, and the keyword used in agg becomes the name of the output column.
In [88]: animals = pd.DataFrame(
   ....:     {
   ....:         "kind": ["cat", "dog", "cat", "dog"],
   ....:         "height": [9.1, 6.0, 9.5, 34.0],
   ....:         "weight": [7.9, 7.5, 9.9, 198.0],
   ....:     }
   ....: )
   ....:

In [89]: animals
Out[89]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [90]: animals.groupby("kind").agg(
   ....:     min_height=pd.NamedAgg(column="height", aggfunc="min"),
   ....:     max_height=pd.NamedAgg(column="height", aggfunc="max"),
   ....:     average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
   ....: )
   ....:
Out[90]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
A plain (column, aggfunc) tuple works as well:
In [91]: animals.groupby("kind").agg(
   ....:     min_height=("height", "min"),
   ....:     max_height=("height", "max"),
   ....:     average_weight=("weight", np.mean),
   ....: )
   ....:
Out[91]:
      min_height  max_height  average_weight
kind
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75
Different aggregations for different columns
Passing a dictionary to agg applies a different aggregation to each column:
In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]:
            C         D
A
bar  0.392940  1.366330
foo -1.796421  0.884785
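Transformation
A transformation returns an object of the same size as the one being grouped. Transformations come up frequently in data analysis.
transform accepts a lambda. In the example below, ts is not defined in this article; it is presumably a Series indexed by timestamps:
In [112]: ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())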
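Since `ts` is not defined above, here is a self-contained sketch of the same call. The dates and values are made up; the only assumption carried over from the original line is that `ts` has a datetime index.

```python
import pandas as pd

# Hypothetical time series: two observations in 2020 and two in 2021.
index = pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01"])
ts = pd.Series([1.0, 4.0, 10.0, 12.0], index=index)

# The first lambda maps each index label (a Timestamp) to its year,
# the second computes the value range within each year.
# transform broadcasts that single number back to every row of the group.
yearly_range = ts.groupby(lambda x: x.year).transform(lambda x: x.max() - x.min())

print(yearly_range)
```

The result has the same index as `ts`, which is what distinguishes transform from an aggregation.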
Fill NA values with the mean of each group:
In [121]: transformed = grouped.transform(lambda x: x.fillna(x.mean()))
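A runnable sketch of group-wise filling, using made-up data in which each group has one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "a" has mean 2.0, group "b" has mean 10.0.
df = pd.DataFrame(
    {
        "key": ["a", "a", "a", "b", "b"],
        "val": [1.0, np.nan, 3.0, np.nan, 10.0],
    }
)

# Each NaN is replaced by the mean of its own group, not the global mean.
filled = df.groupby("key")["val"].transform(lambda x: x.fillna(x.mean()))

print(filled)
```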
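Filtering
The filter method takes a function, often a lambda, that returns True or False for each group. Groups for which it returns False are dropped:
In [136]: sf = pd.Series([1, 1, 2, 3, 3, 3])

In [137]: sf.groupby(sf).filter(lambda x: x.sum() > 2)
Out[137]:
3    3
4    3
5    3
dtype: int64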
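The same method works on a DataFrame, where the function receives each group as a sub-frame. A minimal sketch with hypothetical data that keeps only groups with at least two rows:

```python
import pandas as pd

# Hypothetical data: group "b" has a single row.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "c", "c", "c"],
        "val": [1, 2, 3, 4, 5, 6],
    }
)

# Rows of groups that fail the condition are removed; surviving rows
# keep their original order and index labels.
kept = df.groupby("key").filter(lambda g: len(g) >= 2)

print(kept)
```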
Apply
Some operations fit neither aggregation nor transformation. For those cases pandas provides the apply method, which is more flexible.
In [156]: df
Out[156]:
     A      B         C         D
0  foo    one -0.575247  1.346061
1  bar    one  0.254161  1.511763
2  foo    two -1.143704  1.627081
3  bar  three  0.215897 -0.990582
4  foo    two  1.193555 -0.441652
5  bar    two -0.077118  1.211526
6  foo    one -0.408530  0.268520
7  foo  three -0.862495  0.024580

In [157]: grouped = df.groupby("A")

# could also just call .describe()
In [158]: grouped["C"].apply(lambda x: x.describe())
Out[158]:
A
bar  count    3.000000
     mean     0.130980
     std      0.181231
     min     -0.077118
     25%      0.069390
                ...
foo  min     -1.143704
     25%     -0.862495
     50%     -0.575247
     75%     -0.408530
     max      1.193555
Name: C, Length: 16, dtype: float64
apply also accepts a named function:
In [159]: grouped = df.groupby('A')['C']

In [160]: def f(group):
   .....:     return pd.DataFrame({'original': group,
   .....:                          'demeaned': group - group.mean()})
   .....:

In [161]: grouped.apply(f)
Out[161]:
   original  demeaned
0 -0.575247 -0.215962
1  0.254161  0.123181
2 -1.143704 -0.784420
3  0.215897  0.084917
4  1.193555  1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
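As a further sketch, a function passed to apply can return a small labelled Series per group, and the results are stacked under the group keys. The helper name `low_high` and the data are made up for illustration.

```python
import pandas as pd

# Hypothetical fixed data.
df = pd.DataFrame(
    {
        "key": ["a", "a", "b", "b", "b"],
        "val": [1, 5, 2, 3, 10],
    }
)


def low_high(group):
    """Hypothetical helper: return the smallest and largest value of one group."""
    return pd.Series({"lo": group.min(), "hi": group.max()})


# The result is a Series indexed by (group key, statistic name).
stats = df.groupby("key")["val"].apply(low_high)

print(stats)
```

Calling `stats.unstack()` would pivot the statistic names into columns, giving one row per group.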
This article is also available at http://www.flydean.com/11-python-pandas-groupby/