
20 Short and Powerful Pandas Data Operations

Updated: 2022-04-25 09:21:15 Author: 東哥起飛

This article collects 20 pandas data operations, each one short and powerful. If that sounds useful, read on.

1. ExcelWriter

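DataFrames often contain Chinese text, and if you write them straight to CSV, the Chinese may display as garbled characters. Excel is different: ExcelWriter is a pandas class that writes DataFrames directly to an Excel file and lets you name each sheet.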

import pandas as pd
from pandas import ExcelWriter

df1 = pd.DataFrame([["AAA", "BBB"]], columns=["Spam", "Egg"])
df2 = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
with ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")

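If the data contains datetime columns, you can also set their output format with date_format. In addition, the mode parameter lets you append to an existing Excel file, which makes it very flexible.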

# df is any existing DataFrame; mode="a" appends to the existing workbook
with ExcelWriter("path_to_file.xlsx", mode="a", engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Sheet3")

2. pipe

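The pipe function chains several custom functions into a single operation, which makes the code cleaner and more compact.

For example, data-cleaning code tends to get messy: deduplication, outlier removal, encoding of categorical columns, and so on. With pipe, it looks like this.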

import seaborn as sns

diamonds = sns.load_dataset("diamonds")

# drop_duplicates, remove_outliers and encode_categoricals are user-defined
# functions that each take a DataFrame and return a DataFrame
df_preped = (
    diamonds.pipe(drop_duplicates)
    .pipe(remove_outliers, ["price", "carat", "depth"])
    .pipe(encode_categoricals, ["cut", "color", "clarity"])
)

In a word: clean!
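The helper functions in the chain above are not shown in the article. The following is a minimal sketch of what they might look like, run on a small made-up frame instead of the diamonds dataset. The function bodies, the n_std parameter, and the toy values are all assumptions for illustration, not the author's implementations.

```python
import pandas as pd


def drop_duplicates(df):
    """Hypothetical helper: remove fully duplicated rows."""
    return df.drop_duplicates()


def remove_outliers(df, columns, n_std=3.0):
    """Hypothetical helper: keep rows within n_std standard deviations of each column mean."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        keep &= z.abs() <= n_std
    return df[keep]


def encode_categoricals(df, columns):
    """Hypothetical helper: replace each listed column with integer codes."""
    out = df.copy()
    for col in columns:
        out[col] = pd.factorize(out[col])[0]
    return out


# Made-up toy data: one duplicated row and one extreme price
toy = pd.DataFrame(
    {
        "price": [326, 326, 334, 335, 18000],
        "cut": ["Ideal", "Ideal", "Premium", "Good", "Ideal"],
    }
)

# Extra positional and keyword arguments after the function are passed through by pipe
prepped = (
    toy.pipe(drop_duplicates)
    .pipe(remove_outliers, ["price"], n_std=1.0)
    .pipe(encode_categoricals, ["cut"])
)
print(prepped)
```

Each step receives the DataFrame returned by the previous one, so the chain reads top to bottom in the order the cleaning happens.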

3. factorize

The factorize function is similar to LabelEncoder in sklearn and serves the same purpose.

# Mind the [0] at the end
diamonds["cut_enc"] = pd.factorize(diamonds["cut"])[0]

>>> diamonds["cut_enc"].sample(5)

52103    2
39813    0
31843    0
10675    0
6634     0
Name: cut_enc, dtype: int64

The difference is that factorize returns a tuple of two values: the encoded column and the list of unique category values.

codes, unique = pd.factorize(diamonds["cut"], sort=True)

>>> codes[:10]
array([0, 1, 3, 1, 3, 2, 2, 2, 4, 2], dtype=int64)

>>> unique
['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']
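To make the two return values concrete, here is a small sketch on a made-up Series (the values are illustrative and not taken from the diamonds dataset). It also shows how missing values are encoded and how the codes can be mapped back to labels.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value
cuts = pd.Series(["Ideal", "Premium", "Good", "Premium", np.nan, "Ideal"])

# Default: uniques are listed in order of first appearance; missing values get code -1
codes, uniques = pd.factorize(cuts)

# sort=True: uniques are sorted first, and the codes follow that sorted order
sorted_codes, sorted_uniques = pd.factorize(cuts, sort=True)

# Map the codes back to labels, turning the -1 sentinel into None
decoded = [uniques[c] if c != -1 else None for c in codes]

print(codes, list(uniques))
print(sorted_codes, list(sorted_uniques))
```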

4. explode

explode takes array-like values such as lists and expands each element into its own row.

data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")

data.explode("dirty", ignore_index=True)
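As a quick check of what explode produces, the sketch below reuses the same data and compares the result with and without ignore_index. The final cast to int is an extra step added here for illustration, since the exploded column keeps the object dtype.

```python
import pandas as pd

data = pd.Series([1, 6, 7, [46, 56, 49], 45, [15, 10, 12]]).to_frame("dirty")

# Without ignore_index, rows that came from the same list share the original index label
kept_index = data.explode("dirty")

# With ignore_index=True, the result gets a fresh 0..n-1 index
exploded = data.explode("dirty", ignore_index=True)

# The column is still object dtype after explode; cast it when numbers are needed
clean = exploded["dirty"].astype(int)

print(kept_index.index.tolist())
print(clean.tolist())
```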

5. squeeze

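Often we filter with .loc expecting a single value, but get a one-cell DataFrame (or a one-element Series) back. Calling .squeeze() solves this neatly. For example:

# Without squeeze: a 1x1 DataFrame
subset = diamonds.loc[diamonds.index < 1, ["price"]]
# With squeeze: a scalar
subset.squeeze()

After squeezing, the result is a plain int64 value rather than a DataFrame or Series. Note that subset.squeeze("columns") squeezes only the column axis and would still return a one-element Series.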
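The effect of the axis argument is easy to miss, so here is a small sketch on a made-up frame (the values are illustrative). It contrasts squeezing a single axis with squeezing every length-1 axis.

```python
import pandas as pd

# Made-up stand-in for the diamonds data
df = pd.DataFrame({"price": [326, 327, 334], "carat": [0.23, 0.21, 0.29]})

one_cell = df.loc[df.index < 1, ["price"]]  # 1x1 DataFrame
as_series = one_cell.squeeze("columns")     # only the column axis is squeezed: Series of length 1
as_scalar = one_cell.squeeze()              # every length-1 axis is squeezed: a scalar

print(type(one_cell).__name__, type(as_series).__name__, as_scalar)
```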

6. between

There are many ways to filter a DataFrame, such as loc and isin, but there is also an extremely concise method made specifically for numeric ranges: between. It is simple to use.

diamonds[diamonds["price"]\
      .between(3500, 3700, inclusive="neither")].sample(5)
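The inclusive argument controls whether the bounds themselves count as matches. A minimal sketch on three made-up prices, assuming a pandas version that accepts the string values (1.3 or later):

```python
import pandas as pd

# Made-up prices sitting exactly on and between the bounds
prices = pd.Series([3500, 3600, 3700])

results = {}
for option in ["both", "neither", "left", "right"]:
    in_range = prices.between(3500, 3700, inclusive=option)
    results[option] = prices[in_range].tolist()
    print(option, results[option])
```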

7. T

T is a simple attribute that every DataFrame has; it transposes the frame. It pairs nicely with describe.

boston.describe().T.head(10)

8. pandas styler

Like Excel, pandas can apply conditional formatting to tables, and it takes just one line of code (a little front-end knowledge of HTML and CSS may help).

>>> diabetes.describe().T.drop("count", axis=1)\
                 .style.highlight_max(color="darkred")

Of course, many other kinds of conditional formatting are available.

9. Pandas options

pandas provides many global settings, which are grouped into the 5 categories below.

>>> dir(pd.options)
['compute', 'display', 'io', 'mode', 'plotting']

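In practice display is used the most, for example to set the maximum and minimum number of rows shown or the display precision; the plotting backend is set under plotting.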

pd.options.display.max_columns = None
pd.options.display.precision = 5
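Assigning to pd.options changes the setting for the rest of the session. When a setting is only needed for one block of code, pd.option_context can apply it temporarily. A minimal sketch with a made-up value:

```python
import pandas as pd

value = pd.Series([3.14159265])

# Remember the current setting so the example can show that it is restored
before = pd.get_option("display.precision")

# Inside the with block the Series is rendered with 2 decimal places
with pd.option_context("display.precision", 2):
    short = repr(value)

# After the block the previous setting is back in place
after = pd.get_option("display.precision")

print(short)
print(before, after)
```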

10. convert_dtypes

Regular pandas users know that pandas often stores columns as object, which gets in the way of later operations. In that case convert_dtypes can convert the columns in bulk: it infers the most suitable type for each column and converts it.

sample = pd.read_csv(
    "data/station_day.csv",
    usecols=["StationId", "CO", "O3", "AQI_Bucket"],
)

>>> sample.dtypes

StationId      object
CO            float64
O3            float64
AQI_Bucket     object
dtype: object

>>> sample.convert_dtypes().dtypes

StationId      string
CO            float64
O3            float64
AQI_Bucket     string
dtype: object
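The CSV file used above is not included here, so the sketch below uses a small made-up frame instead. It shows another common case: columns that hold integers or booleans but were stored as float64 or object because of missing values.

```python
import numpy as np
import pandas as pd

# Made-up data: missing values force "count" to float64 and "flag" to object
raw = pd.DataFrame({"count": [1, 2, np.nan], "flag": [True, False, None]})

# convert_dtypes switches to the nullable dtypes that can hold pd.NA
converted = raw.convert_dtypes()

print(raw.dtypes)
print(converted.dtypes)
```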

11. select_dtypes

When you need to filter columns by type, use select_dtypes directly; its include and exclude parameters select and drop columns by dtype.

import numpy as np

# Select the numeric columns
diamonds.select_dtypes(include=np.number).head()
# Exclude the numeric columns
diamonds.select_dtypes(exclude=np.number).head()
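A small self-contained sketch of the same two calls, on a made-up frame with float, integer, string, and boolean columns. Note that boolean columns are not matched by np.number.

```python
import numpy as np
import pandas as pd

# Made-up frame with mixed column types
df = pd.DataFrame(
    {
        "carat": [0.23, 0.21],
        "price": [326, 327],
        "cut": ["Ideal", "Premium"],
        "flag": [True, False],
    }
)

numeric = df.select_dtypes(include=np.number)
non_numeric = df.select_dtypes(exclude=np.number)

print(list(numeric.columns))
print(list(non_numeric.columns))
```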

12. mask

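mask quickly replaces cell values under a custom condition, and it shows up often in the source code of third-party libraries. For example, to blank out the cells where age falls outside 50-60, just supply the condition in cond and the replacement value in other.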

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60]).to_frame("ages")

ages.mask(cond=~ages["ages"].between(50, 60), other=np.nan)
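mask has a mirror-image method, where: mask replaces values where the condition is True, while where keeps them. The sketch below uses the same ages as a plain Series to show that the two are equivalent once the condition is inverted.

```python
import pandas as pd

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60])
in_range = ages.between(50, 60)

masked = ages.mask(~in_range)  # replace where the condition is True (default replacement is NaN)
kept = ages.where(in_range)    # keep where the condition is True, replace elsewhere

print(masked.tolist())
print(masked.equals(kept))
```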

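13. min and max along the column axis

Everyone knows what min and max do, but applying them across columns (axis=1, giving one result per row) is probably less common. The pair can also be used like this: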

index = ["Diamonds", "Titanic", "Iris", "Heart Disease", "Loan Default"]
libraries = ["XGBoost", "CatBoost", "LightGBM", "Sklearn GB"]

df = pd.DataFrame(
    {lib: np.random.uniform(90, 100, 5) for lib in libraries}, index=index
)

>>> df

>>> df.max(axis=1)

Diamonds         99.52684
Titanic          99.63650
Iris             99.10989
Heart Disease    99.31627
Loan Default     97.96728
dtype: float64
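Because the example above uses random numbers, its output changes on every run. The sketch below uses fixed, made-up scores so the result is reproducible, and adds idxmax(axis=1) to show which column produced each row's maximum.

```python
import pandas as pd

# Made-up, fixed scores (not real benchmark results)
scores = pd.DataFrame(
    {
        "XGBoost": [95.0, 91.5],
        "CatBoost": [96.2, 90.1],
        "LightGBM": [94.8, 93.3],
    },
    index=["Diamonds", "Titanic"],
)

best_score = scores.max(axis=1)       # highest value in each row
best_library = scores.idxmax(axis=1)  # column label where that value occurs

print(best_score)
print(best_library)
```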

14. nlargest、nsmallest

Sometimes we want more than the minimum or maximum of a column: we want to see the top N or bottom N values. This is where nlargest and nsmallest come in.

diamonds.nlargest(5, "price")
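A short sketch of both methods on a made-up frame, including the keep parameter, which decides what happens when several rows tie for the last place:

```python
import pandas as pd

# Made-up data with a tie for the highest price
df = pd.DataFrame(
    {"price": [300, 900, 500, 900, 100], "carat": [0.2, 1.1, 0.5, 1.0, 0.1]}
)

top2 = df.nlargest(2, "price")                    # two most expensive rows
bottom2 = df.nsmallest(2, "price")                # two cheapest rows
top_with_ties = df.nlargest(1, "price", keep="all")  # keep every row tied for first place

print(top2)
print(bottom2)
print(top_with_ties)
```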

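15. idxmax, idxmin

When we call max or min along an axis, pandas returns the largest or smallest value. Sometimes, though, the value itself is not what we need; we need the position of that maximum. Very often we want to locate the row first and then operate on it as a whole, for example to pull it out or delete it, so this need comes up a lot.

idxmax and idxmin solve it by returning the index label of the maximum or minimum.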

>>> diamonds.price.idxmax()
27749

>>> diamonds.carat.idxmin()
14
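The follow-up operations mentioned above (extracting or deleting the row) can be sketched as follows, on a made-up frame with string labels to make it clear that idxmax returns an index label rather than an integer position:

```python
import pandas as pd

# Made-up data with a labelled index
df = pd.DataFrame(
    {"price": [326, 18823, 334], "carat": [0.23, 2.29, 0.20]},
    index=["a", "b", "c"],
)

label = df["price"].idxmax()       # index label of the most expensive row
row = df.loc[label]                # pull that whole row out
without_max = df.drop(index=label) # or drop it from the frame

print(label)
print(row)
print(without_max)
```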

16. value_counts

value_counts is used very frequently during data exploration. By default it does not count missing values, yet those are often exactly what we care about. To include them, set the dropna parameter to False.

ames_housing = pd.read_csv("data/train.csv")

>>> ames_housing["FireplaceQu"].value_counts(dropna=False, normalize=True)

NaN    0.47260
Gd     0.26027
TA     0.21438
Fa     0.02260
Ex     0.01644
Po     0.01370
Name: FireplaceQu, dtype: float64
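The CSV file above is not bundled with this article, so here is the same idea on a small made-up Series, comparing the default behavior with dropna=False and normalize=True:

```python
import numpy as np
import pandas as pd

# Made-up column where half of the entries are missing
quality = pd.Series(["Gd", np.nan, "TA", np.nan, "Gd", np.nan])

default_counts = quality.value_counts()                      # missing values ignored
with_nan = quality.value_counts(dropna=False)                # missing values counted too
shares = quality.value_counts(dropna=False, normalize=True)  # proportions instead of counts

print(default_counts)
print(with_nan)
print(shares)
```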

17. clip

Handling outliers is a common step in data analysis. The clip function makes it easy to cap values that fall outside a given range by replacing them with the nearest bound.

>>> ages.clip(50, 60)
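A minimal sketch of clip on the ages used earlier, here as a plain Series. Comparing the result with the original is one simple way to see which values were capped.

```python
import pandas as pd

ages = pd.Series([55, 52, 50, 66, 57, 59, 49, 60])

# Values below 50 become 50, values above 60 become 60, the rest are untouched
clipped = ages.clip(50, 60)

# The entries that differ from the original are the ones that were out of range
changed = ages[ages != clipped]

print(clipped.tolist())
print(changed)
```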

18. at_time、between_time

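When the time granularity of the data is fine, these two functions are extremely useful. They allow more detailed filtering, such as selecting a specific time of day or a time range, down to the hour and minute. Both require the data to have a DatetimeIndex.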

>>> data.at_time("15:00")

from datetime import datetime

>>> data.between_time("09:45", "12:00")
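The data object above is not defined in the article. A minimal sketch with a made-up half-hourly DatetimeIndex (the date and values are assumptions for illustration):

```python
import pandas as pd

# Made-up half-hourly data from 09:00 to 16:30 on one day
idx = pd.date_range("2022-01-03 09:00", periods=16, freq="30min")
data = pd.DataFrame({"value": range(16)}, index=idx)

at_three = data.at_time("15:00")                # rows stamped exactly 15:00
morning = data.between_time("09:45", "12:00")   # rows between the two times, bounds included

print(at_three)
print(morning)
```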

19. hasnans

pandas provides a quick attribute, hasnans, to check whether a given Series contains missing values.

series = pd.Series([2, 4, 6, "sadf", np.nan])

>>> series.hasnans
True

This attribute is available on Series but not on DataFrame.
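Since a DataFrame has no hasnans attribute, the same check needs a different spelling there. One common option, sketched below, is to chain isna() with any():

```python
import numpy as np
import pandas as pd

series = pd.Series([2, 4, 6, "sadf", np.nan])
frame = series.to_frame("col")

series_has_nan = series.hasnans               # attribute on Series
frame_has_attr = hasattr(frame, "hasnans")    # DataFrame does not define it
frame_has_nan = bool(frame.isna().any().any())  # equivalent check for a DataFrame

print(series_has_nan, frame_has_attr, frame_has_nan)
```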

20. GroupBy.nth

This works only on GroupBy objects. Specifically, after grouping, nth returns the nth row of each group (n is zero-based, so nth(5) is the sixth row):

>>> diamonds.groupby("cut").nth(5)
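A small sketch of nth on a made-up frame with two groups of three rows each. The shape of the returned index differs between pandas versions, so the example only looks at the selected values.

```python
import pandas as pd

# Made-up data: three rows per group
df = pd.DataFrame(
    {
        "cut": ["Fair", "Fair", "Fair", "Good", "Good", "Good"],
        "price": [1, 2, 3, 4, 5, 6],
    }
)

second = df.groupby("cut").nth(1)   # second row of each group (zero-based)
last = df.groupby("cut").nth(-1)    # negative n counts from the end of each group

print(second["price"].tolist())
print(last["price"].tolist())
```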
