A Roundup of Pandas Data-Processing Speed-Up Techniques

Updated: 2022-04-18 14:02:02  Author: Mr数据杨

Pandas processes data very efficiently, and with large datasets, knowing the right methods can save a great deal of processing time. This article collects a number of techniques for speeding up data processing in Pandas.

Data preparation
Optimizing datetime data
Simple loops over the data
Looping with .itertuples() and .iterrows()
The .apply() method
Selecting data with .isin()
Binning data with .cut()
Processing with NumPy
Comparing processing efficiency
HDFStore: avoiding reprocessing

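Pandas is built on top of NumPy array structures, and many of its operations are executed in C, either through NumPy or through Pandas' own library of Python extension modules, which are written in Cython and compiled to C. In theory, then, processing should be fast.

So why can the same data, processed by two people on identical hardware, take wildly different amounts of time?

To be clear, this is not a guide to over-optimizing Pandas code. Used properly, Pandas is already built to run fast. Besides, there is a big difference between optimizing code and writing clean code.

This is a guide to using Pandas in a Pythonic way, so as to make full use of its powerful and easy-to-use built-in features.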

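Data preparation

The goal of this example is to apply time-of-use energy tariffs to calculate the total cost of one year's energy consumption. Electricity prices differ by time of day, so the task is to multiply the electricity consumed in each hour by the correct price for that hour.

The data is read from a CSV file with two columns: one for the date and time, and one for the electrical energy consumed in kilowatt-hours (kWh).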
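For readers who do not have the CSV file at hand, the sketch below builds a synthetic stand-in with the same two columns. Everything in it is an assumption made for illustration: the helper name make_demand_profile, the random energy values, and the day-first date strings (chosen to match the '%d/%m/%y %H:%M' format string used later in the article). It is not the article's real data.

```python
import numpy as np
import pandas as pd


def make_demand_profile(n_hours=8760, seed=0):
    """Build a synthetic stand-in for demand_profile.csv (hypothetical values)."""
    # One timestamp per hour, starting at midnight on 1 January 2013
    timestamps = pd.Timestamp("2013-01-01") + pd.to_timedelta(np.arange(n_hours), unit="h")
    # Format as unpadded day/month/year strings, e.g. '1/1/13 0:00'
    date_strings = [
        f"{ts.day}/{ts.month}/{ts.year % 100} {ts.hour}:{ts.minute:02d}"
        for ts in timestamps
    ]
    # Random consumption values; these are NOT the article's measurements
    rng = np.random.default_rng(seed)
    energy = rng.uniform(0.3, 1.5, size=n_hours).round(3)
    return pd.DataFrame({"date_time": date_strings, "energy_kwh": energy})


synthetic_df = make_demand_profile()
print(synthetic_df.head())
```

The timings quoted in this article come from the original file, so numbers measured on synthetic data will differ.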

Optimizing datetime data

import pandas as pd
df = pd.read_csv('数据科学必备Pandas实操数据处理加速技巧汇总/demand_profile.csv')
df.head()
     date_time  energy_kwh
0  1/1/13 0:00       0.586
1  1/1/13 1:00       0.580
2  1/1/13 2:00       0.572
3  1/1/13 3:00       0.596
4  1/1/13 4:00       0.592

At first glance this looks fine, but there is a small problem. Pandas and NumPy have the concept of dtypes (data types). If no arguments are specified, date_time takes the object dtype.

df.dtypes
date_time      object
energy_kwh    float64
dtype: object

type(df.iat[0, 0])
str

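object is a container not only for str but for any column that cannot fit cleanly into a single data type. Working with dates as strings is both laborious and inefficient (and wastes memory as well). To work with time-series data, the date_time column needs to be converted into an array of datetime objects (Pandas calls these Timestamp).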

df['date_time'] = pd.to_datetime(df['date_time'])
df['date_time'].dtype
datetime64[ns]
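The memory cost of keeping dates as strings can be checked directly. The following is a minimal sketch on made-up timestamps (not the article's file); the variable names are illustrative only. It compares the deep memory footprint of an object column of date strings with the same values stored as datetime64.

```python
import numpy as np
import pandas as pd

# 1,000 hourly timestamps (synthetic, for illustration only)
timestamps = pd.Timestamp("2013-01-01") + pd.to_timedelta(np.arange(1000), unit="h")

# The same values as Python strings in an object column ...
as_strings = pd.Series(timestamps.strftime("%d/%m/%y %H:%M"), dtype=object)
# ... and converted to a datetime64 column
as_datetimes = pd.to_datetime(as_strings, format="%d/%m/%y %H:%M")

# deep=True counts the Python string objects, not just the pointers to them
string_bytes = as_strings.memory_usage(index=False, deep=True)
datetime_bytes = as_datetimes.memory_usage(index=False, deep=True)

print(f"object column:     {string_bytes} bytes")
print(f"datetime64 column: {datetime_bytes} bytes")
```

A datetime64 value occupies a fixed 8 bytes, whereas each string is a separate Python object, so the object column should come out several times larger.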

There is now a DataFrame named df with two columns and a numeric index for referencing rows.

df.head()
               date_time    energy_kwh
0    2013-01-01 00:00:00         0.586
1    2013-01-01 01:00:00         0.580
2    2013-01-01 02:00:00         0.572
3    2013-01-01 03:00:00         0.596
4    2013-01-01 04:00:00         0.592

Test this with Jupyter's built-in %%time cell magic, which reports the time taken to run a cell.

def convert(df, column_name):
    return pd.to_datetime(df[column_name])

%%time
df['date_time'] = convert(df, 'date_time')

Wall time: 663 ms

def convert_with_format(df, column_name):
    # An explicit format lets Pandas skip inferring the layout of the strings
    return pd.to_datetime(df[column_name], format='%d/%m/%y %H:%M')

%%time
df['date_time'] = convert_with_format(df, 'date_time')

Wall time: 1.99 ms

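The reported wall time drops from 663 ms to 1.99 ms, roughly a 330-fold speed-up. For a fair comparison both calls should start from the original string column, because converting a column that is already datetime64 is close to a no-op. On large datasets the absolute time saved grows accordingly.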
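%%time only works inside Jupyter or IPython. Outside a notebook, a small helper around time.perf_counter() gives a comparable measurement. The sketch below is an illustration on synthetic strings; the helper name timed and the generated data are assumptions, not part of the original article.

```python
import time

import numpy as np
import pandas as pd

# Synthetic day-first date strings such as '01/01/13 00:00' (illustration only)
timestamps = pd.Timestamp("2013-01-01") + pd.to_timedelta(np.arange(8760), unit="h")
raw = pd.Series(timestamps.strftime("%d/%m/%y %H:%M"), dtype=object)


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Always time the conversion on the raw strings, never on a converted column
parsed, seconds = timed(pd.to_datetime, raw, format="%d/%m/%y %H:%M")
print(f"explicit format: {seconds * 1000:.2f} ms")
```

A single run is noisy; for steadier numbers, repeat the call several times (for example with the standard-library timeit module) and compare the best results.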

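Simple loops over the data

Now that the dates and times are in a proper format, the electricity cost can be calculated. Cost varies by hour, so a cost factor has to be applied conditionally to each hour of the day.

In this example, the time-of-use tariff is defined in three bands.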

data_type = {
    # Peak
    "Peak":{"Cents per kWh":28,"Time Range":"17:00 to 24:00"},
    # Shoulder (normal hours)
    "Shoulder":{"Cents per kWh":20,"Time Range":"7:00 to 17:00"},
    # Off-peak
    "Off-Peak":{"Cents per kWh":12,"Time Range":"0:00 to 7:00"},
}

If the price were a flat 28 cents per kWh for every hour of the day, the calculation would be a single line:

df['cost_cents'] = df['energy_kwh'] * 28

               date_time    energy_kwh       cost_cents
0    2013-01-01 00:00:00         0.586           16.408
1    2013-01-01 01:00:00         0.580           16.240
2    2013-01-01 02:00:00         0.572           16.016
3    2013-01-01 03:00:00         0.596           16.688
4    2013-01-01 04:00:00         0.592           16.576
...

But the cost actually depends on the time of day. This is where you will see many people use Pandas in a way it was not intended: writing a loop to perform the conditional calculation.

def apply_tariff(kwh, hour):
    """Calculate the electricity cost for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

def apply_tariff_loop(df):
    energy_cost_list = []
    for i in range(len(df)):
        # Look up each row by position and collect its cost in a list
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)

    df['cost_cents'] = energy_cost_list

%%time
apply_tariff_loop(df)

Wall time: 2.59 s

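Looping with .itertuples() and .iterrows()

Pandas actually makes the for i in range(len(df)) syntax redundant by providing the DataFrame.itertuples() and DataFrame.iterrows() methods. Both are generator methods that yield one row at a time.

.itertuples() yields a named tuple for each row, with the row's index value as the first element of the tuple. A named tuple is a data structure from Python's collections module that behaves like a Python tuple but has fields accessible by attribute lookup.

.iterrows() yields an (index, Series) pair (a tuple) for each row in the DataFrame.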

def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        energy_used = row['energy_kwh']
        hour = row['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list

%%time
apply_tariff_iterrows(df)

Wall time: 808 ms

That is a further speed-up of roughly 3x over the explicit loop (808 ms versus 2.59 s).
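The section describes .itertuples() but only times .iterrows(). For completeness, here is a minimal sketch of the same calculation written with .itertuples(). The function name apply_tariff_itertuples and the three-row sample DataFrame are illustrative assumptions, and no timing is claimed for it; apply_tariff() is repeated so that the snippet runs on its own.

```python
import pandas as pd


def apply_tariff(kwh, hour):
    """Calculate the electricity cost for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh


def apply_tariff_itertuples(df):
    energy_cost_list = []
    # index=False leaves the index out of each named tuple
    for row in df.itertuples(index=False):
        # Columns are read as attributes of the named tuple
        energy_cost = apply_tariff(row.energy_kwh, row.date_time.hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list


# Tiny made-up sample: one hour from each tariff band
sample = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-01-01 00:00',
                                 '2013-01-01 07:00',
                                 '2013-01-01 17:00']),
    'energy_kwh': [1.0, 1.0, 1.0],
})
apply_tariff_itertuples(sample)
print(sample)
```

Because .itertuples() does not build a Series for every row, it is generally the faster of the two iterators, though both remain row-by-row Python loops.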

The .apply() method

This can be improved further with the .apply() method. Pandas' .apply() takes a function (a callable) and applies it along an axis of the DataFrame (all rows or all columns).

Here a lambda function passes the two columns of each row to apply_tariff().

def apply_tariff_withapply(df):
    df['cost_cents'] = df.apply(
        lambda row: apply_tariff(
            kwh=row['energy_kwh'],
            hour=row['date_time'].hour),
        axis=1)

%%time
apply_tariff_withapply(df)

Wall time: 181 ms

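.apply() has a clear syntactic advantage: the code is concise, readable, and explicit. In this case it takes roughly a quarter of the time of the .iterrows() approach (181 ms versus 808 ms). It is still not a vectorized operation, though: with axis=1, the function is called once per row.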

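Selecting data with .isin()

But how can a conditional calculation be applied as a vectorized operation in Pandas? One trick is to select and group parts of the DataFrame according to the conditions, and then apply a vectorized operation to each selected group.

Use Pandas' .isin() method to select the rows, then apply the calculation to them in a vectorized operation. Before doing this, it is more convenient to set the date_time column as the DataFrame's index.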

df.set_index('date_time', inplace=True)

def apply_tariff_isin(df):
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours,'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours,'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12

%%time
apply_tariff_isin(df)

Wall time: 53.5 ms

Each .isin() call here returns a Boolean array, which is then used to select the matching rows:

[False, False, False, ..., True, True, True]
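To make the mask concrete, the sketch below runs the peak-hours selection on a four-row sample. The sample values are made up for illustration.

```python
import pandas as pd

# Four made-up readings on either side of the 07:00 and 17:00 boundaries
index = pd.to_datetime(['2013-01-01 06:00', '2013-01-01 07:00',
                        '2013-01-01 16:00', '2013-01-01 17:00'])
sample = pd.DataFrame({'energy_kwh': [1.0, 1.0, 1.0, 1.0]}, index=index)

# True where the hour falls in the peak band (17:00 up to 24:00)
peak_hours = sample.index.hour.isin(range(17, 24))
print(peak_hours)

# Only the rows selected by the mask are written; the rest stay NaN for now
sample.loc[peak_hours, 'cost_cents'] = sample.loc[peak_hours, 'energy_kwh'] * 28
print(sample)
```

Only the 17:00 row is selected, so after this first assignment the other three rows hold NaN until the shoulder and off-peak masks fill them in.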

Binning data with .cut()

Defining a list of time cut points together with the matching price labels makes the operation simpler still, although the code can be a little harder for beginners to read.

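def apply_tariff_cut(df):
    # right=False makes each bin half-open: [0, 7), [7, 17), [17, 24),
    # so hours 7 and 17 get the same rates as in apply_tariff()
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           right=False,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']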
    
%%time
apply_tariff_cut(df)

Wall time: 2.99 ms
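The boundary hours deserve a quick check, because apply_tariff() treats each band as half-open (for example 7 <= hour < 17), whereas pd.cut() closes bins on the right by default. The sketch below, on a handful of hand-picked hours, compares the two settings. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hours chosen to sit on and around the band boundaries
hours = np.array([0, 6, 7, 16, 17, 23])

# Default right=True: bins are [0, 7], (7, 17], (17, 24] with include_lowest=True
closed_right = np.asarray(
    pd.cut(hours, bins=[0, 7, 17, 24], include_lowest=True,
           labels=[12, 20, 28]).astype(int))

# right=False: bins are [0, 7), [7, 17), [17, 24)
closed_left = np.asarray(
    pd.cut(hours, bins=[0, 7, 17, 24], right=False,
           labels=[12, 20, 28]).astype(int))

print("hours:       ", hours.tolist())
print("right=True:  ", closed_right.tolist())
print("right=False: ", closed_left.tolist())
```

With right-closed bins, hour 7 is priced at the off-peak rate and hour 17 at the shoulder rate. Only right=False reproduces the bands defined in apply_tariff().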

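Processing with NumPy

Pandas Series and DataFrames are designed on top of the NumPy library. This gives greater computational flexibility, because Pandas works seamlessly with NumPy arrays and operations.

Use NumPy's digitize() function. It is similar to Pandas' cut() in that the data is binned, but this time the result is an array of indices indicating which bin each hour belongs to. These indices are then applied to the array of prices.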

import numpy as np

def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values

%%time
apply_tariff_digitize(df)

Wall time: 1.99 ms
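The indexing step is easier to follow on a few values. This small sketch, using hand-picked hours, shows the bin indices that np.digitize() returns and how they pick out entries of the price array.

```python
import numpy as np

prices = np.array([12, 20, 28])
# Hours chosen to sit on and around the band boundaries
hours = np.array([0, 6, 7, 16, 17, 23])

# With the default right=False, index i means bins[i-1] <= hour < bins[i];
# hours below 7 get index 0
bin_index = np.digitize(hours, bins=[7, 17, 24])
print(bin_index.tolist())

# Fancy indexing maps each bin index to its price
rates = prices[bin_index]
print(rates.tolist())
```

Hours 7 and 17 land in the shoulder and peak bins respectively, matching the half-open ranges in apply_tariff().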

Comparing processing efficiency

Here is how the different approaches above compare.

Function	Run time
apply_tariff_loop()	2.59 s
apply_tariff_iterrows()	808 ms
apply_tariff_withapply()	181 ms
apply_tariff_isin()	53.5 ms
apply_tariff_cut()	2.99 ms
apply_tariff_digitize()	1.99 ms

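HDFStore: avoiding reprocessing

When building complex data models, it is often convenient to do some preprocessing of the data. With 10 years of minute-frequency electricity consumption data, simply converting the dates and times to datetime could take 20 minutes, even with the format parameter specified. That should only need to be done once, not every time the model is run for testing or analysis.

A very useful approach here is to preprocess once and then store the data in its processed form, to be used whenever it is needed. But how can the data be stored in the right format without having to reprocess it again? Saving it as CSV simply loses the datetime objects, and they have to be reprocessed when the file is read again.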
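The CSV limitation is easy to demonstrate. The sketch below writes a two-row DataFrame to an in-memory buffer (io.StringIO is used here instead of a file on disk, purely for convenience) and reads it back. The two sample rows reuse values shown earlier in the article.

```python
import io

import pandas as pd

df_small = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-01-01 00:00', '2013-01-01 01:00']),
    'energy_kwh': [0.586, 0.580],
})

# Write to CSV text in memory, then read it back twice
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)  # dates come back as strings

buffer.seek(0)
reparsed = pd.read_csv(buffer, parse_dates=['date_time'])  # parsed again on load

print(plain.dtypes)
print(reparsed.dtypes)
```

Passing parse_dates restores the datetime column, but only by redoing the conversion every time the file is loaded, which is exactly the cost this section sets out to avoid.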

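Pandas has a built-in solution for this that uses HDF5, a high-performance storage format designed specifically for storing tabular arrays of data. Pandas' HDFStore class allows a DataFrame to be stored in an HDF5 file so that it can be accessed efficiently while still retaining column types and other metadata. HDFStore is a dict-like class, so it can be read from and written to much like a Python dict object. (HDFStore relies on the optional PyTables package being installed.)

Store the preprocessed electricity-consumption DataFrame, df, in an HDF5 file.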

data_store = pd.HDFStore('processed_data.h5')

# Put the DataFrame into the store under the key preprocessed_df
data_store['preprocessed_df'] = df
data_store.close()

Read the data back from the HDF5 file, with the data types preserved.

data_store = pd.HDFStore('processed_data.h5')

preprocessed_df = data_store['preprocessed_df']
data_store.close()
