
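Handling Missing and Duplicate Data with pandas: Methods and Practice

Updated: 2023-06-21 14:21:49  Submitted by: jingxian

This article walks through how to handle missing and duplicate data with pandas, using worked examples. Corrections and comments on anything that is wrong or incomplete are welcome.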

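1. Missing data

Missing data usually means that a field in a record has no value. In practice there is a second kind, a gap in a time series: for example, when data is collected on the hour and the reading for one hour is absent, so an entire row is missing.

If the affected records are not to be deleted, the usual remedy is to fill in the values. Common choices are padding with 0 or with an empirical value; a more principled approach is linear interpolation or a more sophisticated algorithm.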
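To make the three options concrete, here is a minimal sketch on a small series. The values and the empirical constant 5.5 are assumptions chosen for illustration, not data from this article.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly temperatures with one missing reading
s = pd.Series([7.0, 6.0, np.nan, 4.0])

filled_zero = s.fillna(0)                        # pad with 0
filled_const = s.fillna(5.5)                     # pad with an assumed empirical value
filled_linear = s.interpolate(method='linear')   # midpoint of the two neighbours

print(filled_linear.tolist())
```

Padding with 0 is the simplest choice but can distort the data, while interpolation keeps the filled value between its neighbours.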

1.1. Filling in a time series

For example, take an hourly time series in which the 03:00 record is missing. Fill the gap by interpolation and upsample the data to half-hour intervals.

Code 1.

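import pandas as pd

key = ['getTime', 'temp', 'text', 'humidity']
# Hourly weather records; the 03:00 row is missing ('晴' = sunny, '陰' = overcast)
data = [['2023-04-30T00:00', 7, '晴', 57],
        ['2023-04-30T01:00', 6, '晴', 58],
        ['2023-04-30T02:00', 6, '陰', 55],
        ['2023-04-30T04:00', 4, '晴', 50]]
df = pd.DataFrame(data, columns=key)
# Use the parsed timestamps as the index; astype('datetime64') without a unit
# is rejected by pandas 2.x, so pd.to_datetime is used instead
df.index = pd.to_datetime(df['getTime'])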

Fill in the time series and linearly interpolate the numeric columns at the same time.

Code 2.

df1 = df.resample('30min').interpolate(method='linear')

Note: filling in a time series this way requires the DataFrame's index to be a DatetimeIndex.
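The index requirement can be sketched as follows, assuming a hypothetical frame `raw` whose timestamps are still strings in a column. Because newer pandas releases warn when interpolate() meets object-dtype (string) columns, this sketch upsamples first and then interpolates only the numeric column.

```python
import pandas as pd

# Hypothetical frame whose timestamps are still plain strings in a column
raw = pd.DataFrame({
    'getTime': ['2023-04-30T00:00', '2023-04-30T01:00', '2023-04-30T03:00'],
    'temp': [7, 6, 4],
})

# Parse the strings and move them into the index so resample() can use them
indexed = raw.assign(getTime=pd.to_datetime(raw['getTime'])).set_index('getTime')

# Upsample to 30 minutes: the new rows start out as NaN ...
half_hourly = indexed.resample('30min').asfreq()
# ... and are then filled by linear interpolation on the numeric column only
half_hourly['temp'] = half_hourly['temp'].interpolate(method='linear')

print(half_hourly)
```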

Alternatively, build a table that holds the complete time series, then join it to the data with pd.merge to fill in the missing timestamps.

Code 3.

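# Complete hourly range; tz-naive, matching the index built in Code 1
times = pd.date_range('2023-04-30 00:00', '2023-04-30 04:59', freq='1h')
# For timezone-aware data (the index of df must carry the same tz):
# times = pd.date_range('2023-04-30 00:00', '2023-04-30 04:59', freq='1h', tz='Asia/Shanghai')
df0 = pd.DataFrame(index=times)
# Left join on the index: timestamps absent from df become all-NaN rows
df0 = pd.merge(left=df0, right=df, left_index=True, right_index=True, how='left')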
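A different technique that reaches the same result is `DataFrame.reindex`, which aligns the data to a complete index in one call. This is offered as an alternative sketch, not the approach used in this article; the frame `obs` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical hourly data with the 03:00 record missing
obs = pd.DataFrame(
    {'temp': [7, 6, 6, 4]},
    index=pd.to_datetime(['2023-04-30 00:00', '2023-04-30 01:00',
                          '2023-04-30 02:00', '2023-04-30 04:00']),
)

# Complete hourly range covering the same period
full_range = pd.date_range('2023-04-30 00:00', '2023-04-30 04:00', freq='1h')

# Rows for timestamps that were absent are created and filled with NaN
completed = obs.reindex(full_range)

print(completed)
```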

1.2. Missing data items

1.2.1. Linear interpolation

Code 4.

df0[['temp','humidity']] = df0[['temp','humidity']].interpolate(method='linear')

Alternatively, apply linear interpolation while filling in the time series; see Code 2.
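One detail worth knowing: method='linear' treats the rows as equally spaced and ignores the index values. When timestamps are unevenly spaced, method='time' weights by the actual time gaps instead. The sketch below uses made-up values to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical readings at 00:00 and 04:00, with an empty slot at 01:00
s = pd.Series(
    [0.0, np.nan, 8.0],
    index=pd.to_datetime(['2023-04-30 00:00', '2023-04-30 01:00', '2023-04-30 04:00']),
)

by_position = s.interpolate(method='linear')  # halfway between neighbours
by_time = s.interpolate(method='time')        # one quarter of the way in time

print(by_position.iloc[1], by_time.iloc[1])
```

For the evenly spaced hourly data used in this article the two methods give the same result.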

1.2.2. Copying the previous record

Non-numeric data can be filled by copying the value from the previous record. The same approach works for numeric data.

Code 5.

# Forward fill; fillna(method='ffill') is deprecated since pandas 2.1
df0[['text', 'getTime']] = df0[['text', 'getTime']].ffill()
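As a small illustration of forward filling, the sketch below uses a hypothetical frame `weather` with both a text and a numeric column. The `limit` argument, which caps how many consecutive nulls are filled, is shown as an optional refinement.

```python
import numpy as np
import pandas as pd

# Hypothetical records with gaps in both a text and a numeric column
weather = pd.DataFrame({
    'text': ['sunny', np.nan, np.nan, 'overcast'],
    'temp': [7.0, np.nan, 5.0, 4.0],
})

filled = weather.ffill()          # copy the previous value into every gap
limited = weather.ffill(limit=1)  # fill at most one consecutive gap

print(filled)
print(limited)
```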

1.2.3. Filling nulls with a constant

For example, fill the null values in the result of Code 3 with 6.

Code 6.

df0.fillna(6, inplace=True)
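Note that a single scalar is written into every column that contains nulls, text columns included. If that is not wanted, one option is to pass a per-column mapping to fillna. The frame `gaps` and the fill values below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one fully empty row
gaps = pd.DataFrame({
    'temp': [7.0, np.nan],
    'text': ['sunny', np.nan],
    'humidity': [57.0, np.nan],
})

# A different fill value for each column
per_column = gaps.fillna({'temp': 6, 'humidity': 50, 'text': 'unknown'})

print(per_column)
```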

2. Removing duplicate rows

First, construct some duplicate data by concatenating the table with rows taken from itself.

Code 7.

# Append the first two records of the same table
df2 = pd.concat([df,df.head(2)])

Here, head(2) takes the first two records of the table.

2.1. Removing fully duplicated rows

Remove the duplicate records.

Code 8.

df2 = df2.drop_duplicates()

Note: this removes only rows that are identical in every column. The index is not part of the comparison.
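Before dropping anything, it can help to inspect which rows would be removed. A minimal sketch using `DataFrame.duplicated`, with a made-up frame `records`:

```python
import pandas as pd

# Hypothetical records in which the third row repeats the first
records = pd.DataFrame({
    'temp': [7, 6, 7],
    'text': ['sunny', 'sunny', 'sunny'],
})

dup_mask = records.duplicated()   # True for rows that repeat an earlier row
n_dups = int(dup_mask.sum())      # how many rows would be dropped
deduped = records.drop_duplicates()

print(dup_mask.tolist(), n_dups)
```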

2.2. Removing duplicates by column

Deduplicate on one or more columns; among rows that share the same value, keep the first occurrence.

Code 9.

df2 = df2.drop_duplicates('text',keep='first')

# General form with several columns (A, B and C are placeholder column names)
df.drop_duplicates(subset=['A', 'B', 'C'], keep='first', inplace=True)

The parameters are as follows:

subset: the column name(s) to deduplicate on; defaults to None, which means all columns are compared.
keep: one of first, last, or False; defaults to first. first keeps only the first occurrence of each set of duplicates and drops the rest, last keeps only the last occurrence, and False drops every row that is duplicated.
inplace: Boolean; defaults to False, which returns a deduplicated copy. If True, the duplicates are removed from the original data directly.
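The effect of the three `keep` settings can be seen side by side in a short sketch. The frame `obs` is hypothetical; its first two rows share the same `text` value.

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 have the same 'text'
obs = pd.DataFrame({
    'text': ['sunny', 'sunny', 'overcast'],
    'temp': [7, 6, 5],
})

first = obs.drop_duplicates(subset=['text'], keep='first')       # keeps rows 0 and 2
last = obs.drop_duplicates(subset=['text'], keep='last')         # keeps rows 1 and 2
unique_only = obs.drop_duplicates(subset=['text'], keep=False)   # keeps row 2 only

print(first['temp'].tolist(), last['temp'].tolist(), unique_only['temp'].tolist())
```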

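3. Modifying data by condition

To modify some values by condition, a common approach is to process them with a function through apply(); you can also locate the rows with loc and assign to them directly. The example below works on the result of Code 1.

This article uses loc to modify data by condition.

Code 10.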

# Rows at or after 01:00
mask = df.index >= pd.to_datetime('2023-04-30 01:00')
# Add 10 to temp and humidity in those rows
df.loc[mask, ['temp', 'humidity']] += 10

The condition can be on the index, as here, or on specific columns.
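For comparison, here is a sketch of a condition on a column rather than on the index, written once with apply() and once with loc. The frame `obs` and the threshold of 6 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical records
obs = pd.DataFrame({
    'temp': [7, 6, 6, 4],
    'humidity': [57, 58, 55, 50],
})

# apply(): call a function on every value of the column
adjusted_apply = obs['temp'].apply(lambda t: t + 10 if t >= 6 else t)

# loc: select the matching rows with a boolean mask and update them in place
obs_loc = obs.copy()
obs_loc.loc[obs_loc['temp'] >= 6, 'temp'] += 10

print(adjusted_apply.tolist(), obs_loc['temp'].tolist())
```

Both give the same values; the loc form is vectorized and avoids a Python-level function call per row.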

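4. Finding and handling null values

4.1. Querying null values

Find the null values and overwrite another field in the same row. For example, in the result of Code 2, wherever "text" is null, replace "temp" with the value of "humidity".

Code 11.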

mask = df1['text'].isnull()
df1.loc[mask, 'temp'] = df1.loc[mask, 'humidity']

Again using the result of Code 2, select the rows whose "text" is not null.

Code 12.

df1 = df1.loc[~df1['text'].isnull()]

Here, ~ is the negation operator, and the .isnull() method tests whether a value is null.
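A quick way to see where the nulls are is to count them per column. The sketch below uses a hypothetical frame `obs` and also shows `notnull()`, which selects the same rows as `~isnull()`.

```python
import numpy as np
import pandas as pd

# Hypothetical records with one missing 'text' value
obs = pd.DataFrame({
    'temp': [7.0, 6.5, 6.0],
    'text': ['sunny', np.nan, 'sunny'],
})

null_counts = obs.isnull().sum()               # number of nulls in each column
non_null_rows = obs[obs['text'].notnull()]     # same rows as obs[~obs['text'].isnull()]

print(null_counts)
```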

4.2. Dropping null values

Using the result of Code 2, drop the rows that contain null values.

Code 13.

df1 = df1.dropna()
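By default dropna() removes a row if any of its values is null. The `how` and `subset` parameters narrow that rule, as the following sketch on a made-up frame `obs` shows.

```python
import numpy as np
import pandas as pd

# Hypothetical records with different null patterns
obs = pd.DataFrame({
    'temp': [7.0, np.nan, np.nan, 5.0],
    'text': ['sunny', np.nan, 'overcast', np.nan],
})

any_null = obs.dropna()                   # drop rows with at least one null
all_null = obs.dropna(how='all')          # drop only rows where every value is null
by_text = obs.dropna(subset=['text'])     # look at the 'text' column only

print(any_null.index.tolist(), all_null.index.tolist(), by_text.index.tolist())
```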

Summary

The above is based on my own experience; I hope it serves as a useful reference.
