
Data Cleaning with pandas: Worked Examples

Updated: 2024-07-23 08:26:27  Author: 写代码的大学生


Data cleaning is a very important step in data science and data analysis. It means preprocessing the data before analysis to ensure its quality and consistency. Cleaning data with Python's pandas library is common practice, because pandas provides a rich set of data manipulation and cleaning functions.

1. Import the required libraries

import pandas as pd
from pandas import DataFrame
import numpy as np

2. Handling missing data

There are two kinds of missing values:

None
np.nan (NaN)

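Why does data analysis use the float-typed null (NaN) rather than the object-typed one (None)?

Data analysis routinely applies arithmetic to raw data. If the missing values in the raw data are stored as NaN, they do not interfere with or interrupt the computation.
NaN can take part in arithmetic (the result is simply NaN).
None cannot take part in arithmetic.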
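The difference can be checked directly. The following is a minimal sketch; the variable names are chosen here for illustration and do not come from the article.

```python
import numpy as np

# None is a plain Python object; np.nan is an ordinary float.
none_type = type(None).__name__
nan_type = type(np.nan).__name__

# NaN propagates through arithmetic instead of raising an error.
nan_result = np.nan + 1

# None cannot take part in arithmetic: Python raises TypeError.
try:
    None + 1
    none_raises = False
except TypeError:
    none_raises = True

print(none_type, nan_type, nan_result, none_raises)
```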

df = DataFrame(data=np.random.randint(0,100,size=(7,5)))
# Insert three missing values; in these numeric columns pandas stores None as NaN
df.iloc[2,3] = None
df.iloc[4,2] = np.nan
df.iloc[5,4] = None
df
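When None ends up in numeric data, pandas stores it as NaN, which is why both kinds of missing value behave the same once they are inside a numeric column. A small sketch on a hypothetical Series (not part of the article's data) illustrates this, along with the fact that pandas reductions skip NaN by default while a plain NumPy sum does not.

```python
import numpy as np
import pandas as pd

# A None inside numeric data is converted to NaN and the dtype becomes float64.
s = pd.Series([1, None, 3])
print(s.dtype)
print(s.isna().sum())

# pandas reductions ignore NaN by default; NumPy's plain sum propagates it.
pandas_total = s.sum()
numpy_total = np.sum(s.to_numpy())
print(pandas_total, numpy_total)
```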


3. Handling null values with pandas

isnull
notnull
any
all
dropna
fillna
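Of the functions listed, dropna and fillna are the two that actually change the data. The sketch below shows typical calls on a hypothetical toy frame (the frame and variable names are assumptions made for illustration). Which fill strategy is appropriate depends on the data, so treat these as options rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing cell in each column.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

rows_kept = toy.dropna(axis=0)        # drop every row that contains a NaN
cols_kept = toy.dropna(axis=1)        # drop every column that contains a NaN
filled_zero = toy.fillna(0)           # replace NaN with a constant
filled_prev = toy.ffill()             # carry the last valid value forward
filled_mean = toy.fillna(toy.mean())  # replace NaN with each column's mean

print(rows_kept.shape, cols_kept.shape)
print(filled_mean)
```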

# Which rows contain null values?
# any(axis=1) checks whether each row has at least one null
df.isnull().any(axis=1) # any is applied across each row of the isnull result
# Rows marked True are the rows with missing data


df.notnull()
df.notnull().all(axis=1)
# Use the boolean Series as a row indexer on the source data (keeps the complete rows)
df.loc[df.notnull().all(axis=1)]
# Select the rows that contain nulls
df.loc[df.isnull().any(axis=1)]
# Get the index labels of the rows that contain nulls
indexs = df.loc[df.isnull().any(axis=1)].index
indexs
# Drop those rows by label
df.drop(labels=indexs,axis=0)
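The boolean-mask route and the drop-by-label route give the same result as a single dropna call. The sketch below checks this on a reproducible frame of the same shape. The name df_demo, the fixed seed, and the float dtype are assumptions made here so that the example is self-contained.

```python
import numpy as np
import pandas as pd

# Frame shaped like the one above; float dtype so NaN can be stored directly.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.integers(0, 100, size=(7, 5)).astype(float))
df_demo.iloc[2, 3] = np.nan
df_demo.iloc[4, 2] = np.nan
df_demo.iloc[5, 4] = np.nan

# Route 1: keep rows selected by a boolean mask of complete rows.
by_mask = df_demo.loc[df_demo.notnull().all(axis=1)]
# Route 2: collect the labels of incomplete rows, then drop them.
bad_index = df_demo.loc[df_demo.isnull().any(axis=1)].index
by_drop = df_demo.drop(labels=bad_index, axis=0)
# Route 3: dropna removes incomplete rows in one call.
by_dropna = df_demo.dropna(axis=0)

print(by_mask.equals(by_drop), by_drop.equals(by_dropna))
print(list(by_dropna.index))
```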

4. Case study

Data description:

The data is temperature data from one cold store. Columns 1-7 correspond to 7 temperature sensors, each sampled once per minute.

Processing goal:

Using the 4 required devices (1-4), build a model of the cold store's temperature field and estimate the readings of devices 5-7.
In the end, each cold store needs only 4 devices instead of 7.
f(1-4) --> y(5-7)
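The mapping f(1-4) --> y(5-7) is a multi-output regression: four input columns, three continuous target columns. The article does not show a model, so the following is only a sketch under stated assumptions. It uses synthetic data in place of the real sensor readings, and it swaps in ordinary least squares via np.linalg.lstsq as one simple way to fit such a mapping. All names and the hypothetical linear relation are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 readings from the 4 "required" sensors.
X = rng.normal(loc=-18.0, scale=1.0, size=(200, 4))
# Hypothetical linear relation to the 3 remaining sensors, plus small noise.
true_W = rng.normal(size=(4, 3))
true_b = np.array([0.5, -0.2, 0.1])
Y = X @ true_W + true_b + rng.normal(scale=0.01, size=(200, 3))

# Append a column of ones so the intercept is fitted together with the weights.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)

# Predict sensors 5-7 from sensors 1-4 and measure the fit error.
Y_hat = X1 @ coef
rmse = np.sqrt(np.mean((Y_hat - Y) ** 2))
print(coef.shape, rmse)
```

On real data the relation may not be linear, so the error on held-out readings would need to be checked before removing any sensors.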

Processing steps:

1. The raw data has dropped frames and needs preprocessing;
2. Plot with matplotlib;
3. Build a logistic regression model.

There is no standard answer; work according to your own understanding, and briefly describe your process in writing.

The test data is testData.xlsx

data = pd.read_excel('./data/testData.xlsx').drop(labels=['none','none1'],axis=1)
data


data.shape
# Drop the rows that contain nulls and compare the shape with the original
data.dropna(axis=0).shape
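Dropping incomplete rows is the simplest way to handle the gaps. If a dropped frame means that a whole minute has no row at all, dropna has nothing to remove. One possible alternative is to restore the regular one-minute grid and interpolate. The sketch below uses synthetic data, because the layout of testData.xlsx is not shown; the column name sensor_1 and the timestamps are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one sensor: 10 readings, one per minute.
full_index = pd.date_range("2024-01-01 00:00", periods=10, freq="min")
readings = pd.DataFrame({"sensor_1": np.linspace(-18.0, -17.1, 10)}, index=full_index)

# Simulate dropped frames by removing two timestamps entirely.
observed = readings.drop(full_index[[3, 7]])

# Restore the regular 1-minute grid; the missing minutes appear as NaN rows.
regular = observed.reindex(full_index)
n_missing = int(regular["sensor_1"].isna().sum())

# Fill the gaps by linear interpolation between neighbouring readings.
repaired = regular.interpolate(method="linear")
print(n_missing, len(observed), len(repaired))
```

Interpolation is reasonable only for short gaps in slowly varying signals; for long outages, dropping the affected span may be the safer choice.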
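
5. Handling duplicate rows

df = DataFrame(data=np.random.randint(0,100,size=(8,6)))
# Overwrite rows 1, 3 and 5 with identical values to create duplicates
df.iloc[1] = [1,1,1,1,1,1]
df.iloc[3] = [1,1,1,1,1,1]
df.iloc[5] = [1,1,1,1,1,1]
df
# Detect which rows duplicate an earlier row
df.duplicated(keep='first')
# Keep only the rows that are not flagged as duplicates
df.loc[~df.duplicated(keep='first')]
# Remove the duplicates in one step
df.drop_duplicates(keep='first')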
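drop_duplicates also accepts other keep values and a subset of columns to compare. The sketch below shows how these options change which rows survive. The toy frame is hypothetical and not part of the article's data.

```python
import pandas as pd

# Hypothetical toy frame: row 2 is an exact copy of row 0; row 3 shares only the id.
toy = pd.DataFrame({"id": [1, 2, 1, 1], "value": [10, 20, 10, 30]})

first_kept = toy.drop_duplicates(keep="first")  # keep the first of each duplicate group
last_kept = toy.drop_duplicates(keep="last")    # keep the last of each duplicate group
none_kept = toy.drop_duplicates(keep=False)     # drop every row that has a duplicate
by_id = toy.drop_duplicates(subset=["id"])      # compare on the "id" column only

print(list(first_kept.index), list(last_kept.index))
print(list(none_kept.index), list(by_id.index))
```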
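
6. Handling outliers

df = DataFrame(data=np.random.random(size=(1000,3)),columns=['A','B','C'])
df.head()
# Define the outlier condition: values in column C greater than twice its standard deviation
twice_std = df['C'].std() * 2
twice_std
# Keep only the rows that do not meet the outlier condition
df.loc[~(df['C'] > twice_std)]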
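The rule above flags any value in column C that exceeds twice the column's standard deviation. A more common convention measures the distance from the mean instead, flagging values more than two standard deviations away from it in either direction. The sketch below swaps in that rule. It is an alternative to the article's condition, not a restatement of it, and the normally distributed data and the name df_demo are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["A", "B", "C"])

col = df_demo["C"]
# Absolute distance of each value from the column mean.
deviation = (col - col.mean()).abs()
# Flag rows lying more than two standard deviations from the mean.
is_outlier = deviation > 2 * col.std()

# Keep only the rows that are not flagged.
cleaned = df_demo.loc[~is_outlier]
print(len(df_demo), len(cleaned))
```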
