
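A Detailed Guide to Cleaning Data in Python with Pandas and NumPy

Updated: April 13, 2022, 11:11:02. Author: Mr數據楊

Many data scientists estimate that the initial steps of obtaining and cleaning data make up 80% of the job, with a great deal of time spent cleaning datasets and reducing them to a usable form. This article uses Python's Pandas and NumPy libraries to clean data and can serve as a reference when you need one.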

Preparation

Once the modules are imported, the actual data preprocessing can begin.

import pandas as pd
import numpy as np

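Dropping DataFrame Columns

You will often find that not every category of data in a dataset is useful. For example, you might have a dataset of student information (name, grade, standard, parents' names, and address) but want to focus on analyzing student grades. In that case the address and the parents' names are not important, and keeping them takes up unnecessary space.

The examples below work with the BL-Flickr-Images-Book.csv data.

df = pd.read_csv('數據科學必備Pandas、NumPy進行數據清洗/BL-Flickr-Images-Book.csv')
df.head()

The columns Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks carry no information that helps the analysis, so they can all be dropped in one step.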

to_drop_column = ['Edition Statement',
                  'Corporate Author',
                  'Corporate Contributors',
                  'Former owner',
                  'Engraver',
                  'Contributors',
                  'Issuance type',
                  'Shelfmarks']

df.drop(to_drop_column, inplace=True, axis=1)
df.head()
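The CSV file is not needed to try the same step. The following is a minimal sketch on a small made-up DataFrame (the rows and the name `books` are assumptions for illustration, not data from the real file). It uses the `columns=` keyword, which is equivalent to passing `axis=1`, and returns a new DataFrame instead of modifying in place.

```python
import pandas as pd

# Hypothetical toy data standing in for the book dataset.
books = pd.DataFrame({
    "Identifier": [206, 216],
    "Title": ["Book A", "Book B"],
    "Engraver": [None, None],
    "Shelfmarks": ["X 1", "X 2"],
})

# Without inplace=True, drop() returns a new DataFrame
# and leaves the original untouched.
trimmed = books.drop(columns=["Engraver", "Shelfmarks"])

print(list(trimmed.columns))
```

Returning a new object makes it easy to compare the frame before and after the drop, at the cost of holding both in memory.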

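Changing the DataFrame Index

A Pandas Index extends the functionality of NumPy arrays to allow more versatile slicing and labeling. In many cases it helps to use a uniquely valued identifying field of the data as its index.

First check that the identifier column is unique.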

df['Identifier'].is_unique
True

Replace the existing index with the Identifier column.

df = df.set_index('Identifier')
df.head()

206 is the first label in the index. That row can be accessed by label with df.loc[206], or by position with df.iloc[0].
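The difference between label-based and position-based access can be sketched on a small made-up frame (the titles and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data; only the identifiers mirror the article.
books = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Book A", "Book B", "Book C"],
})

# Only use a column as the index if its values are unique.
if books["Identifier"].is_unique:
    indexed = books.set_index("Identifier")

by_label = indexed.loc[206]    # look up the row whose index label is 206
by_position = indexed.iloc[0]  # look up the first row, whatever its label

print(by_label.equals(by_position))
```

Both lookups return the same row here because label 206 happens to sit at position 0.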

Tidying DataFrame Fields

Here we clean specific columns and convert them to a uniform format, both to understand the dataset better and to enforce consistency.

Looking at the Date of Publication column, we find that its format is not uniform.

df.loc[1905:, 'Date of Publication'].head(10)

Identifier
1905           1888
1929    1839, 38-54
2836           1897
2854           1865
2956        1860-63
2957           1873
3017           1866
3131           1899
4598           1814
4884           1820
Name: Date of Publication, dtype: object

We can use a regular expression to extract the four digits at the start of each value directly.

extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
extr.head()

Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

Finally, convert the column to a numeric type.

df['Date of Publication'] = pd.to_numeric(extr)
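To see what happens to values that do not start with four digits, here is a minimal sketch on a made-up Series (the sample strings are assumptions chosen to cover the messy cases). Values with no match become NaN, which is why the numeric result ends up as floating point.

```python
import pandas as pd

# Hypothetical sample of messy date strings, including a missing value.
dates = pd.Series(["1888", "1839, 38-54", "1860-63", "[1897?]", None])

# Capture four digits anchored at the start of the string;
# rows with no match (or missing input) become NaN.
years = dates.str.extract(r"^(\d{4})", expand=False)

# Convert the extracted strings to numbers; NaN stays NaN.
numeric_years = pd.to_numeric(years)

print(numeric_years)
```

Note that "[1897?]" is not matched because the pattern is anchored with `^`; dropping the anchor would pick up the year but could also match digits that are not a year.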

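Cleaning Columns with str Methods and NumPy

Consider df['Date of Publication'].str. This attribute is a way to access fast string operations in Pandas that largely mimic operations on native Python strings or compiled regular expressions, such as .split(), .replace(), and .capitalize().

To clean the Place of Publication field, we can combine Pandas str methods with NumPy's np.where function, which is essentially a vectorized form of Excel's IF() macro.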

np.where(condition, then, else)

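Here condition is either an array-like object or a Boolean mask. then is the value to use where condition evaluates to True, and else is the value to use otherwise.

In essence, np.where() takes each element of the object used for condition, checks whether that element evaluates to True in the context of the condition, and returns an ndarray containing then or else, whichever applies. Calls can be nested into a compound if-then statement, which allows values to be computed from multiple conditions.

Now look at the Place of Publication data.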
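As a rough illustration of nesting, the sketch below labels a few made-up years by century (the array and the labels are assumptions for illustration). The inner call only decides the outcome for elements where the outer condition is False.

```python
import numpy as np

# Hypothetical publication years.
years = np.array([1788, 1851, 1905])

# Outer condition first; the inner np.where supplies the "else" branch.
century = np.where(
    years < 1800, "18th century",
    np.where(years < 1900, "19th century", "20th century"),
)

print(century)
```

Both branches are evaluated for the whole array before the selection happens, so np.where is a selection tool rather than a way to skip work.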

df['Place of Publication'].head(10)

Identifier
206                                  London
216                London; Virtue & Yorston
218                                  London
472                                  London
480                                  London
481                                  London
519                                  London
667     pp. 40. G. Bryan & Co: Oxford, 1898
874                                 London]
1143                                 London
Name: Place of Publication, dtype: object

Use a substring check to pick out the rows we need.

pub = df['Place of Publication']
london = pub.str.contains('London')
london[:5]

Identifier
206    True
216    True
218    True
472    True
480    True
Name: Place of Publication, dtype: bool

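The column can then be rewritten with np.where: rows that contain London become exactly 'London', and all other rows have their hyphens replaced with spaces.

df['Place of Publication'] = np.where(london, 'London',
                                      pub.str.replace('-', ' '))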

Identifier
206                     London
216                     London
218                     London
472                     London
480                     London
                  ...
4158088                 London
4158128                  Derby
4159563                 London
4159587    Newcastle upon Tyne
4160339                 London
Name: Place of Publication, Length: 8287, dtype: object
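The same combination can be tried without the CSV file. This is a minimal sketch on made-up place strings (the values and variable names are assumptions for illustration). It passes `regex=False` so that the search text and the hyphen are treated as literal characters, and wraps the ndarray returned by np.where back into a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical raw place names.
places = pd.Series(
    ["London", "London; Virtue & Yorston", "Newcastle-upon-Tyne", "London]"]
)

# Boolean mask: True where the text contains "London".
is_london = places.str.contains("London", regex=False)

# np.where returns a plain ndarray, so rebuild a Series on the same index.
cleaned = pd.Series(
    np.where(is_london, "London", places.str.replace("-", " ", regex=False)),
    index=places.index,
)

print(cleaned.tolist())
```

If the column contained missing values, the mask would contain them too; passing `na=False` to `str.contains` is one way to treat those rows as non-matches.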

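Cleaning Data with the apply Function

In some situations you need to apply a custom function to each cell or element of a DataFrame. The Pandas .apply() method is similar to the built-in map() function. Called on a single column (a Series), as it is here, it applies the function to every element of that column.

For example, to reformat the publication date as xxxx年 (the year followed by the character 年, meaning "year"), we can use apply.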

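def clean_date(text):
    # Append "年" to values that can be read as an integer year
    try:
        return str(int(text)) + "年"
    # NaN raises ValueError and None raises TypeError; return those unchanged
    except (ValueError, TypeError):
        return text

df["new_date"] = df["Date of Publication"].apply(clean_date)
df["new_date"]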

Identifier
206        1879年
216        1868年
218        1869年
472        1851年
480        1857年
           ...  
4158088    1838年
4158128    1831年
4159563      NaN
4159587    1834年
4160339    1834年
Name: new_date, Length: 8287, dtype: object
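It can help to see how apply behaves on a Series versus a whole DataFrame. The sketch below uses a small made-up frame (the column names and values are assumptions for illustration): on a Series the function receives one element at a time, while on a DataFrame it receives one column at a time by default.

```python
import pandas as pd

# Hypothetical text columns with stray whitespace.
frame = pd.DataFrame({
    "Place": [" London ", "Derby "],
    "Publisher": [" Virtue", "Smith "],
})

# Series.apply: len() is called once per element.
place_lengths = frame["Place"].apply(len)

# DataFrame.apply: the lambda is called once per column (a Series),
# so vectorized .str methods can be used inside it.
stripped = frame.apply(lambda column: column.str.strip())

print(place_lengths.tolist())
print(stripped)
```

Where a vectorized method such as `.str.strip()` exists, it is usually preferable to an element-wise Python function, which runs once per value.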

Skipping Rows in a DataFrame

olympics_df = pd.read_csv('數據科學必備Pandas、NumPy進行數據清洗/olympics.csv')
olympics_df.head()

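When reading the data you can pass a parameter to skip unwanted rows, such as the row at index 0. Passing header=1 tells Pandas to take the column names from row 1 and to discard everything above it.

olympics_df = pd.read_csv('數據科學必備Pandas、NumPy進行數據清洗/olympics.csv', header=1)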
olympics_df.head()
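The effect of `header=1` can be checked on an in-memory file. This is a minimal sketch with made-up contents (the countries and counts are placeholders, not figures from olympics.csv), where the first line is a junk row of column numbers.

```python
import io

import pandas as pd

# Hypothetical file contents: row 0 is junk, row 1 holds the real header.
raw = io.StringIO(
    "0,1,2\n"
    "Country,Gold,Silver\n"
    "Country A,1,2\n"
    "Country B,3,4\n"
)

# header=1 uses the second line for column names and drops the line above it.
medals = pd.read_csv(raw, header=1)

print(medals)
```

`skiprows=1` would give the same result for this input; `header` is the more direct choice when the goal is to say where the column names live.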

Renaming DataFrame Columns

new_names =  {'Unnamed: 0': 'Country',
              '? Summer': 'Summer Olympics',
              '01 !': 'Gold',
              '02 !': 'Silver',
              '03 !': 'Bronze',
              '? Winter': 'Winter Olympics',
              '01 !.1': 'Gold.1',
              '02 !.1': 'Silver.1',
              '03 !.1': 'Bronze.1',
              '? Games': '# Games',
              '01 !.2': 'Gold.2',
              '02 !.2': 'Silver.2',
              '03 !.2': 'Bronze.2'}

olympics_df.rename(columns=new_names, inplace=True)

olympics_df.head()
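A small sketch of how rename treats the mapping, using a made-up one-row frame (the data and the extra key 'no such column' are assumptions for illustration). Keys that do not match any column are ignored by default, so a typo in the mapping fails silently.

```python
import pandas as pd

# Hypothetical frame with a few of the awkward column names.
medals = pd.DataFrame({"Unnamed: 0": ["Country A"], "01 !": [1], "01 !.1": [0]})

mapping = {
    "Unnamed: 0": "Country",
    "01 !": "Gold",
    "01 !.1": "Gold.1",
    "no such column": "Ignored",  # not present, silently skipped
}

# Without inplace=True, rename() returns a new DataFrame.
renamed = medals.rename(columns=mapping)

print(list(renamed.columns))
```

Passing `errors="raise"` makes rename raise a KeyError for mapping keys that are missing from the frame, which is a useful guard when the column names come from a messy file.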