快捷導(dǎo)航

使用Python自動處理Excel數(shù)據(jù)缺失值的完整指南

更新時間：2025年04月28日 09:47:58 作者：東方佑

在數(shù)據(jù)分析中,數(shù)據(jù)清洗是至關(guān)重要的第一步,尤其在處理 Excel 文件時,缺失值常常會阻礙分析流程,本文將通過一個自動化腳本,展示如何使用 Python 的 pandas 和 scikit-learn 庫,快速填充或刪除 Excel 文件中的缺失值,并最終生成干凈的數(shù)據(jù)集,需要的朋友可以參考下

步驟 1：讀取 Excel 文件
步驟 2：分離數(shù)值型和類別型數(shù)據(jù)
步驟 3：填充數(shù)值型缺失值（均值填充）
步驟 4：填充類別型缺失值（眾數(shù)填充）
步驟 5：合并處理后的數(shù)據(jù)
步驟 6：保存清洗后的數(shù)據(jù)

4. 注意事項與擴展

注意事項
擴展功能

5. 完整代碼與示例

總結(jié)

1. 問題背景

在分析 Excel 文件時，缺失值可能以 NaN、空單元格或特殊符號（如 ?）的形式存在。手動處理這些缺失值耗時且容易出錯，因此需要自動化解決方案。例如，你可能遇到以下場景：

銷售數(shù)據(jù)：某個月份的銷售額未記錄。
用戶調(diào)研表：部分受訪者未填寫年齡或性別。
傳感器數(shù)據(jù)：設(shè)備故障導(dǎo)致部分時間點無記錄。

2. 核心工具與原理

工具選擇

pandas：Python 數(shù)據(jù)處理的標準庫，用于讀取 Excel 文件和數(shù)據(jù)操作。
scikit-learn：機器學(xué)習(xí)庫中的 SimpleImputer 模塊，提供缺失值填充的自動化方法。

填充策略

數(shù)值型數(shù)據(jù)：用列均值（mean）或中位數(shù)（median）填充。
類別型數(shù)據(jù)：用眾數(shù)（most_frequent）填充。
極端情況：若缺失值占比過高，可直接刪除該列或行。

3. 代碼實現(xiàn)步驟詳解

以下是基于你提供的代碼的完整實現(xiàn)流程：

步驟 1：讀取 Excel 文件

import pandas as pd

# 讀取 Excel 文件
df = pd.read_excel("mx-toys.xlsx")

步驟 2：分離數(shù)值型和類別型數(shù)據(jù)

# 分離數(shù)值型和非數(shù)值型列
numeric_cols = df.select_dtypes(include=['number']).columns
categorical_cols = df.select_dtypes(exclude=['number']).columns

步驟 3：填充數(shù)值型缺失值（均值填充）

from sklearn.impute import SimpleImputer

# 創(chuàng)建數(shù)值型填充器（均值策略）
numeric_imputer = SimpleImputer(strategy='mean')

# 填充數(shù)值列并轉(zhuǎn)換為 DataFrame
df_numeric = pd.DataFrame(
    numeric_imputer.fit_transform(df[numeric_cols]),
    columns=numeric_cols
)

步驟 4：填充類別型缺失值（眾數(shù)填充）

# 創(chuàng)建類別型填充器（眾數(shù)策略）
categorical_imputer = SimpleImputer(strategy='most_frequent')

# 填充類別列并轉(zhuǎn)換為 DataFrame
df_categorical = pd.DataFrame(
    categorical_imputer.fit_transform(df[categorical_cols]),
    columns=categorical_cols
)

步驟 5：合并處理后的數(shù)據(jù)

# 合并數(shù)值和類別數(shù)據(jù)
df_cleaned = pd.concat([df_numeric, df_categorical], axis=1)

步驟 6：保存清洗后的數(shù)據(jù)

# 保存為新的 Excel 文件
df_cleaned.to_excel("cleaned_mx-toys.xlsx", index=False)

4. 注意事項與擴展

注意事項

數(shù)據(jù)類型檢查：
- 確保 select_dtypes 正確分離數(shù)值和類別列（如 object 類型可能包含文本或日期，需額外處理）。
異常值檢測：
- 填充均值可能受異常值影響，可改用中位數(shù)（strategy='median'）。
刪除策略：
- 若某列缺失值過多（如超過 50%），可直接刪除：

df = df.dropna(thresh=len(df)*0.5, axis=1)

擴展功能

可視化缺失值分布：
使用 missingno 庫快速查看缺失值分布：

import missingno as msno
msno.matrix(df).show()

自定義填充邏輯：
- 對于時間序列數(shù)據(jù)，可使用插值法（interpolate()）。
- 對于類別型數(shù)據(jù)，可填充特定值（如 N/A）：

df_categorical.fillna("Unknown", inplace=True)

5. 完整代碼與示例

import pandas as pd
from sklearn.impute import SimpleImputer

def clean_excel_file(file_path, output_path):
    """
    自動處理 Excel 文件中的缺失值：
    1. 數(shù)值型列填充均值
    2. 類別型列填充眾數(shù)
    3. 保存清洗后的數(shù)據(jù)
    """
    # 讀取數(shù)據(jù)
    df = pd.read_excel(file_path)
    
    # 分離數(shù)值和類別列
    numeric_cols = df.select_dtypes(include=['number']).columns
    categorical_cols = df.select_dtypes(exclude=['number']).columns
    
    # 處理數(shù)值列
    numeric_imputer = SimpleImputer(strategy='mean')
    df_numeric = pd.DataFrame(
        numeric_imputer.fit_transform(df[numeric_cols]),
        columns=numeric_cols
    )
    
    # 處理類別列
    categorical_imputer = SimpleImputer(strategy='most_frequent')
    df_categorical = pd.DataFrame(
        categorical_imputer.fit_transform(df[categorical_cols]),
        columns=categorical_cols
    )
    
    # 合并數(shù)據(jù)并保存
    df_cleaned = pd.concat([df_numeric, df_categorical], axis=1)
    df_cleaned.to_excel(output_path, index=False)
    print(f"數(shù)據(jù)已清洗并保存至 {output_path}")

# 使用示例
clean_excel_file("mx-toys.xlsx", "cleaned_mx-toys.xlsx")