Machine Learning: A Summary of Data Cleaning and Six Ways to Handle Missing Values
1 Data Cleaning
1.1 Concept
Data cleaning is the process of identifying, and then correcting or removing, data in a dataset that is inaccurate, incomplete, inconsistently formatted, or in conflict with business rules. It is a key part of data preprocessing: its goal is to raise data quality and ensure consistency and accuracy, providing a reliable foundation for downstream work such as data analysis, data mining, and machine learning. Data cleaning is an iterative process and may need several rounds of adjustment and tuning before the result is satisfactory.
1.2 Importance
Data cleaning is a crucial stage of data processing: it turns raw data into a usable, reliable, and meaningful form that can support further analysis and mining.
It is also a key step in data science and data analytics, because it directly affects the accuracy and reliability of every subsequent result. Dirty data can lead to wrong conclusions and wrong decisions.
1.3 Points to Check
- 1. Completeness: check whether individual records contain missing values and whether all required fields are present.
- 2. Comprehensiveness: look at all values in a column and judge whether the data is complete by comparing the maximum, minimum, mean, and the field's definition.
- 3. Validity: check whether the type, content, and magnitude of values conform to the expected rules. For example, a human age above 1000 years is invalid.
- 4. Uniqueness: check whether records are duplicated, for example the same person recorded several times.
- 5. Reliability of the categories: check whether the class labels themselves can be trusted. A minimal pandas sketch of several of these checks follows this list.
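For illustration only, here is a hedged, minimal sketch of such checks in pandas. The toy DataFrame and its column names (id, age) are assumptions for this example and are not part of the mineral dataset used later in this article.

import pandas as pd

# Hypothetical toy data; column names are illustrative only
df = pd.DataFrame({
    "id":  [1, 2, 2, 3],
    "age": [25, None, None, 1200],
})

print(df.isnull().sum())                        # completeness: missing values per column
print(df["age"].describe())                     # comprehensiveness: min / max / mean at a glance
print(df[(df["age"] < 0) | (df["age"] > 120)])  # validity: ages outside a plausible range
print(df[df.duplicated()])                      # uniqueness: fully duplicated rows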
2 Checking for Missing Values, Type Conversion, and Standardization
2.1 Missing values stored as real NaN
- null_num = data.isnull() marks every cell, returning True where the value is missing
- null_all = null_num.sum() counts the missing values in each column
2.2 Missing values stored as a placeholder string
- When missing entries are recorded as the string 'NA', first replace the placeholder with NaN, then count again (a self-contained toy example of this workflow follows the results below):
data.replace('NA', np.nan, inplace=True)  # requires: import numpy as np
null_num = data.isnull()
null_all = null_num.sum()
2.2.1 Results
Debug output (screenshots omitted): the contents of null_num and null_all, and the original data after the replacement.
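A minimal, self-contained sketch of this workflow on toy data (the column names are assumptions for this example):

import numpy as np
import pandas as pd

# Toy data in which missing entries were recorded as the string 'NA'
data = pd.DataFrame({"feature1": [1.0, "NA", 3.0],
                     "feature2": ["NA", 2.0, 2.5]})

data.replace("NA", np.nan, inplace=True)   # turn the placeholder string into a real NaN
null_num = data.isnull()                   # True where a value is missing
null_all = null_num.sum()                  # missing-value count per column
print(null_all)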
2.3 Type Conversion and Standardization
- Convert feature columns to numeric types (values that cannot be parsed become NaN):
pd.to_numeric(data, errors='coerce')
- Standardization (zero mean, unit variance):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all)
A toy sketch of both steps together appears after the debug output below.
Debug output (screenshot omitted): the standardized array x_all_z.
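A hedged, minimal sketch of both steps on toy data (the column names f1 and f2 are assumptions for this example):

import pandas as pd
from sklearn.preprocessing import StandardScaler

x_all = pd.DataFrame({"f1": ["1.2", "3.4", "oops"],
                      "f2": ["10", "20", "30"]})

# Coerce every feature column to numeric; unparseable values become NaN
for column_name in x_all.columns:
    x_all[column_name] = pd.to_numeric(x_all[column_name], errors="coerce")

# Standardize to zero mean / unit variance; fill the remaining NaN first
# (recent scikit-learn versions also let StandardScaler pass NaN through)
scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all.fillna(x_all.mean()))
print(x_all_z)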
3 Six Ways to Handle Missing Values
3.1 Dataset Description
A sample of the data is shown below (screenshot omitted): the first column is the row ID ('序號'), the last column is the class label ('礦物類型'), and the remaining columns are feature variables.
3.2 Imports and Variables
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
train_data, train_label, test_data, test_label are the training-set features, training-set labels, test-set features, and test-set labels, respectively; the label column is named '礦物類型'.
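A hedged sketch of how these four variables might be produced; the split parameters simply mirror the full script in section 5 (where the same variables are named x_train, x_test, y_train, y_test):

from sklearn.model_selection import train_test_split

# x_all / y_all: the standardized features and the encoded '礦物類型' labels from section 2.3
train_data, test_data, train_label, test_label = train_test_split(
    x_all, y_all, test_size=0.3, random_state=50000)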
3.3 Filling by Dropping Rows (Complete-Case Analysis)
# Complete-case analysis: drop every row that contains a missing value
def cca_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    # reset_index() renumbers the rows consecutively
    data = data.reset_index(drop=True)
    # dropna() removes rows that contain missing values
    df_filled = data.dropna()
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def cca_test_fill(train_data, train_label, test_data, test_label):
    data = pd.concat([test_data, test_label], axis=1)
    data = data.reset_index(drop=True)
    df_filled = data.dropna()
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
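A hedged usage sketch, assuming the variables from section 3.2; the same call pattern applies to every pair of fill functions below (in the full script in section 5 they are imported from the file_data module):

# Drop incomplete rows from the training set, then from the test set
x_train_fill, y_train_fill = cca_train_fill(train_data, train_label)
x_test_fill, y_test_fill = cca_test_fill(x_train_fill, y_train_fill, test_data, test_label)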
3.4 Mean Filling
# Mean filling
# Fill the training set with per-class column means
def mean_train_method(data):
    # Column means of the data
    fill_values = data.mean()
    # fillna(mean) fills every missing value with its column mean
    return data.fillna(fill_values)

def mean_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mean_train_method(A)
    B = mean_train_method(B)
    C = mean_train_method(C)
    D = mean_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

# Fill the test set; the test set must be filled with the training set's means
def mean_test_method(train_data, test_data):
    fill_values = train_data.mean()
    return test_data.fillna(fill_values)

def mean_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    # Fill each test-set class using statistics from the corresponding training class
    A = mean_test_method(A_train, A_test)
    B = mean_test_method(B_train, B_test)
    C = mean_test_method(C_train, C_test)
    D = mean_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
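Two design choices are worth noting. First, the fill statistics are computed separately for each value of '礦物類型' (classes 0-3), so every class is filled from its own distribution. Second, the test set is always filled with values computed on the training set, which avoids leaking test-set information into preprocessing. The median and mode fills below follow exactly the same structure.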
3.5 Median Filling
# Median filling
def median_train_method(data):
    fill_values = data.median()
    return data.fillna(fill_values)

def median_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = median_train_method(A)
    B = median_train_method(B)
    C = median_train_method(C)
    D = median_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def median_test_method(train_data, test_data):
    fill_values = train_data.median()
    return test_data.fillna(fill_values)

def median_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = median_test_method(A_train, A_test)
    B = median_test_method(B_train, B_test)
    C = median_test_method(C_train, C_test)
    D = median_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
3.6 Mode Filling
# Mode filling
def mode_train_method(data):
    # apply() runs the function on each column: if the column has at least one mode,
    # use the first mode as the fill value, otherwise fill with None
    fill_values = data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return data.fillna(fill_values)

def mode_train_fill(train_data, train_label):
    data = pd.concat([train_data, train_label], axis=1)
    data = data.reset_index(drop=True)
    A = data[data['礦物類型'] == 0]
    B = data[data['礦物類型'] == 1]
    C = data[data['礦物類型'] == 2]
    D = data[data['礦物類型'] == 3]
    A = mode_train_method(A)
    B = mode_train_method(B)
    C = mode_train_method(C)
    D = mode_train_method(D)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型

def mode_test_method(train_data, test_data):
    fill_values = train_data.apply(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else None)
    return test_data.fillna(fill_values)

def mode_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    A_train = train_data_all[train_data_all['礦物類型'] == 0]
    B_train = train_data_all[train_data_all['礦物類型'] == 1]
    C_train = train_data_all[train_data_all['礦物類型'] == 2]
    D_train = train_data_all[train_data_all['礦物類型'] == 3]
    A_test = test_data_all[test_data_all['礦物類型'] == 0]
    B_test = test_data_all[test_data_all['礦物類型'] == 1]
    C_test = test_data_all[test_data_all['礦物類型'] == 2]
    D_test = test_data_all[test_data_all['礦物類型'] == 3]
    A = mode_test_method(A_train, A_test)
    B = mode_test_method(B_train, B_test)
    C = mode_test_method(C_train, C_test)
    D = mode_test_method(D_train, D_test)
    df_filled = pd.concat([A, B, C, D])
    df_filled = df_filled.reset_index(drop=True)
    return df_filled.drop('礦物類型', axis=1), df_filled.礦物類型
3.7 Linear-Regression Filling
def lr_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('礦物類型', axis=1)
    # Count missing values per column
    null_num = train_data_x.isnull().sum()
    # Sort the columns by their number of missing values, ascending
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        # Only columns that actually contain missing values need filling
        if null_num_sorted[i] != 0:
            # x: the columns handled so far, excluding the current column
            x = train_data_x[filling_feature].drop(i, axis=1)
            # y: the current column (the one being filled)
            y = train_data_x[i]
            # Row indices where the current column is missing
            row_numbers_null_list = train_data_x[train_data_x[i].isnull()].index.tolist()
            # Training rows: those where the current column is present
            x_train = x.drop(row_numbers_null_list)
            y_train = y.drop(row_numbers_null_list)
            # Prediction rows: those where the current column is missing
            x_test = x.iloc[row_numbers_null_list]
            lr = LinearRegression()
            lr.fit(x_train, y_train)
            # Predict the missing values and write them back
            y_pr = lr.predict(x_test)
            train_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} of the training set')
    return train_data_x, train_data_all.礦物類型

def lr_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('礦物類型', axis=1)
    test_data_x = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            x_train = train_data_x[filling_feature].drop(i, axis=1)
            y_train = train_data_x[i]
            x_test = test_data_x[filling_feature].drop(i, axis=1)
            row_numbers_null_list = test_data_x[test_data_x[i].isnull()].index.tolist()
            x_test = x_test.iloc[row_numbers_null_list]
            lr = LinearRegression()
            # Fit on the (already filled) training set, then fill the test set
            lr.fit(x_train, y_train)
            y_pr = lr.predict(x_test)
            test_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} of the test set')
    return test_data_x, test_data_all.礦物類型
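The idea behind both functions: columns are processed in order of increasing missing-value count, and each column with gaps is regressed on the columns handled before it, which are either complete or already filled. For the test set, each regressor is fitted on the training data, so lr_test_fill is meant to receive the output of lr_train_fill as its first two arguments, as the full script in section 5 does.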
3.8 Random-Forest Filling
# Random-forest filling
def Random_train_fill(train_data, train_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('礦物類型', axis=1)
    null_num = train_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            x = train_data_x[filling_feature].drop(i, axis=1)
            y = train_data_x[i]
            row_numbers_null_list = train_data_x[train_data_x[i].isnull()].index.tolist()
            x_train = x.drop(row_numbers_null_list)
            y_train = y.drop(row_numbers_null_list)
            x_test = x.iloc[row_numbers_null_list]
            lr = RandomForestRegressor(n_estimators=100, max_features=0.8, random_state=314, n_jobs=-1)
            lr.fit(x_train, y_train)
            y_pr = lr.predict(x_test)
            train_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} of the training set')
    return train_data_x, train_data_all.礦物類型

def Random_test_fill(train_data, train_label, test_data, test_label):
    train_data_all = pd.concat([train_data, train_label], axis=1)
    train_data_all = train_data_all.reset_index(drop=True)
    test_data_all = pd.concat([test_data, test_label], axis=1)
    test_data_all = test_data_all.reset_index(drop=True)
    train_data_x = train_data_all.drop('礦物類型', axis=1)
    test_data_x = test_data_all.drop('礦物類型', axis=1)
    null_num = test_data_x.isnull().sum()
    null_num_sorted = null_num.sort_values(ascending=True)
    filling_feature = []
    for i in null_num_sorted.index:
        filling_feature.append(i)
        if null_num_sorted[i] != 0:
            x_train = train_data_x[filling_feature].drop(i, axis=1)
            y_train = train_data_x[i]
            x_test = test_data_x[filling_feature].drop(i, axis=1)
            row_numbers_null_list = test_data_x[test_data_x[i].isnull()].index.tolist()
            x_test = x_test.iloc[row_numbers_null_list]
            lr = RandomForestRegressor(n_estimators=100, max_features=0.8, random_state=314, n_jobs=-1)
            # Fit on the training data, then fill the test set's missing values
            lr.fit(x_train, y_train)
            y_pr = lr.predict(x_test)
            test_data_x.loc[row_numbers_null_list, i] = y_pr
            print(f'Finished cleaning column {i} of the test set')
    return test_data_x, test_data_all.礦物類型
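Compared with the linear version, only the regressor changes: a RandomForestRegressor with 100 trees, max_features=0.8, a fixed random_state, and n_jobs=-1 for parallel training. The surrounding fill logic (order the columns by missing count, fit on the complete rows, predict the gaps) is identical.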
4 Saving the Data
The datasets produced by the different filling methods should be saved separately; just change the text inside the [ ] in the file names.
Code:
# Shuffle the filled training and test sets, then save them to Excel
data_train = pd.concat([ov_x_train, ov_y_train], axis=1).sample(frac=1, random_state=4)
data_test = pd.concat([x_test_fill, y_test_fill], axis=1).sample(frac=1, random_state=4)
data_train.to_excel(r'./data_train_test/訓(xùn)練數(shù)據(jù)集[隨機森林回歸].xlsx', index=False)
data_test.to_excel(r'./data_train_test/測試數(shù)據(jù)集[隨機森林回歸].xlsx', index=False)
5 Putting the Code Together
For convenience, the filling functions above are packaged into a separate module, file_data, so they can be imported and reused.
Full code:
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import file_data
from imblearn.over_sampling import SMOTE

data = pd.read_excel('礦物數(shù)據(jù).xls')
data = data[data['礦物類型'] != 'E']
# Missing values are NaN
null_num = data.isnull()
# Count the missing values
null_all = null_num.sum()
x_all = data.drop('礦物類型', axis=1).drop('序號', axis=1)
y_all = data.礦物類型
# Encode the class labels as integers
label_dict = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
encod_labels = [label_dict[label] for label in y_all]
# Convert the encoded labels back to a Series
y_all = pd.Series(encod_labels, name='礦物類型')
# Convert the feature columns to numeric types
for column_name in x_all.columns:
    x_all[column_name] = pd.to_numeric(x_all[column_name], errors='coerce')
# Standardization
scaler = StandardScaler()
x_all_z = scaler.fit_transform(x_all)
x_all = pd.DataFrame(x_all_z, columns=x_all.columns)
x_train, x_test, y_train, y_test = \
    train_test_split(x_all, y_all, test_size=0.3, random_state=50000)

### Uncomment one pair at a time to use a different missing-value method
# Complete-case deletion (CCA)
# x_train_fill, y_train_fill = file_data.cca_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.cca_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Mean
# x_train_fill, y_train_fill = file_data.mean_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.mean_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Median
# x_train_fill, y_train_fill = file_data.median_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.median_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Mode
# x_train_fill, y_train_fill = file_data.mode_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.mode_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Linear regression (lr_train_fill / lr_test_fill)
# x_train_fill, y_train_fill = file_data.lr_train_fill(x_train, y_train)
# x_test_fill, y_test_fill = file_data.lr_test_fill(x_train_fill, y_train_fill, x_test, y_test)
# Random-forest regression
x_train_fill, y_train_fill = file_data.Random_train_fill(x_train, y_train)
x_test_fill, y_test_fill = file_data.Random_test_fill(x_train_fill, y_train_fill, x_test, y_test)

# Oversample the minority classes with SMOTE
oversampler = SMOTE(k_neighbors=1, random_state=42)
ov_x_train, ov_y_train = oversampler.fit_resample(x_train_fill, y_train_fill)

# Shuffle and save the data
data_train = pd.concat([ov_x_train, ov_y_train], axis=1).sample(frac=1, random_state=4)
data_test = pd.concat([x_test_fill, y_test_fill], axis=1).sample(frac=1, random_state=4)
data_train.to_excel(r'./data_train_test/訓(xùn)練數(shù)據(jù)集[隨機森林回歸].xlsx', index=False)
data_test.to_excel(r'./data_train_test/測試數(shù)據(jù)集[隨機森林回歸].xlsx', index=False)
Output of running each method in turn (screenshots omitted).
This concludes this summary of data cleaning and six ways to handle missing values in machine learning. For more on data cleaning and missing-value handling, search 腳本之家's earlier articles, and please keep supporting 腳本之家!