Random Forest Regression in Python, with Variable Importance Analysis and Ranking

Updated: 2023-02-19 10:10:24 Author: 瘋狂學習GIS

This article walks through implementing random forest (RF) regression in Python, together with analyzing and ranking the importance of each independent variable. Interested readers may find it useful.

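1 Code walkthrough by section

1.1 Modules and data preparation

First, import the required modules. Two of them, pydot and graphviz, are less commonly used; even with Anaconda I had to download and install them separately. If you also use Anaconda, see the post "Configuring the Python pydot and graphviz libraries in an Anaconda environment" for the installation steps.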

import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor

Next, we define the main variables used later in the code. There is no need to dwell on this part; skim it and move on, since the meaning of each variable becomes clear where it is used.

train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

Next, we import the input data.

Note that this article does not cover the following two data-processing steps in detail. When I wrote it, I had already run a deep learning regression on the same data, so I reuse the inputs prepared at that time. For these steps, see the other posts listed below.

Splitting the original data into training and test sets
One-hot encoding of categorical variables

Regarding these two steps: first, splitting the data into training and test sets is indispensable in machine learning and deep learning; see Section 2.4 of "Python TensorFlow deep learning regression code: DNNRegressor" or Section 2.3 of "Python TensorFlow deep neural network regression: keras.Sequential". Second, one-hot encoding of categorical variables is just as important for traditional machine learning methods such as random forest; see "One-hot encoding of categorical variables in Python".
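
For readers without those posts at hand, the two preprocessing steps can be sketched roughly as follows. This is only a minimal illustration, not the procedure used to prepare the article's data: the toy table, the categorical column 'Soil', and the 25% test share are all assumptions made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; 'Soil' is a made-up categorical column.
raw = pd.DataFrame({
    "ID": range(1, 9),
    "Soil": ["clay", "sand", "loam", "clay", "sand", "loam", "clay", "sand"],
    "Temp06": [21.0, 22.5, 20.1, 23.3, 19.8, 21.7, 22.2, 20.9],
    "Yield": [5200, 4800, 5100, 5500, 4600, 5000, 5300, 4700],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(raw, columns=["Soil"])

# Hold out 25% of the rows as the test set; the seed makes the split repeatable.
train_data, test_data = train_test_split(encoded, test_size=0.25, random_state=44)
```

Encoding before splitting, as done here, keeps the same dummy columns in both sets even if a category happens to be missing from one of them.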

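In this article, as noted above, we directly import data that is already stored in .csv files, already split into training and test sets, and already one-hot encoded. The first row of my data is the header, i.e. the name of each column. For a detailed explanation of the .csv import code, see the data import part of "Drawing pairwise joint distribution plots of multiple variables in Python".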

# Data import

'''
column_name=['EVI0610','EVI0626','EVI0712','EVI0728','EVI0813','EVI0829','EVI0914','EVI0930','EVI1016',
             'Lrad06','Lrad07','Lrad08','Lrad09','Lrad10',
             'Prec06','Prec07','Prec08','Prec09','Prec10',
             'Pres06','Pres07','Pres08','Pres09','Pres10',
             'SIF161','SIF177','SIF193','SIF209','SIF225','SIF241','SIF257','SIF273','SIF289',
             'Shum06','Shum07','Shum08','Shum09','Shum10',
             'Srad06','Srad07','Srad08','Srad09','Srad10',
             'Temp06','Temp07','Temp08','Temp09','Temp10',
             'Wind06','Wind07','Wind08','Wind09','Wind10',
             'Yield']
'''
train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

1.2 Separating features and labels

Features and labels are simply the independent and dependent variables. We separate them in both the training set and the test set.

# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

As shown, drop removes the label 'Yield' from the original data. It also removes 'ID', the sample number in the original data, which is of no further use and is therefore dropped along with the label. In addition, train_X_column_name stores the header of each feature column (the feature names) for later use.

1.3 Building, training, and predicting with the RF model

Next, we build the random forest model, train it, and use it to predict on the test set. Note that the important hyperparameters of a random forest (such as n_estimators below) have to be tuned by repeated trials. For tuning in MATLAB, see Section 1.1 of "Random forest (RF) regression and analysis of variable influence in MATLAB"; the Python approach will be covered in the next post.

# Build RF regression model

random_forest_model=RandomForestRegressor(n_estimators=200,random_state=random_forest_seed)
random_forest_model.fit(train_X,train_Y)

# Predict test set data

random_forest_predict=random_forest_model.predict(test_X)
random_forest_error=random_forest_predict-test_Y

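Here RandomForestRegressor builds the model and n_estimators is the number of trees. random_state is the seed for the model's randomness, which includes the Bootstrap sampling (random sampling with replacement) that the Bagging strategy uses to draw each tree's training samples; the samples a tree does not draw are its out-of-bag samples. fit trains the model, predict makes the predictions, and the last line computes the prediction error.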
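
The effect of random_state can be checked with a small experiment. The sketch below is an illustration only: it uses synthetic data from make_regression instead of the article's data, and the helper fit_predict is a made-up name.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the article's data (an assumption for illustration).
X, y = make_regression(n_samples=120, n_features=8, noise=10.0, random_state=0)

def fit_predict(seed):
    """Fit a small forest with the given seed and predict on the training inputs."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    model.fit(X, y)
    return model.predict(X)

pred_a = fit_predict(44)
pred_b = fit_predict(44)  # same seed: same bootstrap samples, same trees
pred_c = fit_predict(45)  # different seed: a different forest

same_seed_identical = bool(np.allclose(pred_a, pred_b))
```

This is why recording the seed of each run, as the article does when saving results to Excel, makes a run repeatable later.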

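1.4 Plotting predictions, computing and saving accuracy metrics

First, plot the predictions: a scatter plot of predicted against true values and a histogram of the error distribution. For an explanation of this code, see Section 2.9 of "Python TensorFlow deep learning regression code: DNNRegressor".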

# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

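The resulting two plots are shown below.

Next, compute and save the accuracy metrics. We use the Pearson correlation coefficient, the coefficient of determination (R²), and RMSE as accuracy metrics, and append the metrics from each model run to an Excel file. For details, again see Section 2.9 of "Python TensorFlow deep learning regression code: DNNRegressor".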

# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_write_sheet=excel_file.active  # write to the active worksheet
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,random_seed,random_forest_seed]
# Write one value per column into the first row below the existing data
for i in range(len(excel_write_content)):
    excel_write_sheet.cell(row=max_row+1,column=i+1,value=excel_write_content[i])
excel_file.save(write_excel_path)
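
To make the three metrics concrete, they can also be computed by hand with NumPy. The sketch below is an illustration under assumed toy values, not part of the original workflow; regression_metrics is a made-up helper name. The toy predictions are offset from the true values by a constant, which shows that the Pearson coefficient and R² measure different things.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (Pearson r, R^2, RMSE) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Pearson r: linear correlation between true and predicted values.
    pearson_r = np.corrcoef(y_true, y_pred)[0, 1]
    # R^2: 1 minus residual sum of squares over total sum of squares.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # RMSE: square root of the mean squared error.
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return pearson_r, r2, rmse

# Made-up values: every prediction is exactly 1 too high.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
r, r2, rmse = regression_metrics(y_true, y_pred)
```

With these values the correlation is perfect (r = 1) while R² is only 0.2 and RMSE is 1, so a high Pearson coefficient alone does not guarantee accurate predictions.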

1.5 Visualizing a decision tree

In this part we use DOT, a graph description language, to draw one of the decision trees in the random forest.

# Draw decision tree visualizing plot

random_forest_tree=random_forest_model.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

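Here estimators_[5] is the sixth tree in the forest (indices start at 0). In other words, we pick one tree out of the many (the number of trees is the hyperparameter n_estimators mentioned earlier) and draw it as a demonstration, as shown in the figure below.

As you can see, even this single tree is very large. Zooming in on the top of the figure (the topmost node, i.e. the root node) reveals the information stored in each node, as shown in the figure below.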
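
When a full tree is too large to read, one option is to export only its top levels. The sketch below is a minimal example on synthetic data, not the article's own; the feature names are placeholders. It relies on the max_depth parameter of export_graphviz to limit how deep the drawing goes, and on out_file=None to get the DOT source back as a string.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Synthetic data and placeholder feature names (assumptions for illustration).
X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# out_file=None returns the DOT source as a string instead of writing a file;
# max_depth=2 draws only the root and the two levels below it.
dot_source = export_graphviz(model.estimators_[5], out_file=None,
                             feature_names=feature_names, max_depth=2,
                             rounded=True, precision=1)
```

The same max_depth argument can be added to the export_graphviz call in the article's code, which writes to a .dot file that pydot then converts to PNG.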

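A side note: the root node in the figure above shows samples=151, yet I have 315 samples in total. Why is the sample count of this tree not the full sample count?

This is the essence of random forest. The input data of each tree (the data in that tree's root node) are drawn at random, using the Bootstrap sampling in the Bagging strategy described above. Because Bootstrap draws with replacement, some samples are drawn several times and others not at all, so the number of distinct samples reaching a tree is smaller than the total. The results of all trees are then aggregated (this step is Aggregation; Bagging is short for Bootstrap Aggregating) to form the final result of the random forest.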
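
How many distinct samples a bootstrap draw contains can be estimated with a short simulation. This sketch is independent of the article's data; the sample size of 1000 and the helper name bootstrap_unique_fraction are assumptions for illustration. Drawing n indices with replacement from n samples leaves each sample out with probability (1 - 1/n)^n, so on average a fraction 1 - (1 - 1/n)^n of the samples appears at least once.

```python
import numpy as np

def bootstrap_unique_fraction(n_samples, seed=0):
    """Draw n_samples indices with replacement; return the share that is distinct."""
    rng = np.random.default_rng(seed)
    drawn = rng.integers(0, n_samples, size=n_samples)
    return np.unique(drawn).size / n_samples

n = 1000  # illustrative sample size, not the article's
fractions = [bootstrap_unique_fraction(n, seed=s) for s in range(200)]
mean_fraction = float(np.mean(fractions))

# Theoretical expectation; it approaches 1 - 1/e (about 0.632) for large n.
expected_fraction = 1 - (1 - 1 / n) ** n
```

The simulated mean should land close to the theoretical value, which is why a single tree typically sees noticeably fewer distinct samples than the training set contains.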

1.6 Variable importance analysis

Here we analyze variable importance and visualize it as a plot.

# Calculate the importance of variables

random_forest_importance=list(random_forest_model.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')

The resulting plot is shown below. Because I have so many features (independent variables), a little over 150, the x-axis labels (the variable names) overlap. With a typical, smaller number of variables this will not be a problem.
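
When there are too many variables to label, one possible workaround is to plot only the highest-ranked ones as horizontal bars. The sketch below is a suggestion, not the article's method: rank_importances and plot_top_importances are made-up helper names, and the importance values are invented for the example.

```python
def rank_importances(feature_names, importances, top_k=None):
    """Pair names with importances and sort from most to least important."""
    pairs = sorted(zip(feature_names, importances),
                   key=lambda pair: pair[1], reverse=True)
    return pairs if top_k is None else pairs[:top_k]

def plot_top_importances(ranked):
    """Draw horizontal bars for (name, importance) pairs, largest at the top."""
    # Imported here so the ranking helper also works without matplotlib.
    import matplotlib.pyplot as plt
    names = [name for name, _ in ranked][::-1]
    values = [value for _, value in ranked][::-1]
    fig, ax = plt.subplots()
    ax.barh(names, values)
    ax.set_xlabel('Importance')
    ax.set_ylabel('Variable')
    ax.set_title('Top Variable Importances')
    fig.tight_layout()
    return fig

# Invented importances for four of the article's column names.
names = ['EVI0610', 'Prec07', 'Temp08', 'SIF209']
values = [0.10, 0.45, 0.05, 0.40]
top3 = rank_importances(names, values, top_k=3)
```

With the fitted model from the article, the same helpers could be called as rank_importances(train_X_column_name, random_forest_importance, top_k=20) followed by plot_top_importances on the result.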

That concludes the section-by-section walkthrough of the code.

2 Complete code

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 21 22:05:37 2021

@author: fkxxgis
"""

import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor


# Attention! Data Partition
# Attention! One-Hot Encoding

train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

# Data import

'''
column_name=['EVI0610','EVI0626','EVI0712','EVI0728','EVI0813','EVI0829','EVI0914','EVI0930','EVI1016',
             'Lrad06','Lrad07','Lrad08','Lrad09','Lrad10',
             'Prec06','Prec07','Prec08','Prec09','Prec10',
             'Pres06','Pres07','Pres08','Pres09','Pres10',
             'SIF161','SIF177','SIF193','SIF209','SIF225','SIF241','SIF257','SIF273','SIF289',
             'Shum06','Shum07','Shum08','Shum09','Shum10',
             'Srad06','Srad07','Srad08','Srad09','Srad10',
             'Temp06','Temp07','Temp08','Temp09','Temp10',
             'Wind06','Wind07','Wind08','Wind09','Wind10',
             'Yield']
'''
train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

# Build RF regression model

random_forest_model=RandomForestRegressor(n_estimators=200,random_state=random_forest_seed)
random_forest_model.fit(train_X,train_Y)

# Predict test set data

random_forest_predict=random_forest_model.predict(test_X)
random_forest_error=random_forest_predict-test_Y

# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_write_sheet=excel_file.active  # write to the active worksheet
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,random_seed,random_forest_seed]
# Write one value per column into the first row below the existing data
for i in range(len(excel_write_content)):
    excel_write_sheet.cell(row=max_row+1,column=i+1,value=excel_write_content[i])
excel_file.save(write_excel_path)

# Draw decision tree visualizing plot

random_forest_tree=random_forest_model.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

# Calculate the importance of variables

random_forest_importance=list(random_forest_model.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')

