
Optimizing Random Forest (RF) Model Hyperparameters in Python: A Detailed Walkthrough

Updated: 2023-02-19 10:26:33  Author: 瘋狂學習GIS

This article presents Python code for random forest (RF) regression, together with code that automatically tunes the model's hyperparameters: the number of decision trees, the maximum tree depth, the minimum number of samples required to split a node, the minimum number of samples per leaf, the maximum number of features considered at each split, and so on.

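This article builds on the previous post, "Implementing Random Forest (RF) in Python and Comparing the Importance of Independent Variables". For that reason, only the code for automatic hyperparameter selection is explained in detail here; for detailed explanations of the other parts (data preparation, model building, accuracy evaluation), please refer to that post.

For the same workflow implemented in MATLAB, with code and a worked example, see the post "Random Forest (RF) Regression and Analysis of Independent Variable Influence in MATLAB".

This article has two parts: the first walks through the code section by section, and the second gives the complete code.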

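1 Code walkthrough

1.1 Data and model preparation

This part prepares the data and the model for the random forest algorithm. Since it was covered in detail in the previous post, I won't repeat it here; see "Implementing Random Forest (RF) in Python and Comparing the Importance of Independent Variables" for details.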

import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from pprint import pprint
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Attention! Data Partition
# Attention! One-Hot Encoding

train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

# Data import

train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

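1.2 Setting the hyperparameter ranges

First, we need to fix a range for each random forest hyperparameter; later we will determine the final optimal value of each hyperparameter within those ranges. In other words, we start by giving every hyperparameter to be tuned a very wide range (for the number of decision trees, for example, something as wide as 10 to 5000), and then let a search algorithm explore each range.

Here we first decide which hyperparameters to tune. This article tunes six of the more important ones in the random forest algorithm: the number of decision trees, n_estimators; the maximum tree depth, max_depth; the minimum number of samples required to split a node, min_samples_split; the minimum number of samples a leaf node must contain, min_samples_leaf; the maximum number of features considered when looking for the best split, max_features; and whether to draw bootstrap samples, bootstrap. If you are not sure what these mean, section 1.5 of "Implementing Random Forest (RF) in Python and Comparing the Importance of Independent Variables" may help (although not fully understanding them won't stop you from following the steps below).

As a side note, random forests have more hyperparameters than the ones listed above. I picked a few commonly used ones based on my data and my accuracy requirements; choose the hyperparameters you need to tune for your own case and apply the same code pattern.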

# Search optimal hyperparameter

n_estimators_range=[int(x) for x in np.linspace(start=50,stop=3000,num=60)]
max_features_range=[1.0,'sqrt'] # 1.0 (all features) replaces 'auto', which scikit-learn 1.3 removed
max_depth_range=[int(x) for x in np.linspace(10,500,num=50)]
max_depth_range.append(None)
min_samples_split_range=[2,5,10]
min_samples_leaf_range=[1,2,4,8]
bootstrap_range=[True,False]

random_forest_hp_range={'n_estimators':n_estimators_range,
                        'max_features':max_features_range,
                        'max_depth':max_depth_range,
                        'min_samples_split':min_samples_split_range,
                        'min_samples_leaf':min_samples_leaf_range
                        # 'bootstrap':bootstrap_range
                        }
pprint(random_forest_hp_range)

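As you can see, the code above first sets a range for each of the six hyperparameters and then stores them in a dictionary of hyperparameter ranges, random_forest_hp_range. Note that I commented out the bootstrap entry when building the dictionary: in my runs, leaving bootstrap at True gave better results (higher accuracy), so I dropped it from the search. Decide based on your own data and model.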

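Let's look at the value of random_forest_hp_range: it is a dictionary whose keys are the hyperparameter names and whose values are the corresponding ranges. Because bootstrap is commented out, the dictionary has no bootstrap entry.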
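To get a feeling for how large the search space defined by such a dictionary is, the short sketch below multiplies the number of candidate values per hyperparameter. The counts are read off the ranges defined in this section; `sampling_budget` is a hypothetical number of combinations one might be willing to try, used only for illustration.

```python
from math import prod

# Number of candidate values per hyperparameter, read off the ranges above.
value_counts = {
    "n_estimators": 60,       # 60 values between 50 and 3000
    "max_features": 2,        # two options
    "max_depth": 51,          # 50 values between 10 and 500, plus None
    "min_samples_split": 3,   # [2, 5, 10]
    "min_samples_leaf": 4,    # [1, 2, 4, 8]
}

# Every combination picks one value per hyperparameter, so the counts multiply.
total_combinations = prod(value_counts.values())
print(total_combinations)  # 73440

# Hypothetical budget: share of the space covered by trying 200 combinations.
sampling_budget = 200
coverage = sampling_budget / total_combinations
print(f"{coverage:.4%}")
```

With tens of thousands of combinations, trying every one of them under cross-validation would be very slow, so only a small share of the space can realistically be evaluated.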

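1.3 Randomized hyperparameter search

Now that each hyperparameter has a range, the next step is to combine them, compare the model results obtained with each combination of values, and thereby determine the best combination.

You will notice that with six hyperparameters (five, not counting the bootstrap I removed), the number of possible combinations is very large, and trying them one by one would be far too expensive. So we use RandomizedSearchCV, which randomly samples hyperparameter combinations and reports the best one it finds. In other words, RandomizedSearchCV tries a random subset of combinations instead of exhaustively traversing all of them, which saves a lot of time.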

random_forest_model_test_base=RandomForestRegressor()
random_forest_model_test_random=RandomizedSearchCV(estimator=random_forest_model_test_base,
                                                   param_distributions=random_forest_hp_range,
                                                   n_iter=200,
                                                   n_jobs=-1,
                                                   cv=3,
                                                   verbose=1,
                                                   random_state=random_forest_seed
                                                   )
random_forest_model_test_random.fit(train_X,train_Y)

best_hp_now=random_forest_model_test_random.best_params_
pprint(best_hp_now)

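As the code shows, we first create a random forest model, random_forest_model_test_base, and pass it to RandomizedSearchCV. In RandomizedSearchCV, param_distributions is the random_forest_hp_range dictionary we just looked at; n_iter is the number of hyperparameter combinations to sample (a larger value covers more combinations and is more likely to find a good one, but takes longer); cv is the number of cross-validation folds (RandomizedSearchCV scores each combination by cross-validation); n_jobs and verbose control parallelism and logging and need little attention; and random_state is the seed for the random sampling of hyperparameter combinations. Note that it does not seed the random forest itself; that would require passing random_state to RandomForestRegressor.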
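The idea of drawing a fixed number of combinations at random can be sketched in plain Python. This is only an illustration of the principle, not scikit-learn's actual implementation; `sample_combinations` and `demo_hp_range` are hypothetical names introduced here, and the ranges are deliberately tiny.

```python
import itertools
import random


def sample_combinations(hp_range, n_iter, seed):
    """Draw up to n_iter distinct combinations from a dict of value lists."""
    names = list(hp_range)
    # Enumerate the full grid, then pick a random subset without replacement.
    all_combinations = list(itertools.product(*(hp_range[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(all_combinations, k=min(n_iter, len(all_combinations)))
    return [dict(zip(names, combo)) for combo in chosen]


# Hypothetical, deliberately small ranges (4 * 3 * 3 = 36 combinations).
demo_hp_range = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5, 10],
}

demo_samples = sample_combinations(demo_hp_range, n_iter=5, seed=44)
for combo in demo_samples:
    print(combo)
```

Each sampled dictionary would then be scored by cross-validation, and the best-scoring one kept. Fixing the seed makes the same combinations come out on every run.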

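Next, we fit random_forest_model_test_random and retrieve the best combination it found, best_hp_now. The number of model fits is the product of n_iter and cv: each combination is fitted once per cross-validation fold, and there are n_iter combinations. With the code above that is 200 × 3 = 600 fits. Progress is printed automatically while the program runs.

There are indeed 600 fits in total, which took 11.7 minutes on my machine. The actual speed depends on the hardware, on the size of the independent and dependent variable data, on the memory available at the time, and so on.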
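The relationship between n_iter, cv and the number of fits can be checked on a small scale. The following is a minimal sketch on synthetic data, not the crop-yield data used in this article; all `demo_*` names and the tiny ranges are assumptions chosen so that it runs in a few seconds.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data: 60 samples, 5 features (purely illustrative).
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(60, 5))
demo_y = demo_X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

# Small hypothetical ranges: 3 * 3 * 2 = 18 possible combinations.
demo_hp_range = {
    "n_estimators": [5, 10, 20],
    "max_depth": [2, 4, None],
    "min_samples_split": [2, 5],
}

demo_n_iter, demo_cv = 6, 3
demo_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=demo_hp_range,
    n_iter=demo_n_iter,
    cv=demo_cv,
    random_state=0,
)
demo_search.fit(demo_X, demo_y)

# One entry per sampled combination; each was fitted once per fold.
n_candidates = len(demo_search.cv_results_["params"])
total_fits = n_candidates * demo_cv
print(n_candidates, total_fits)  # 6 18
```

With the default refit=True, the search additionally refits the best combination once on the whole training set, so the final model is available through best_estimator_.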

Once the run finishes, let's look at the best hyperparameter combination found, best_hp_now.

After evaluating 200 combinations, we now have the best combination of the five hyperparameters. The best combination can differ considerably from run to run: on the same data, the best n_estimators I got was sometimes 100 (as in this run) and sometimes 2350, so use your own judgment.

Next, we refine the result in best_hp_now further.

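1.4 Exhaustive grid search

With RandomizedSearchCV we sampled and compared 200 random hyperparameter combinations. Because that search is random and incomplete, its best combination is not necessarily the global optimum; it only indicates roughly where the optimum lies. So we next run an exhaustive search over all combinations in the neighbourhood of that result.

That is, for each hyperparameter we pick a few values close to the best randomized-search result and use GridSearchCV to evaluate every combination of them, which yields the final hyperparameter values.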

# Grid Search

random_forest_hp_range_2={'n_estimators':[60,100,200],
                          'max_features':[12,13],
                          'max_depth':[350,400,450],
                          'min_samples_split':[2,3] # Greater than 1
                          # 'min_samples_leaf':[1,2]
                          # 'bootstrap':bootstrap_range
                          }
random_forest_model_test_2_base=RandomForestRegressor()
random_forest_model_test_2_random=GridSearchCV(estimator=random_forest_model_test_2_base,
                                               param_grid=random_forest_hp_range_2,
                                               cv=3,
                                               verbose=1,
                                               n_jobs=-1)
random_forest_model_test_2_random.fit(train_X,train_Y)

best_hp_now_2=random_forest_model_test_2_random.best_params_
pprint(best_hp_now_2)

This code is quite similar to section 1.3, so let's focus on random_forest_hp_range_2. n_estimators is set to [60,100,200] because best_hp_now gave n_estimators = 100, so we pick a few values around 100 as the new range. max_features works the same way: best_hp_now gave 'sqrt', i.e. the square root of the number of input features (independent variables). I have roughly 150 features, whose square root is about 12.25, so I chose the two nearest integers, [12,13]. The other hyperparameters are handled similarly. I also commented out 'min_samples_leaf' because, after many runs, 1 was consistently the best value, so I simply keep the default ('min_samples_leaf' defaults to 1 when not specified); the smaller the search space, the faster the program runs.
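The step from 'sqrt' to the integer candidates [12,13] can be written down as a small helper. This is a sketch under the assumption of exactly 150 features (the article only says "roughly 150"); `neighbours_of_sqrt` is a hypothetical name, not part of scikit-learn.

```python
import math


def neighbours_of_sqrt(n_features):
    """Return the integers directly below and above sqrt(n_features)."""
    root = math.sqrt(n_features)
    # A set removes the duplicate when n_features is a perfect square.
    return sorted({max(1, math.floor(root)), math.ceil(root)})


print(neighbours_of_sqrt(150))  # [12, 13]
```

For a perfect square such as 144, the helper returns the single value [12], since the floor and the ceiling coincide.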

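The number of fits here is the number of hyperparameter combinations multiplied by the number of cross-validation folds, i.e. (2*3*2*3)*3=108. Let's check that it really is 108:

Sure enough, there are 108 fits. Compared with the earlier 600 fits this is much faster, which is also why I commented out 'min_samples_leaf' and 'bootstrap': if both were included with, say, two values each, the total run time would grow roughly 2*2=4 times.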
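The fit counts quoted above can be reproduced with a few lines of arithmetic. The grid below mirrors random_forest_hp_range_2; the two extra entries in `extended_grid` are the hypothetical two-valued ranges mentioned in the text, not something the article actually ran.

```python
from math import prod

grid = {
    "n_estimators": [60, 100, 200],
    "max_features": [12, 13],
    "max_depth": [350, 400, 450],
    "min_samples_split": [2, 3],
}
cv_folds = 3

# Grid search fits every combination once per cross-validation fold.
grid_fits = prod(len(values) for values in grid.values()) * cv_folds
print(grid_fits)  # 108

# Hypothetical extension: two values each for the two commented-out entries.
extended_grid = dict(grid, min_samples_leaf=[1, 2], bootstrap=[True, False])
extended_fits = prod(len(values) for values in extended_grid.values()) * cv_folds
print(extended_fits, extended_fits // grid_fits)  # 432 4
```

Every additional hyperparameter multiplies the number of fits by its number of candidate values, which is why small grids matter so much for run time.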

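Now let's look at the best values after the grid search, best_hp_now_2.

These are the hyperparameters after one randomized search and one grid search (remember that 'min_samples_leaf' and 'bootstrap' keep their defaults, 1 and True). If you are not satisfied with this combination, you can repeat section "1.4 Exhaustive grid search" once or twice more with a refined grid, which may improve accuracy further.

1.5 Running the model and evaluating accuracy

With hyperparameter tuning finished, we can run the model, evaluate its accuracy and output the results. Apart from the first statement (which assigns the best hyperparameter combination to the model), this code was explained in detail in the previous post, so I won't repeat it here; see "Implementing Random Forest (RF) in Python and Comparing the Importance of Independent Variables".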

# Build RF regression model with optimal hyperparameters

random_forest_model_final=random_forest_model_test_2_random.best_estimator_

# Predict test set data

random_forest_predict=random_forest_model_test_2_random.predict(test_X)
random_forest_error=random_forest_predict-test_Y

# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,
                     random_seed,random_forest_seed]
for i in range(len(excel_write_content)):
    excel_write_sheet.cell(max_row+1,i+1).value=excel_write_content[i]
excel_file.save(write_excel_path)

# Draw decision tree visualizing plot

random_forest_tree=random_forest_model_final.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

# Calculate the importance of variables

random_forest_importance=list(random_forest_model_final.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,
                                                                 random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')
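For readers who want to see what the three accuracy measures in the code above compute, here is a minimal hand-written sketch in plain Python. The function names and the four sample values are made up for illustration; in practice the scipy and scikit-learn calls used above are the ones to rely on.

```python
import math


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_t = sum(y_true) / len(y_true)
    mean_p = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up values, only to exercise the three functions.
demo_true = [3.0, 5.0, 7.0, 9.0]
demo_pred = [2.5, 5.5, 7.0, 9.5]
demo_r = pearson_r(demo_true, demo_pred)
demo_r2 = r2(demo_true, demo_pred)
demo_rmse = rmse(demo_true, demo_pred)
print(demo_r, demo_r2, demo_rmse)
```

Note that the squared Pearson coefficient and R2 are generally not the same number: the former measures linear association only, while the latter also penalizes systematic offsets between predictions and true values.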

2 Complete code

The complete code used in this article is as follows.

# -*- coding: utf-8 -*-
"""
Created on Sun Mar 21 22:05:37 2021

@author: fkxxgis
"""

import pydot
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
from pprint import pprint
from sklearn import metrics
from openpyxl import load_workbook
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Attention! Data Partition
# Attention! One-Hot Encoding

train_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Train.csv'
test_data_path='G:/CropYield/03_DL/00_Data/AllDataAll_Test.csv'
write_excel_path='G:/CropYield/03_DL/05_NewML/ParameterResult_ML.xlsx'
tree_graph_dot_path='G:/CropYield/03_DL/05_NewML/tree.dot'
tree_graph_png_path='G:/CropYield/03_DL/05_NewML/tree.png'

random_seed=44
random_forest_seed=np.random.randint(low=1,high=230)

# Data import

train_data=pd.read_csv(train_data_path,header=0)
test_data=pd.read_csv(test_data_path,header=0)

# Separate independent and dependent variables

train_Y=np.array(train_data['Yield'])
train_X=train_data.drop(['ID','Yield'],axis=1)
train_X_column_name=list(train_X.columns)
train_X=np.array(train_X)

test_Y=np.array(test_data['Yield'])
test_X=test_data.drop(['ID','Yield'],axis=1)
test_X=np.array(test_X)

# Search optimal hyperparameter

n_estimators_range=[int(x) for x in np.linspace(start=50,stop=3000,num=60)]
max_features_range=[1.0,'sqrt'] # 1.0 (all features) replaces 'auto', which scikit-learn 1.3 removed
max_depth_range=[int(x) for x in np.linspace(10,500,num=50)]
max_depth_range.append(None)
min_samples_split_range=[2,5,10]
min_samples_leaf_range=[1,2,4,8]
bootstrap_range=[True,False]

random_forest_hp_range={'n_estimators':n_estimators_range,
                        'max_features':max_features_range,
                        'max_depth':max_depth_range,
                        'min_samples_split':min_samples_split_range,
                        'min_samples_leaf':min_samples_leaf_range
                        # 'bootstrap':bootstrap_range
                        }
pprint(random_forest_hp_range)

random_forest_model_test_base=RandomForestRegressor()
random_forest_model_test_random=RandomizedSearchCV(estimator=random_forest_model_test_base,
                                                   param_distributions=random_forest_hp_range,
                                                   n_iter=200,
                                                   n_jobs=-1,
                                                   cv=3,
                                                   verbose=1,
                                                   random_state=random_forest_seed
                                                   )
random_forest_model_test_random.fit(train_X,train_Y)

best_hp_now=random_forest_model_test_random.best_params_
pprint(best_hp_now)

# Grid Search

random_forest_hp_range_2={'n_estimators':[60,100,200],
                          'max_features':[12,13],
                          'max_depth':[350,400,450],
                          'min_samples_split':[2,3] # Greater than 1
                          # 'min_samples_leaf':[1,2]
                          # 'bootstrap':bootstrap_range
                          }
random_forest_model_test_2_base=RandomForestRegressor()
random_forest_model_test_2_random=GridSearchCV(estimator=random_forest_model_test_2_base,
                                               param_grid=random_forest_hp_range_2,
                                               cv=3,
                                               verbose=1,
                                               n_jobs=-1)
random_forest_model_test_2_random.fit(train_X,train_Y)

best_hp_now_2=random_forest_model_test_2_random.best_params_
pprint(best_hp_now_2)

# Build RF regression model with optimal hyperparameters

random_forest_model_final=random_forest_model_test_2_random.best_estimator_

# Predict test set data

random_forest_predict=random_forest_model_test_2_random.predict(test_X)
random_forest_error=random_forest_predict-test_Y

# Draw test plot

plt.figure(1)
plt.clf()
ax=plt.axes(aspect='equal')
plt.scatter(test_Y,random_forest_predict)
plt.xlabel('True Values')
plt.ylabel('Predictions')
Lims=[0,10000]
plt.xlim(Lims)
plt.ylim(Lims)
plt.plot(Lims,Lims)
plt.grid(False)
    
plt.figure(2)
plt.clf()
plt.hist(random_forest_error,bins=30)
plt.xlabel('Prediction Error')
plt.ylabel('Count')
plt.grid(False)

# Verify the accuracy

random_forest_pearson_r=stats.pearsonr(test_Y,random_forest_predict)
random_forest_R2=metrics.r2_score(test_Y,random_forest_predict)
random_forest_RMSE=metrics.mean_squared_error(test_Y,random_forest_predict)**0.5
print('Pearson correlation coefficient is {0}, and RMSE is {1}.'.format(random_forest_pearson_r[0],
                                                                        random_forest_RMSE))

# Save key parameters

excel_file=load_workbook(write_excel_path)
excel_all_sheet=excel_file.sheetnames
excel_write_sheet=excel_file[excel_all_sheet[0]]
excel_write_sheet=excel_file.active
max_row=excel_write_sheet.max_row
excel_write_content=[random_forest_pearson_r[0],random_forest_R2,random_forest_RMSE,
                     random_seed,random_forest_seed]
for i in range(len(excel_write_content)):
    excel_write_sheet.cell(max_row+1,i+1).value=excel_write_content[i]
excel_file.save(write_excel_path)

# Draw decision tree visualizing plot

random_forest_tree=random_forest_model_final.estimators_[5]
export_graphviz(random_forest_tree,out_file=tree_graph_dot_path,
                feature_names=train_X_column_name,rounded=True,precision=1)
(random_forest_graph,)=pydot.graph_from_dot_file(tree_graph_dot_path)
random_forest_graph.write_png(tree_graph_png_path)

# Calculate the importance of variables

random_forest_importance=list(random_forest_model_final.feature_importances_)
random_forest_feature_importance=[(feature,round(importance,8)) 
                                  for feature, importance in zip(train_X_column_name,
                                                                 random_forest_importance)]
random_forest_feature_importance=sorted(random_forest_feature_importance,key=lambda x:x[1],reverse=True)
plt.figure(3)
plt.clf()
importance_plot_x_values=list(range(len(random_forest_importance)))
plt.bar(importance_plot_x_values,random_forest_importance,orientation='vertical')
plt.xticks(importance_plot_x_values,train_X_column_name,rotation='vertical')
plt.xlabel('Variable')
plt.ylabel('Importance')
plt.title('Variable Importances')
