Python automatic binning: example code for computing WOE and IV

Updated: November 22, 2019, 09:55:32  Author: kidxu


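When I built scorecards in R, I needed to bin variables and compute their WOE and IV values. The R package I used was smbinning, which performs the binning automatically. I recently switched to Python and wanted the same automatic binning. I found a package called woe at https://pypi.org/project/woe/, which can be installed directly with pip install woe.

The official introduction and examples for the woe package are not easy to follow, and the individual functions are not documented in detail. After studying the package closely, I am recording here how to use it and how its calculations work.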

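Example

The official example is hard to follow, so below is a usage example I wrote myself. It shows how the main functions are used. The functions that compute WOE are mostly defined in feature_process.py.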

import woe.feature_process as fp
import woe.eval as woe_eval  # aliased so the built-in eval() is not shadowed

# %% WOE binning, IV, and transform
# Assumes `data` is a pandas DataFrame whose first column is the 0/1 'target'.
data_woe = data.copy()  # holds the WOE values of all columns; copy so `data` is not modified
civ_list = []
n_positive = sum(data['target'])
n_negative = len(data) - n_positive
for column in list(data.columns[1:]):
    if data[column].dtypes == 'object':
        # String-typed columns are treated as discrete variables
        civ = fp.proc_woe_discrete(data, column, n_positive, n_negative, 0.05*len(data), alpha=0.05)
    else:
        civ = fp.proc_woe_continuous(data, column, n_positive, n_negative, 0.05*len(data), alpha=0.05)
    civ_list.append(civ)
    data_woe[column] = fp.woe_trans(data[column], civ)

civ_df = woe_eval.eval_feature_detail(civ_list, 'output_feature_detail_0315.csv')
# Drop variables whose IV is too small
iv_thre = 0.001
iv = civ_df[['var_name', 'iv']].drop_duplicates()
x_columns = iv.var_name[iv.iv > iv_thre]
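
The woe_trans step in the loop above replaces each raw value with the WOE of the bin it falls into. The following is a minimal sketch of that idea for a continuous variable, written in plain Python. The function name woe_transform, the right-closed interval convention, and the sample numbers are illustrative assumptions and are not taken from the woe package.

```python
from bisect import bisect_left

def woe_transform(values, split_points, woe_list):
    """Map each value to the WOE of the bin it falls in.

    split_points: sorted cut points [s0, s1, ...] defining the bins
                  (-inf, s0], (s0, s1], ..., (s_last, +inf).
                  Right-closed bins are an assumption of this sketch.
    woe_list: one WOE per bin, so len(woe_list) == len(split_points) + 1.
    """
    if len(woe_list) != len(split_points) + 1:
        raise ValueError("need exactly one WOE value per bin")
    # bisect_left counts the split points strictly below v, which is
    # the index of the right-closed bin that contains v.
    return [woe_list[bisect_left(split_points, v)] for v in values]

# Hypothetical variable with two cut points and three bins.
ages = [18, 25, 40, 41, 70]
encoded = woe_transform(ages, split_points=[25, 40], woe_list=[0.8, 0.1, -0.6])
print(encoded)
```

Values that sit exactly on a cut point go to the lower bin here. If the package uses the opposite convention, bisect_right would be the matching choice.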

Computing bins, WOE and IV

The core functions are feature_process.proc_woe_discrete() and feature_process.proc_woe_continuous(), which compute WOE for discrete and continuous variables respectively. They take the same arguments:

proc_woe_discrete(df,var,global_bt,global_gt,min_sample,alpha=0.01)

proc_woe_continuous(df,var,global_bt,global_gt,min_sample,alpha=0.01)

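Input:

df: DataFrame holding the data for which WOE is computed. It must contain a 'target' column whose values are in {0, 1}.

var: name of the variable for which WOE is computed.

global_bt: global bad total, the number of positive samples in df.

global_gt: global good total, the number of negative samples in df.

min_sample: minimum number of samples allowed in each bin, usually set to 5% of the total sample size.

alpha: threshold used when deciding automatically whether to split a bin; default 0.01. A split is made if iv_split > iv_no_split * (1 + alpha).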
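
To make the alpha rule concrete, here is a minimal sketch of a per-bin WOE and IV calculation followed by the split test. It uses the common definition WOE = ln(bad share / good share) and IV = (bad share - good share) * WOE. The function names bin_woe_iv and should_split, the smoothing term eps, and the toy counts are assumptions for illustration; the package's own implementation may differ in sign convention or smoothing.

```python
import math

def bin_woe_iv(bad, good, bad_total, good_total, eps=1e-4):
    """Return (woe, iv) for one bin.

    bad / good: positive / negative sample counts inside the bin.
    bad_total / good_total: counts over the whole data set
                            (the roles of global_bt and global_gt).
    eps: small smoothing term (an assumption) that avoids log(0)
         when a bin contains only one class.
    """
    bad_share = (bad + eps) / bad_total
    good_share = (good + eps) / good_total
    woe = math.log(bad_share / good_share)
    iv = (bad_share - good_share) * woe
    return woe, iv

def should_split(iv_split, iv_no_split, alpha=0.01):
    """Split only if the IV after splitting beats the unsplit IV by a factor of 1 + alpha."""
    return iv_split > iv_no_split * (1 + alpha)

# Toy data: 100 positives and 900 negatives overall.
bad_total, good_total = 100, 900

# Unsplit node: it contains every sample, so its IV is essentially zero.
_, iv_parent = bin_woe_iv(100, 900, bad_total, good_total)

# Candidate split into two bins.
woe_left, iv_left = bin_woe_iv(70, 330, bad_total, good_total)
woe_right, iv_right = bin_woe_iv(30, 570, bad_total, good_total)
iv_children = iv_left + iv_right

print(round(woe_left, 3), round(woe_right, 3), round(iv_children, 3))
print(should_split(iv_children, iv_parent, alpha=0.05))
```

A larger alpha demands a bigger IV gain before a split is accepted, so it tends to produce fewer, coarser bins.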

Output: an object of the custom InfoValue class, which holds all of the binning results.

The class is defined in the following code.

class InfoValue(object):
    '''
    InfoValue Class
    '''
    def __init__(self):
        self.var_name = []
        self.split_list = []
        self.iv = 0
        self.woe_list = []
        self.iv_list = []
        self.is_discrete = 0
        self.sub_total_sample_num = []
        self.positive_sample_num = []
        self.negative_sample_num = []
        self.sub_total_num_percentage = []
        self.positive_rate_in_sub_total = []
        self.negative_rate_in_sub_total = []

    def init(self,civ):
        self.var_name = civ.var_name
        self.split_list = civ.split_list
        self.iv = civ.iv
        self.woe_list = civ.woe_list
        self.iv_list = civ.iv_list
        self.is_discrete = civ.is_discrete
        self.sub_total_sample_num = civ.sub_total_sample_num
        self.positive_sample_num = civ.positive_sample_num
        self.negative_sample_num = civ.negative_sample_num
        self.sub_total_num_percentage = civ.sub_total_num_percentage
        self.positive_rate_in_sub_total = civ.positive_rate_in_sub_total
        self.negative_rate_in_sub_total = civ.negative_rate_in_sub_total

Printing the binning results

eval.eval_feature_detail(Info_Value_list,out_path=False)

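Input:

Info_Value_list: a list holding the binning result of each variable (the return values of proc_woe_continuous / proc_woe_discrete).

out_path: path where the binning results are saved, as a CSV file.

Output:

A DataFrame of the binning results for each variable. Its columns contain the following information: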


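var_name	variable name
split_list	bin interval
sub_total_sample_num	total number of samples in the bin
positive_sample_num	number of positive samples in the bin
negative_sample_num	number of negative samples in the bin
sub_total_num_percentage	the bin's share of all samples
positive_rate_in_sub_total	the bin's positive samples as a share of all positive samples
woe_list	WOE
iv_list	IV of the bin
iv	IV of the variable (sum of the bin IVs)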
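
Because the detail table has one row per bin while the variable-level iv is repeated on every row, selecting variables by IV needs one value per variable first. The sketch below mimics that layout in plain Python. The SimpleNamespace objects stand in for InfoValue results, and the helper detail_rows, the interval labels, and all numbers are hypothetical.

```python
from types import SimpleNamespace

def detail_rows(civ):
    """Expand one binning result into one row per bin, repeating the variable-level IV."""
    return [
        {"var_name": civ.var_name, "split": split, "woe": woe,
         "iv_bin": iv_bin, "iv": civ.iv}
        for split, woe, iv_bin in zip(civ.split_list, civ.woe_list, civ.iv_list)
    ]

# Stand-ins for two binning results (made-up values).
civs = [
    SimpleNamespace(var_name="age",
                    split_list=["(-inf,25]", "(25,40]", "(40,+inf)"],
                    woe_list=[0.8, 0.1, -0.6],
                    iv_list=[0.12, 0.01, 0.09]),
    SimpleNamespace(var_name="noise",
                    split_list=["(-inf,+inf)"],
                    woe_list=[0.0],
                    iv_list=[0.0]),
]
for civ in civs:
    civ.iv = sum(civ.iv_list)  # variable IV is the sum of its bin IVs

rows = [row for civ in civs for row in detail_rows(civ)]

# Collapse the repeated per-row IV into one value per variable,
# then keep the variables whose IV exceeds the threshold.
iv_by_var = {row["var_name"]: row["iv"] for row in rows}
iv_thre = 0.001
selected = [name for name, iv in iv_by_var.items() if iv > iv_thre]
print(len(rows), selected)
```

This is the same role that drop_duplicates() plays on the ['var_name', 'iv'] columns in the example at the top of the article.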