Python3 ID3決策樹(shù)判斷申請(qǐng)貸款是否成功的實(shí)現(xiàn)代碼
1. 定義生成樹(shù)
# -*- coding: utf-8 -*-
#生成樹(shù)的函數(shù)
from numpy import *
import numpy as np
import pandas as pd
from math import log
import operator
# 計(jì)算數(shù)據(jù)集的信息熵(Information Gain)增益函數(shù)(機(jī)器學(xué)習(xí)實(shí)戰(zhàn)中信息熵叫香農(nóng)熵)
def calcInfoEnt(dataSet):#本題中Label即好or壞瓜 #dataSet每一列是一個(gè)屬性(列末是Label)
numEntries = len(dataSet) #每一行是一個(gè)樣本
labelCounts = {} #給所有可能的分類創(chuàng)建字典labelCounts
for featVec in dataSet: #按行循環(huán):即rowVev取遍了數(shù)據(jù)集中的每一行
currentLabel = featVec[-1] #故featVec[-1]取遍每行最后一個(gè)值即Label
if currentLabel not in labelCounts.keys(): #如果當(dāng)前的Label在字典中還沒(méi)有
labelCounts[currentLabel] = 0 #則先賦值0來(lái)創(chuàng)建這個(gè)詞
labelCounts[currentLabel] += 1 #計(jì)數(shù), 統(tǒng)計(jì)每類Label數(shù)量(這行不受if限制)
InfoEnt = 0.0
for key in labelCounts: #遍歷每類Label
prob = float(labelCounts[key])/numEntries #各類Label熵累加
InfoEnt -= prob * log(prob,2) #ID3用的信息熵增益公式
return InfoEnt
### 對(duì)于離散特征: 取出該特征取值為value的所有樣本
def splitDiscreteDataSet(dataSet, axis, value): #dataSet是當(dāng)前結(jié)點(diǎn)(待劃分)集合,axis指示劃分所依據(jù)的屬性,value該屬性用于劃分的取值
retDataSet = [] #為return Data Set分配一個(gè)列表用來(lái)儲(chǔ)存
for featVec in dataSet:
if featVec[axis] == value:
reducedFeatVec = featVec[:axis] #該特征之前的特征仍保留在樣本dataSet中
reducedFeatVec.extend(featVec[axis+1:]) #該特征之后的特征仍保留在樣本dataSet中
retDataSet.append(reducedFeatVec) #把這個(gè)樣本加到list中
return retDataSet
### 對(duì)于連續(xù)特征: 返回特征取值大于value的所有樣本(以value為閾值將集合分成兩部分)
def splitContinuousDataSet(dataSet, axis, value):
retDataSetG = [] #將儲(chǔ)存取值大于value的樣本
retDataSetL = [] #將儲(chǔ)存取值小于value的樣本
for featVec in dataSet:
if featVec[axis] > value:
reducedFeatVecG = featVec[:axis]
reducedFeatVecG.extend(featVec[axis+1:])
retDataSetG.append(reducedFeatVecG)
else:
reducedFeatVecL = featVec[:axis]
reducedFeatVecL.extend(featVec[axis+1:])
retDataSetL.append(reducedFeatVecL)
return retDataSetG,retDataSetL #返回兩個(gè)集合, 是含2個(gè)元素的tuple形式
### 根據(jù)InfoGain選擇當(dāng)前最好的劃分特征(以及對(duì)于連續(xù)變量還要選擇以什么值劃分)
def chooseBestFeatureToSplit(dataSet,labels):
numFeatures = len(dataSet[0])-1
baseEntropy = calcInfoEnt(dataSet)
bestInfoGain = 0.0; bestFeature = -1
bestSplitDict = {}
for i in range(numFeatures):
#遍歷所有特征:下面這句是取每一行的第i個(gè), 即得當(dāng)前集合所有樣本第i個(gè)feature的值
featList = [example[i] for example in dataSet]
#判斷是否為離散特征
if not (type(featList[0]).__name__=='float' or type(featList[0]).__name__=='int'):
# 對(duì)于離散特征:求若以該特征劃分的熵增
uniqueVals = set(featList) #從列表中創(chuàng)建集合set(得列表唯一元素值)
newEntropy = 0.0
for value in uniqueVals: #遍歷該離散特征每個(gè)取值
subDataSet = splitDiscreteDataSet(dataSet, i, value)#計(jì)算每個(gè)取值的信息熵
prob = len(subDataSet)/float(len(dataSet))
newEntropy += prob * calcInfoEnt(subDataSet)#各取值的熵累加
infoGain = baseEntropy - newEntropy #得到以該特征劃分的熵增
# 對(duì)于連續(xù)特征:求若以該特征劃分的熵增(區(qū)別:n個(gè)數(shù)據(jù)則需添n-1個(gè)候選劃分點(diǎn), 并選最佳劃分點(diǎn))
else: #產(chǎn)生n-1個(gè)候選劃分點(diǎn)
sortfeatList=sorted(featList)
splitList=[]
for j in range(len(sortfeatList)-1): #產(chǎn)生n-1個(gè)候選劃分點(diǎn)
splitList.append((sortfeatList[j] + sortfeatList[j+1])/2.0)
bestSplitEntropy = 10000 #設(shè)定一個(gè)很大的熵值(之后用)
#遍歷n-1個(gè)候選劃分點(diǎn): 求選第j個(gè)候選劃分點(diǎn)劃分時(shí)的熵增, 并選出最佳劃分點(diǎn)
for j in range(len(splitList)):
value = splitList[j]
newEntropy = 0.0
DataSet = splitContinuousDataSet(dataSet, i, value)
subDataSetG = DataSet[0]
subDataSetL = DataSet[1]
probG = len(subDataSetG) / float(len(dataSet))
newEntropy += probG * calcInfoEnt(subDataSetG)
probL = len(subDataSetL) / float(len(dataSet))
newEntropy += probL * calcInfoEnt(subDataSetL)
if newEntropy < bestSplitEntropy:
bestSplitEntropy = newEntropy
bestSplit = j
bestSplitDict[labels[i]] = splitList[bestSplit]#字典記錄當(dāng)前連續(xù)屬性的最佳劃分點(diǎn)
infoGain = baseEntropy - bestSplitEntropy #計(jì)算以該節(jié)點(diǎn)劃分的熵增
# 在所有屬性(包括連續(xù)和離散)中選擇可以獲得最大熵增的屬性
if infoGain > bestInfoGain:
bestInfoGain = infoGain
bestFeature = i
#若當(dāng)前節(jié)點(diǎn)的最佳劃分特征為連續(xù)特征,則需根據(jù)“是否小于等于其最佳劃分點(diǎn)”進(jìn)行二值化處理
#即將該特征改為“是否小于等于bestSplitValue”, 例如將“密度”變?yōu)椤懊芏?lt;=0.3815”
#注意:以下這段直接操作了原dataSet數(shù)據(jù), 之前的那些float型的值相應(yīng)變?yōu)?和1
#【為何這樣做?】在函數(shù)createTree()末尾將看到解釋
if type(dataSet[0][bestFeature]).__name__=='float' or type(dataSet[0][bestFeature]).__name__=='int':
bestSplitValue = bestSplitDict[labels[bestFeature]]
labels[bestFeature] = labels[bestFeature] + '<=' + str(bestSplitValue)
for i in range(shape(dataSet)[0]):
if dataSet[i][bestFeature] <= bestSplitValue:
dataSet[i][bestFeature] = 1
else:
dataSet[i][bestFeature] = 0
return bestFeature
# 若特征已經(jīng)劃分完,節(jié)點(diǎn)下的樣本還沒(méi)有統(tǒng)一取值,則需要進(jìn)行投票:計(jì)算每類Label個(gè)數(shù), 取max者
def majorityCnt(classList):
classCount = {} #將創(chuàng)建鍵值為L(zhǎng)abel類型的字典
for vote in classList:
if vote not in classCount.keys():
classCount[vote] = 0 #第一次出現(xiàn)的Label加入字典
classCount[vote] += 1 #計(jì)數(shù)
return max(classCount)
2. 遞歸產(chǎn)生決策樹(shù)
# 主程序:遞歸產(chǎn)生決策樹(shù)
# dataSet:當(dāng)前用于構(gòu)建樹(shù)的數(shù)據(jù)集, 最開(kāi)始就是data_full,然后隨著劃分的進(jìn)行越來(lái)越小。這是因?yàn)檫M(jìn)行到到樹(shù)分叉點(diǎn)上了. 第一次劃分之前17個(gè)瓜的數(shù)據(jù)在根節(jié)點(diǎn),然后選擇第一個(gè)bestFeat是紋理. 紋理的取值有清晰、模糊、稍糊三種;將瓜分成了清晰(9個(gè)),稍糊(5個(gè)),模糊(3個(gè)),這時(shí)應(yīng)該將劃分的類別減少1以便于下次劃分。
# labels:當(dāng)前數(shù)據(jù)集中有的用于劃分的類別(這是因?yàn)橛行㎜abel當(dāng)前數(shù)據(jù)集沒(méi)了, 比如假如到某個(gè)點(diǎn)上西瓜都是淺白沒(méi)有深綠了)
# data_full:全部的數(shù)據(jù)
# label_full:全部的類別
numLine = numColumn = 2 #這句是因?yàn)橹笠胓lobal numLine……至于為什么我一定要用global
# 我也不完全理解。如果我只定義local變量總報(bào)錯(cuò),我只好在那里的if里用global變量了。求解。
def createTree(dataSet,labels,data_full,labels_full):
classList = [example[-1] for example in dataSet]
#遞歸停止條件1:當(dāng)前節(jié)點(diǎn)所有樣本屬于同一類;(注:count()方法統(tǒng)計(jì)某元素在列表中出現(xiàn)的次數(shù))
if classList.count(classList[0]) == len(classList):
return classList[0]
#遞歸停止條件2:當(dāng)前節(jié)點(diǎn)上樣本集合為空集(即特征的某個(gè)取值上已經(jīng)沒(méi)有樣本了):
global numLine,numColumn
(numLine,numColumn) = shape(dataSet)
if float(numLine) == 0:
return 'empty'
#遞歸停止條件3:所有可用于劃分的特征均使用過(guò)了,則調(diào)用majorityCnt()投票定Label;
if float(numColumn) == 1:
return majorityCnt(classList)
#不停止時(shí)繼續(xù)劃分:
bestFeat = chooseBestFeatureToSplit(dataSet,labels)#調(diào)用函數(shù)找出當(dāng)前最佳劃分特征是第幾個(gè)
bestFeatLabel = labels[bestFeat] #當(dāng)前最佳劃分特征
myTree = {bestFeatLabel:{}}
featValues = [example[bestFeat] for example in dataSet]
uniqueVals = set(featValues)
if type(dataSet[0][bestFeat]).__name__=='str':
currentlabel = labels_full.index(labels[bestFeat])
featValuesFull = [example[currentlabel] for example in data_full]
uniqueValsFull = set(featValuesFull)
del(labels[bestFeat]) #劃分完后, 即當(dāng)前特征已經(jīng)使用過(guò)了, 故將其從“待劃分特征集”中刪去
#【遞歸調(diào)用】針對(duì)當(dāng)前用于劃分的特征(beatFeat)的每個(gè)取值,劃分出一個(gè)子樹(shù)。
for value in uniqueVals: #遍歷該特征【現(xiàn)存的】取值
subLabels = labels[:]
if type(dataSet[0][bestFeat]).__name__=='str':
uniqueValsFull.remove(value) #劃分后刪去(從uniqueValsFull中刪!)
myTree[bestFeatLabel][value] = createTree(splitDiscreteDataSet(dataSet,bestFeat,value),subLabels,data_full,labels_full)#用splitDiscreteDataSet()
#是由于, 所有的連續(xù)特征在劃分后都被我們定義的chooseBestFeatureToSplit()處理成離散取值了。
if type(dataSet[0][bestFeat]).__name__=='str': #若該特征離散【更詳見(jiàn)后注】
for value in uniqueValsFull:#則可能有些取值已經(jīng)不在【現(xiàn)存的】取值中了
#這就是上面為何從“uniqueValsFull”中刪去
#因?yàn)槟切┈F(xiàn)有數(shù)據(jù)集中沒(méi)取到的該特征的值,保留在了其中
myTree[bestFeatLabel][value] = majorityCnt(classList)
return myTree
3. 調(diào)用生成樹(shù)
#生成樹(shù)調(diào)用的語(yǔ)句 df = pd.read_excel(r'E:\BaiduNetdiskDownload\spss\數(shù)據(jù)\實(shí)驗(yàn)data\銀行貸款.xlsx') data = df.values[:,1:].tolist() data_full = data[:] labels = df.columns.values[1:-1].tolist() labels_full = labels[:] myTree = createTree(data,labels,data_full,labels_full)
查看數(shù)據(jù)
data

labels

4. 繪制決策樹(shù)
#繪決策樹(shù)的函數(shù) import matplotlib.pyplot as plt decisionNode = dict(boxstyle = "sawtooth",fc = "0.8") #定義分支點(diǎn)的樣式 leafNode = dict(boxstyle = "round4",fc = "0.8") #定義葉節(jié)點(diǎn)的樣式 arrow_args = dict(arrowstyle = "<-") #定義箭頭標(biāo)識(shí)樣式 # 計(jì)算樹(shù)的葉子節(jié)點(diǎn)數(shù)量 def getNumLeafs(myTree): numLeafs = 0 firstStr = list(myTree.keys())[0] secondDict = myTree[firstStr] for key in secondDict.keys(): if type(secondDict[key]).__name__=='dict': numLeafs += getNumLeafs(secondDict[key]) else: numLeafs += 1 return numLeafs # 計(jì)算樹(shù)的最大深度 def getTreeDepth(myTree): maxDepth = 0 firstStr = list(myTree.keys())[0] secondDict = myTree[firstStr] for key in secondDict.keys(): if type(secondDict[key]).__name__=='dict': thisDepth = 1 + getTreeDepth(secondDict[key]) else: thisDepth = 1 if thisDepth > maxDepth: maxDepth = thisDepth return maxDepth # 畫(huà)出節(jié)點(diǎn) def plotNode(nodeTxt,centerPt,parentPt,nodeType): createPlot.ax1.annotate(nodeTxt,xy = parentPt,xycoords = 'axes fraction',xytext = centerPt,textcoords = 'axes fraction',va = "center", ha = "center",bbox = nodeType,arrowprops = arrow_args) # 標(biāo)箭頭上的文字 def plotMidText(cntrPt,parentPt,txtString): lens = len(txtString) xMid = (parentPt[0] + cntrPt[0]) / 2.0 - lens*0.002 yMid = (parentPt[1] + cntrPt[1]) / 2.0 createPlot.ax1.text(xMid,yMid,txtString) def plotTree(myTree,parentPt,nodeTxt): numLeafs = getNumLeafs(myTree) depth = getTreeDepth(myTree) firstStr = list(myTree.keys())[0] cntrPt = (plotTree.x0ff + (1.0 + float(numLeafs))/2.0/plotTree.totalW,plotTree.y0ff) plotMidText(cntrPt,parentPt,nodeTxt) plotNode(firstStr,cntrPt,parentPt,decisionNode) secondDict = myTree[firstStr] plotTree.y0ff = plotTree.y0ff - 1.0/plotTree.totalD for key in secondDict.keys(): if type(secondDict[key]).__name__=='dict': plotTree(secondDict[key],cntrPt,str(key)) else: plotTree.x0ff = plotTree.x0ff + 1.0/plotTree.totalW plotNode(secondDict[key],(plotTree.x0ff,plotTree.y0ff),cntrPt,leafNode) plotMidText((plotTree.x0ff,plotTree.y0ff),cntrPt,str(key)) plotTree.y0ff = plotTree.y0ff + 1.0/plotTree.totalD def createPlot(inTree): fig = plt.figure(1,facecolor = 'white') fig.clf() axprops = dict(xticks = [],yticks = []) createPlot.ax1 = plt.subplot(111,frameon = False,**axprops) plotTree.totalW = float(getNumLeafs(inTree)) plotTree.totalD = float(getTreeDepth(inTree)) plotTree.x0ff = -0.5/plotTree.totalW plotTree.y0ff = 1.0 plotTree(inTree,(0.5,1.0),'') plt.show()
5. 調(diào)用函數(shù)
#命令繪決策樹(shù)的圖 createPlot(myTree)
myTree
總結(jié)
到此這篇關(guān)于Python3 ID3決策樹(shù)判斷申請(qǐng)貸款是否成功的實(shí)現(xiàn)代碼的文章就介紹到這了,更多相關(guān)python ID3 決策樹(shù)判斷內(nèi)容請(qǐng)搜索腳本之家以前的文章或繼續(xù)瀏覽下面的相關(guān)文章希望大家以后多多支持腳本之家!
相關(guān)文章
python小白練習(xí)題之條件控制與循環(huán)控制
Python 中的條件控制和循環(huán)語(yǔ)句都非常簡(jiǎn)單,也非常容易理解,與其他編程語(yǔ)言類似,下面這篇文章主要給大家介紹了關(guān)于python小白練習(xí)題之條件控制與循環(huán)控制的相關(guān)資料,需要的朋友可以參考下2021-10-10
flask/django 動(dòng)態(tài)查詢表結(jié)構(gòu)相同表名不同數(shù)據(jù)的Model實(shí)現(xiàn)方法
今天小編就為大家分享一篇flask/django 動(dòng)態(tài)查詢表結(jié)構(gòu)相同表名不同數(shù)據(jù)的Model實(shí)現(xiàn)方法,具有很好的參考價(jià)值,希望對(duì)大家有所幫助。一起跟隨小編過(guò)來(lái)看看吧2019-08-08
xadmin使用formfield_for_dbfield函數(shù)過(guò)濾下拉表單實(shí)例
這篇文章主要介紹了xadmin使用formfield_for_dbfield函數(shù)過(guò)濾下拉表單實(shí)例,具有很好的參考價(jià)值,希望對(duì)大家有所幫助。一起跟隨小編過(guò)來(lái)看看吧2020-04-04
Python import用法以及與from...import的區(qū)別
這篇文章主要介紹了Python import用法以及與from...import的區(qū)別,本文簡(jiǎn)潔明了,很容易看懂,需要的朋友可以參考下2015-05-05
Python異常捕獲以及簡(jiǎn)單錯(cuò)誤日志生成方式
這篇文章主要介紹了Python異常捕獲以及簡(jiǎn)單錯(cuò)誤日志生成方式,具有很好的參考價(jià)值,希望對(duì)大家有所幫助,如有錯(cuò)誤或未考慮完全的地方,望不吝賜教2023-09-09
Python實(shí)現(xiàn)電腦喚醒后自動(dòng)拍照截屏并發(fā)郵件通知
解決Pycharm無(wú)法import自己安裝的第三方module問(wèn)題
python+pytest自動(dòng)化測(cè)試函數(shù)測(cè)試類測(cè)試方法的封裝

