Extracting Key Text Features in Python with the Chi-Square (CHI) Feature Test

Updated: 2022-12-01 15:30:07 Author: Toblerone_Wind

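The basic idea of the chi-square test is to judge whether a hypothesis holds by observing how far the actual values deviate from the theoretical values. This article uses the chi-square (CHI) feature test to extract key text features; read on if you are interested.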

Theory

	In class_i	Not in class_i
Documents containing word_j	A	B
Documents not containing word_j	C	D

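Chi-square feature extraction measures the dependency between the class class_i and the word word_j. It is computed as

$$\chi^2(word_j, class_i) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$$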

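where N is the total number of documents, A is the number of documents that contain word_j and belong to class_i, B is the number that contain word_j but do not belong to class_i, C is the number that do not contain word_j but belong to class_i, and D is the number that neither contain word_j nor belong to class_i. Note that N = A + B + C + D.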
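
As a small worked example, the statistic N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), which is also the expression used by the code later in this article, can be evaluated directly from the four counts. The helper name `chi_square`, the zero-denominator guard, and the sample counts are assumptions made for illustration only.

```python
def chi_square(a: float, b: float, c: float, d: float) -> float:
    """Chi-square statistic of a 2x2 table with cells A, B, C, D as defined above."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        # Degenerate table (an empty row or column): assumed here to score 0.
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts: the word occurs in all 3 in-class documents
# and in none of the 3 out-of-class documents.
print(chi_square(3, 0, 0, 3))  # 6.0
# A word spread evenly over both classes carries no signal.
print(chi_square(1, 1, 2, 2))  # 0.0
```

The larger the value, the stronger the dependency between the word and the class.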

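The final CHI value of word_j is computed as follows, where P(class_i) is the probability that a document belongs to class_i (the share of class_i documents among all documents) and k is the total number of classes:

$$CHI(word_j) = \sum_{i=1}^{k} P(class_i)\,\chi^2(word_j, class_i)$$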
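
The weighted sum over classes can be sketched for an arbitrary number of classes as follows. This is only an illustration under stated assumptions: the function names `chi_square` and `chi_word`, the three-class toy corpus, and the choice to score degenerate tables as 0 are not part of the original article, which only covers the binary case in code.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 table; degenerate tables are scored 0 (assumption)."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denominator == 0 else n * (a * d - c * b) ** 2 / denominator


def chi_word(word, documents, labels):
    """Sum of per-class chi-square values, each weighted by P(class_i)."""
    token_sets = [set(doc.split()) for doc in documents]
    n = len(documents)
    total = 0.0
    for cls in sorted(set(labels)):
        a = sum(1 for toks, y in zip(token_sets, labels) if word in toks and y == cls)
        b = sum(1 for toks, y in zip(token_sets, labels) if word in toks and y != cls)
        c = sum(1 for toks, y in zip(token_sets, labels) if word not in toks and y == cls)
        d = n - a - b - c
        p_cls = labels.count(cls) / n  # P(class_i)
        total += p_cls * chi_square(a, b, c, d)
    return total


# Hypothetical three-class corpus.
docs = ["good film", "good plot", "bad film", "bad plot", "dull plot", "dull film"]
labels = ["pos", "pos", "neg", "neg", "meh", "meh"]
print(round(chi_word("good", docs, labels), 6))  # 3.0
print(round(chi_word("film", docs, labels), 6))  # 0.0
```

Here "good" scores 6.0 against "pos" and 1.5 against each of the other two classes, each weighted by 1/3, while "film" is spread evenly over all classes and scores 0.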

Code

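The following Python code uses binary classification as an example. The first parameter is a list of documents, each document being a string of words joined by spaces; the second is a list of labels giving the class of each document; the third determines what fraction of the top-ranked words is kept.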

# documents = [document_1, document_2, document_3, ...]
# document_i = "word_1 word_2 word_3" 
# labels is a list of 0s and 1s, one per document
def feature_word_select(documents:list, labels:list, percentage:float):
 
    # get all words      
    word_set = set()
    for document in documents:
        words = document.split()
        word_set.update(words)
    word_list = list(word_set)
    word_list.sort()
 
    sorted_words = chi(word_list, documents, labels)
    top_k_words = sorted_words[:int(percentage * len(sorted_words))]
    
    return top_k_words

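This code first creates a set word_set and then iterates over all documents. Each document is split with split() into a list words, whose elements are added to word_set; being a set, word_set discards any word it already contains. Once collection is finished, word_set is turned into a sorted word list word_list.

The word list, document list, and label list are passed to the chi function, which returns sorted_words, the words in descending order of chi-square value.

Finally, the leading percentage share of the words is kept as the final feature words. Note that percentage is a fraction between 0 and 1, so 0.3 keeps the top 30%.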
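
The cut-off uses int(), which truncates toward zero, so the number of kept words is rounded down. A minimal sketch of that arithmetic follows; the helper `top_fraction`, its range check, and the placeholder vocabulary are hypothetical and only serve to illustrate the slicing step.

```python
def top_fraction(sorted_words, percentage):
    """Keep the leading fraction of an already-ranked word list."""
    if not 0.0 <= percentage <= 1.0:
        raise ValueError("percentage must be between 0 and 1")
    k = int(percentage * len(sorted_words))  # int() truncates toward zero
    return sorted_words[:k]


ranked = [f"w{i}" for i in range(25)]   # stand-in for a ranked 25-word vocabulary
print(len(top_fraction(ranked, 0.3)))   # 7, because 0.3 * 25 is truncated to 7
print(len(top_fraction(ranked, 0.03)))  # 0, because 0.03 * 25 is truncated to 0
```

With a small vocabulary or a small fraction the result can therefore be empty, which is worth checking before using the selected words downstream.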

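The function cal_chi_word_class() below computes CHI(word, 0) and CHI(word, 1). Here A1 denotes the count A for class 1, and A0 the count A for class 0.

It is worth noting that in a binary problem A1 equals B0 and C1 equals D0 (likewise B1 equals A0 and D1 equals C0). Computing A1, B1, C1, D1 is therefore enough to derive A0, B0, C0, D0.

In addition, the total number of documents N is a common factor in the numerators of CHI(word, 0) and CHI(word, 1) and never changes, so it can be left out of the computation. Since A1+C1 = B0+D0 and B1+D1 = A0+C0, the factor (A+C)(B+D) in the denominator is the same for both classes (it is simply the product of the two class sizes), so it can be dropped as well. Both simplifications rescale every word's score by the same constant and leave the ranking unchanged.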

# calculate chi(word,1) and chi(word,0)
def cal_chi_word_class(word, labels, documents):
    N = len(documents)
    A1, B1, C1, D1 = 0., 0., 0., 0.
    A0, B0, C0, D0 = 0., 0., 0., 0.
    for i in range(len(documents)):
        if word in documents[i].split():
            if labels[i] == 1:
                A1 += 1
                B0 += 1
            else:
                B1 += 1
                A0 += 1
        else:
            if labels[i] == 1:
                C1 += 1
                D0 += 1
            else:
                D1 += 1
                C0 += 1
    chi_word_1 = N * (A1*D1-C1*B1)**2 / ((A1+C1)*(B1+D1)*(A1+B1)*(C1+D1))
    chi_word_0 = N * (A0*D0-C0*B0)**2 / ((A0+C0)*(B0+D0)*(A0+B0)*(C0+D0))
    return chi_word_1, chi_word_0

After simplification:

# calculate chi(word,1) and chi(word,0)
def cal_chi_word_class(word, labels, documents):
    A1, B1, C1, D1 = 0., 0., 0., 0.
    for i in range(len(documents)):
        if word in documents[i].split():
            if labels[i] == 1:
                A1 += 1
            else:
                B1 += 1
        else:
            if labels[i] == 1:
                C1 += 1
            else:
                D1 += 1
    A0, B0, C0, D0 = B1, A1, D1, C1
    chi_word_1 = (A1*D1-C1*B1)**2 / ((A1+B1)*(C1+D1))
    chi_word_0 = (A0*D0-C0*B0)**2 / ((A0+B0)*(C0+D0))
    return chi_word_1, chi_word_0
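
To check that the simplification only rescales the scores, the sketch below compares the full and simplified expressions on the article's sample data. The helper names `counts`, `full_chi`, and `simplified_chi` are introduced here for illustration and are not part of the original code. It also shows that, in the binary case, the class-0 and class-1 values coincide.

```python
def counts(word, documents, labels):
    """Return (A1, B1, C1, D1) for class 1, counted as in the functions above."""
    a = b = c = d = 0
    for doc, label in zip(documents, labels):
        present = word in doc.split()
        if present and label == 1:
            a += 1
        elif present:
            b += 1
        elif label == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def full_chi(a, b, c, d):
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


def simplified_chi(a, b, c, d):
    return (a * d - c * b) ** 2 / ((a + b) * (c + d))


documents = ["today i am happy !", "she is not happy at all", "let us go shopping !",
             "mike was so sad last night", "amy did not love it", "it is so amazing !"]
labels = [1, 0, 1, 0, 0, 1]

n = len(labels)
n1 = labels.count(1)
scale = n / (n1 * (n - n1))  # N / ((A1+C1) * (B1+D1)), identical for every word

for word in ["!", "not", "happy"]:
    a1, b1, c1, d1 = counts(word, documents, labels)
    a0, b0, c0, d0 = b1, a1, d1, c1  # class-0 table: swap the two columns
    # In the binary case both classes give the same value.
    assert simplified_chi(a1, b1, c1, d1) == simplified_chi(a0, b0, c0, d0)
    # The full value is the simplified value times a word-independent constant.
    assert abs(full_chi(a1, b1, c1, d1) - scale * simplified_chi(a1, b1, c1, d1)) < 1e-12
    print(word, simplified_chi(a1, b1, c1, d1), full_chi(a1, b1, c1, d1))
# ! 9.0 6.0
# not 4.5 3.0
# happy 0.0 0.0
```

Because the scale factor (6 / 9 for this data) does not depend on the word, sorting by either score yields the same order.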

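The chi function calls cal_chi_word_class to compute the chi-square value of every word, stores the values in a dictionary keyed by word, sorts the dictionary items by value, and extracts the words in sorted order.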

def chi(word_list, documents, labels):
    P1 = labels.count(1) / len(documents)
    P0 = 1 - P1
    dic = {}
    for word in word_list:
        chi_word_1, chi_word_0 = cal_chi_word_class(word, labels, documents)
        chi_word = P0 * chi_word_0 + P1 * chi_word_1
        dic[word] = chi_word
    sorted_list = sorted(dic.items(), key=lambda x:x[1], reverse=True)
    sorted_chi_word = [x[0] for x in sorted_list]
    return sorted_chi_word
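
The functions above re-split every document for every word. One possible alternative, sketched here under assumptions, tokenizes each document once and derives the four counts from per-class document frequencies. The function name `chi_ranked` and the decision to score a word that appears in every document as 0 (instead of dividing by zero) are choices made for this sketch, not the author's. Since the two class values coincide in the binary case and P0 + P1 = 1, a single simplified value per word is used.

```python
from collections import Counter


def chi_ranked(documents, labels):
    """Rank words by the simplified binary score with one pass over the corpus."""
    n1 = labels.count(1)
    n0 = len(labels) - n1
    df_pos = Counter()  # class-1 documents containing the word (A1)
    df_neg = Counter()  # class-0 documents containing the word (B1)
    for doc, label in zip(documents, labels):
        target = df_pos if label == 1 else df_neg
        target.update(set(doc.split()))  # set(): count each word once per document

    scores = {}
    for word in sorted(set(df_pos) | set(df_neg)):
        a1, b1 = df_pos[word], df_neg[word]
        c1, d1 = n1 - a1, n0 - b1
        denominator = (a1 + b1) * (c1 + d1)
        # A word present in every document makes the denominator zero; score it 0.
        scores[word] = (a1 * d1 - c1 * b1) ** 2 / denominator if denominator else 0.0
    # sorted() is stable, so ties keep their alphabetical order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


documents = ["today i am happy !", "she is not happy at all", "let us go shopping !",
             "mike was so sad last night", "amy did not love it", "it is so amazing !"]
labels = [1, 0, 1, 0, 0, 1]
print(chi_ranked(documents, labels)[:3])  # [('!', 9.0), ('not', 4.5), ('all', 1.8)]
```

On the sample data this gives the same leading words as the original functions.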

Test code. The data-processing step is skipped here; each element document_i of the documents list is a string of words or symbols joined by spaces.

def main():
    documents = ["today i am happy !", "she is not happy at all", "let us go shopping !",
        "mike was so sad last night", "amy did not love it", "it is so amazing !"
    ]
    labels = [1, 0, 1, 0, 0, 1]
    words = feature_word_select(documents, labels, 0.3)
    print(words)
 
if __name__ == '__main__':
    main()

The output is as follows:

['!', 'not', 'all', 'am', 'amazing', 'amy', 'at']

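Going further, you can print sorted_list (each word together with its chi-square value) inside the chi function; the result is shown below. These are not the true chi-square values but the simplified ones. To print the original values, use the full version of cal_chi_word_class().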

[('!', 9.0), ('not', 4.5), ('all', 1.8), ('am', 1.8), ('amazing', 1.8), ('amy', 1.8), ('at', 1.8), ('did', 1.8), ('go', 1.8), ('i', 1.8), ('last', 1.8), ('let', 1.8), ('love', 1.8), ('mike', 1.8), ...]

Complete code

# calculate chi(word,1) and chi(word,0)
def cal_chi_word_class(word, labels, documents):
    A1, B1, C1, D1 = 0., 0., 0., 0.
    for i in range(len(documents)):
        if word in documents[i].split():
            if labels[i] == 1:
                A1 += 1
            else:
                B1 += 1
        else:
            if labels[i] == 1:
                C1 += 1
            else:
                D1 += 1
    A0, B0, C0, D0 = B1, A1, D1, C1
    chi_word_1 = (A1*D1-C1*B1)**2 / ((A1+B1)*(C1+D1))
    chi_word_0 = (A0*D0-C0*B0)**2 / ((A0+B0)*(C0+D0))
    return chi_word_1, chi_word_0
 
def chi(word_list, documents, labels):
    P1 = labels.count(1) / len(documents)
    P0 = 1 - P1
    dic = {}
    for word in word_list:
        chi_word_1, chi_word_0 = cal_chi_word_class(word, labels, documents)
        chi_word = P0 * chi_word_0 + P1 * chi_word_1
        dic[word] = chi_word
    sorted_list = sorted(dic.items(), key=lambda x:x[1], reverse=True)
    sorted_chi_word = [x[0] for x in sorted_list]
    return sorted_chi_word
 
# documents = [document_1, document_2, document_3, ...]
# document_i = "word_1 word_2 word_3" 
# labels is a list of 0s and 1s, one per document
def feature_word_select(documents:list, labels:list, percentage:float):
    # get all words      
    word_set = set()
    for document in documents:
        words = document.split()
        word_set.update(words)
    word_list = list(word_set)
    word_list.sort()
 
    sorted_words = chi(word_list, documents, labels)
    top_k_words = sorted_words[:int(percentage * len(sorted_words))]
    
    return top_k_words
 
def main():
    documents = ["today i am happy !", "she is not happy at all", "let us go shopping !",
        "mike was so sad last night", "amy did not love it", "it is so amazing !"
    ]
    labels = [1, 0, 1, 0, 0, 1]
    words = feature_word_select(documents, labels, 0.3)
    print(words)
 
if __name__ == '__main__':
    main()