Extracting Keywords from Text with Python and TFIDF

Updated: April 25, 2022 11:09:04  Author: 云朵君

TFIDF weights a term in proportion to how often it appears in a document, offset by the number of documents that contain it. This article uses TFIDF to extract keywords from text.

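Introduction

Keyword extraction automatically pulls a set of representative phrases out of a document so that they concisely summarize its content. A keyword is a short phrase (usually one to three words) that captures the key ideas of a document, clearly reflects the topic under discussion, and provides a summary of its content.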

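The keyword/keyphrase extraction process consists of the following steps:

Preprocessing: clean the documents to remove noise.
Forming candidate tokens: build n-gram tokens as candidate keywords.
Keyword weighting: use a TFIDF vectorizer to compute the TFIDF weight of each n-gram token (keyphrase).
Ranking: sort the candidates in descending order of TFIDF weight.
Selecting the top N keywords.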
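The candidate-forming step can be illustrated with a small sketch. The helper name make_ngrams and the sample tokens are assumptions made for illustration; the article itself relies on TfidfVectorizer to build the n-grams.

```python
def make_ngrams(tokens, n_min=1, n_max=3):
    """Return every contiguous n-gram of length n_min..n_max as a string."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of size n over the token list
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Sample tokens, assumed for illustration only
tokens = ["cone", "tree", "visualization"]
candidates = make_ngrams(tokens)
print(candidates)
```

Three tokens yield six candidates: three unigrams, two bigrams, and one trigram.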

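Term Frequency-Inverse Document Frequency (TFIDF)

TFIDF increases a term's weight in proportion to the number of times it appears in a document, but offsets that weight by the number of documents that contain the term. As a result, words such as "this" and "is", which appear in nearly every document, do not receive high weights. A word that appears many times in only a few documents, on the other hand, receives a higher weight, because it is likely to indicate what those documents are about.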

Term Frequency

Term frequency (TF) is defined as the number of times word (i) appears in document (j), divided by the total number of words in that document.
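In symbols, the definition can be written as follows. The notation is an assumption, since the article states the definition only in words: n_{i,j} is the number of times word i occurs in document j, and the sum runs over every word k in document j.

```latex
\mathrm{TF}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
```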

Inverse Document Frequency

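Inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents that contain the word. The logarithm dampens the effect of very large ratios, which would otherwise dominate the weight.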
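Written as a formula, with notation assumed for illustration: N is the total number of documents and df_i is the number of documents that contain word i.

```latex
\mathrm{IDF}_{i} = \log \frac{N}{\mathrm{df}_{i}}
```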

TFIDF

TFIDF is computed by multiplying the term frequency by the inverse document frequency.
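A minimal sketch of these definitions in plain Python may help make the arithmetic concrete. The toy corpus and the function names tf, idf, and tfidf are assumptions made for illustration, not part of the article's pipeline.

```python
import math

# Toy corpus, assumed for illustration only
docs = [
    "the cone tree is a tree".split(),
    "the shadow is dark".split(),
    "the tree casts a shadow".split(),
]


def tf(word, doc):
    # Occurrences of the word divided by the document length
    return doc.count(word) / len(doc)


def idf(word, docs):
    # log(total documents / documents containing the word);
    # assumes the word occurs in at least one document
    df = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / df)


def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)


for word in ["the", "tree", "cone"]:
    print(word, round(tfidf(word, docs[0], docs), 4))
```

In this toy corpus "the" scores zero because it appears in every document, while "cone", which occurs in only one document, scores highest for the first document.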

TFIDF in Python

TFIDF vectorization is easy to do with the sklearn library.

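from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus (an assumption; any list of document strings works)
corpus = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix, one row per document
print(X.toarray())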
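The values printed by TfidfVectorizer will generally not match the textbook formula above. According to the scikit-learn documentation, the default settings (smooth_idf=True, norm='l2') use a smoothed IDF, ln((1 + n) / (1 + df)) + 1, applied to raw term counts, and then scale each row to unit L2 norm. The plain-Python sketch below mirrors that description on an assumed toy corpus; treat it as an approximation of the idea rather than a drop-in reimplementation, since it skips sklearn's tokenization rules.

```python
import math
from collections import Counter

# Toy corpus, assumed for illustration only
corpus = ["the cat sat", "the dog sat down"]
docs = [text.split() for text in corpus]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)


def smoothed_idf(word):
    # Smoothed variant: ln((1 + n) / (1 + df)) + 1
    df = sum(1 for doc in docs if word in doc)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    counts = Counter(doc)
    # Raw counts times smoothed IDF, then scale the row to unit L2 norm
    raw = [counts[word] * smoothed_idf(word) for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]


print(vocab)
for doc in docs:
    print([round(value, 3) for value in tfidf_row(doc)])
```

With smoothing, a word that appears in every document still gets a non-zero weight, which is one visible difference from the unsmoothed definition.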

Preparing the Python libraries

import re
import string

import numpy as np
import pandas as pd
import spacy
import nltk
import nltk.data
from nltk import word_tokenize, sent_tokenize, pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download the NLTK resources used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!

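This article relies mainly on the nltk library. If you have not used it before, you need to run pip install nltk and also download resources such as the stop-word list. Alternatively, download the entire nltk_data collection from the official website.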

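Preparing the dataset

The Theses100 benchmark dataset[1] is used to evaluate the keyword extraction method. It consists of 100 complete master's and doctoral theses from the University of Waikato in New Zealand. The version used here contains only 99 files; the remaining file, which has no keywords, was removed. The thesis topics are highly varied, ranging from chemistry, computer science, and economics to psychology, philosophy, history, and more. Each document has about 7.67 gold-standard keywords on average.

Download the dataset to your local machine; the rest of this article assumes the data files already exist locally. The code below retrieves the documents and their keywords and stores the result in a data frame.

For demonstration purposes, only 20 of the documents are selected.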

import os
path = "./data/theses100/"
all_files = os.listdir(path + "docsutf8")
all_keys = os.listdir(path + "keys")
print(len(all_files), " files\n", all_files,
      "\n", all_keys)  # sorting is not required

all_documents = []
all_keys = []
all_files_names = []
for i, fname in enumerate(all_files):
    with open(path + 'docsutf8/' + fname, encoding='utf-8') as f:
        lines = f.readlines()
    key_name = fname[:-4]  # drop the 4-character file extension
    with open(path + 'keys/' + key_name + '.key', encoding='utf-8') as f:
        k = f.readlines()
    all_text = ' '.join(lines)
    keyss = ' '.join(k)
    all_documents.append(all_text)
    all_keys.append(keyss.split("\n"))  # one gold keyword per line
    all_files_names.append(key_name)

import pandas as pd
dtf = pd.DataFrame({'goldkeys': all_keys,
                    'text': all_documents})
dtf.head()
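The loading step can also be wrapped in a function that caps the number of documents, which is one way to select a 20-document subset. The sketch below is an assumption-laden variant, not the author's code: the name load_theses, the limit parameter, and the ".txt" extension are hypothetical, and the docsutf8/keys directory layout is taken from the code above.

```python
from pathlib import Path


def load_theses(root, limit=None):
    """Return (names, documents, gold keyword lists) read from root.

    Assumes root/docsutf8/<name>.txt and root/keys/<name>.key;
    `limit` caps the number of document files that are read.
    """
    root = Path(root)
    names, documents, gold_keys = [], [], []
    doc_paths = sorted((root / "docsutf8").glob("*.txt"))
    for doc_path in doc_paths[:limit]:
        key_path = root / "keys" / (doc_path.stem + ".key")
        if not key_path.exists():
            continue  # skip documents without a keyword file
        text = doc_path.read_text(encoding="utf-8")
        keys = key_path.read_text(encoding="utf-8").splitlines()
        names.append(doc_path.stem)
        documents.append(text)
        # Drop blank lines so that no empty keyword is kept
        gold_keys.append([k.strip() for k in keys if k.strip()])
    return names, documents, gold_keys
```

A call such as load_theses("./data/theses100/", limit=20) would return three parallel lists that can be passed to pd.DataFrame in the same way as above.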

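Text preprocessing

Preprocessing includes tokenization, lemmatization, lowercasing, removing numbers, stripping whitespace, removing words shorter than three letters, removing stop words, and removing symbols and punctuation. These steps are implemented in the preprocess_text function, which is listed in the appendix at the end of the article.

Lemmatization uses WordNetLemmatizer, which does not change the root of a word.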

dtf['cleaned_text'] = dtf.text.apply(lambda x: ' '.join(preprocess_text(x)))
dtf.head()

Next, clean each document's goldkeys and lemmatize them so that they can later be matched against the keywords that the TFIDF method generates.

# Clean the gold-standard keywords: strip whitespace and noise
def clean_orginal_kw(orginal_kw):
    orginal_kw_clean = []
    for doc_kw in orginal_kw:
        temp = []
        for t in doc_kw:
            tt = ' '.join(preprocess_text(t))
            # Keep only keywords that are non-empty after cleaning
            if len(tt.split()) > 0:
                temp.append(tt)
        orginal_kw_clean.append(temp)
    return orginal_kw_clean

orginal_kw = clean_orginal_kw(dtf['goldkeys'])
orginal_kw[0:1]

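TFIDF keyword extraction

1. Generating n-grams and weighting them

First, import TfidfVectorizer from sklearn's text feature extraction package.

Next, set use_idf=True so that the inverse document frequency is used together with the term frequency. Setting max_df=0.5 keeps only terms that appear in at most 50% of the documents (here, at most 49 of the 99 documents). A term that appears in 50 or more documents is dropped, because it is considered non-discriminative at the corpus level. The n-gram range is set to 1 to 3 (a larger upper bound is possible, but according to the statistics of this dataset, most keywords are 1 to 3 words long).

Then generate the document vectors.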
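The effect of max_df can be sketched without sklearn. The snippet below is an illustration under assumptions: the toy documents and the helper name filter_by_max_df are hypothetical, and the rule it applies (drop a term when its document frequency exceeds max_df times the number of documents) follows the behavior described above.

```python
from collections import Counter

# Toy tokenized corpus, assumed for illustration only
docs = [
    ["thesis", "cone", "tree"],
    ["thesis", "shadow"],
    ["thesis", "cone"],
    ["thesis", "unix"],
]


def filter_by_max_df(docs, max_df=0.5):
    """Keep terms whose document frequency is at most max_df * len(docs)."""
    # Count each term once per document
    doc_freq = Counter(term for doc in docs for term in set(doc))
    max_doc_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if df <= max_doc_count)


print(filter_by_max_df(docs))
print(0.5 * 99)  # threshold for a 99-document corpus
```

Here "thesis" occurs in all four documents and is dropped, while "cone" occurs in exactly half of them and is kept. For 99 documents the threshold is 49.5, so a term found in 49 documents stays and one found in 50 is removed.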

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(use_idf=True,
                             max_df=0.5, min_df=1,
                             ngram_range=(1, 3))
vectors = vectorizer.fit_transform(dtf['cleaned_text'])

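Next, build a dictionary for each document in which the keys are the n-grams and the values are their TFIDF weights. dict_of_tokens maps each column index of the matrix back to its n-gram, and the tfidf_vectors list stores the dictionaries of all documents.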

# Invert the vocabulary: column index -> n-gram
dict_of_tokens = {i[1]: i[0] for i in vectorizer.vocabulary_.items()}

tfidf_vectors = []  # one {n-gram: TFIDF weight} dict per document
for row in vectors:
    tfidf_vectors.append({dict_of_tokens[column]: value for
                          (column, value) in
                          zip(row.indices, row.data)})

Take a look at the dictionary for the first document:

print("The number of document vectors = ", len(tfidf_vectors),
      "\nThe dictionary of document[0] :", tfidf_vectors[0])

[Figure: dictionary contents of the first document]

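There are as many dictionaries as there are documents, and the dictionary for the first document contains each n-gram together with its TFIDF weight.

2. Ranking the keyphrases by TFIDF weight

The next step is simply to sort the n-grams in each dictionary by TFIDF weight in descending order. Passing reverse=True selects descending order.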

doc_sorted_tfidfs = []  # per-document n-grams sorted by TFIDF weight
# Sort each document's dictionary
for dn in tfidf_vectors:
    newD = sorted(dn.items(), key=lambda x: x[1], reverse=True)
    newD = dict(newD)
    doc_sorted_tfidfs.append(newD)

Now the keyword lists can be obtained without the weights:

tfidf_kw = []
for doc_tfidf in doc_sorted_tfidfs:
    ll = list(doc_tfidf.keys())
    tfidf_kw.append(ll)

Select the top 5 keywords for the first document.

TopN = 5
print(tfidf_kw[0][0:TopN])

['cone', 'cone tree', 
'dimensional', 'shadow',
'visualization']
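When only the top N keywords are needed, a full sort of every dictionary is not strictly required. The sketch below uses heapq.nlargest from the standard library as an alternative; the function name top_n_keywords and the weights are hypothetical values chosen for illustration, not results from the dataset.

```python
import heapq

# Hypothetical n-gram weights for one document, for illustration only
weights = {"graph": 0.41, "graph layout": 0.33, "node": 0.21,
           "edge": 0.25, "label": 0.05, "cluster": 0.18}


def top_n_keywords(weights, n=5):
    """Return the n highest-weighted n-grams, best first."""
    return heapq.nlargest(n, weights, key=weights.get)


print(top_n_keywords(weights, n=3))
```

For small N this avoids sorting the whole vocabulary of a document, though for dictionaries of this size the difference is negligible.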

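Performance evaluation

The method above is enough to extract keywords or keyphrases. The rest of this section evaluates its effectiveness more rigorously, using the standard metrics for this kind of task.

The evaluation first uses exact matching: a keyphrase extracted automatically from a document must match one of the document's gold-standard keywords exactly.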

def get_exact_intersect(doc_orginal_kw, doc_my_kw):
    general = []
    for kw in doc_my_kw:
        for kww in doc_orginal_kw:
            if kw == kww:
                # Record each matched gold keyword only once
                if kww not in general:
                    general.append(kww)
    return general

get_exact_intersect(orginal_kw[0], tfidf_kw[0])

['visualization',
 'animation',
 'unix',
 'dimension',
 'cod',
 'icon',
 'shape',
 'fisheye lens',
 'rapid prototyping',
 'script language',
 'tree structure',
 'programming language']
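The same exact-match result can be computed with set lookups instead of a nested loop. This is a sketch of an equivalent formulation, not the article's code; the function name exact_matches and the sample lists are assumptions made for illustration.

```python
def exact_matches(gold_keywords, predicted_keywords):
    """Return predicted keywords that also appear in the gold list.

    Order follows the predicted ranking and duplicates are dropped.
    """
    gold = set(gold_keywords)
    seen = set()
    matches = []
    for kw in predicted_keywords:
        if kw in gold and kw not in seen:
            seen.add(kw)
            matches.append(kw)
    return matches


# Sample lists, assumed for illustration only
gold = ["fisheye lens", "unix", "animation"]
predicted = ["unix", "cone", "animation", "unix"]
print(exact_matches(gold, predicted))
```

Because membership tests on a set take roughly constant time, the cost grows with the sum of the two list lengths rather than their product, which matters when each document has thousands of candidate n-grams.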

Keyword extraction is a ranking problem. One of the most commonly used ranking metrics is mean average precision at K (MAP@K). To compute MAP@K, precision at K (P@K) is first taken as the basic measure of ranking quality for a single document.
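The relationship between the three quantities can be written out as follows. The notation is an assumption introduced here: h_i is the number of distinct gold keywords among the top i predictions, rel(i) is 1 if the prediction at rank i is a gold keyword not already predicted at an earlier rank and 0 otherwise, m is the number of gold keywords for the document, and D is the number of documents. The normalization by min(m, K) follows the apk function shown next; if a document has fewer than K predictions, the sum stops at the last prediction.

```latex
P@i = \frac{h_i}{i}, \qquad
AP@K = \frac{1}{\min(m, K)} \sum_{i=1}^{K} P@i \cdot \mathrm{rel}(i), \qquad
MAP@K = \frac{1}{D} \sum_{d=1}^{D} AP@K^{(d)}
```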

def apk(kw_actual, kw_predicted, k=10):
    # Average precision at k for a single document
    if len(kw_predicted) > k:
        kw_predicted = kw_predicted[:k]
    score = 0.0
    num_hits = 0.0
    for i, p in enumerate(kw_predicted):
        # Count a hit only the first time a gold keyword is predicted
        if p in kw_actual and p not in kw_predicted[:i]:
            num_hits += 1.0
            score += num_hits / (i + 1.0)
    if not kw_actual:
        return 0.0
    return score / min(len(kw_actual), k)

def mapk(kw_actual, kw_predicted, k=10):
    # Mean of the per-document average precision values
    return np.mean([apk(a, p, k) for a, p in zip(kw_actual, kw_predicted)])

The apk function takes two arguments: the gold-standard keyword list (kw_actual) and the keyword list predicted by the TFIDF method (kw_predicted). The default value of k is 10. The MAP values are printed here for k = 5, 10, 20, and 40.

for k in [5, 10, 20, 40]:
    mpak = mapk(orginal_kw, tfidf_kw, k)
    print("mean average precision  @", k,
          '=  {0:.4g}'.format(mpak))

mean average precision  @ 5 =  0.2037
mean average precision  @ 10 =  0.1379
mean average precision  @ 20 =  0.08026
mean average precision  @ 40 =  0.05371
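A small worked example may help in reading these numbers. The function below restates the logic of apk so that the snippet runs on its own; the name average_precision_at_k and the sample keyword lists are assumptions made for illustration.

```python
def average_precision_at_k(actual, predicted, k=10):
    """Same logic as apk above, restated so the example runs on its own."""
    if not actual:
        return 0.0
    hits, score = 0, 0.0
    for rank, kw in enumerate(predicted[:k], start=1):
        # A hit counts only the first time a gold keyword is predicted
        if kw in actual and kw not in predicted[:rank - 1]:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(actual), k)


# Sample lists, assumed for illustration only
actual = ["unix", "animation", "icon"]
predicted = ["unix", "cone", "animation", "shadow", "icon"]

# Hits at ranks 1, 3 and 5: (1/1 + 2/3 + 3/5) / 3
print(average_precision_at_k(actual, predicted, k=5))
# Only the first three predictions count: (1/1 + 2/3) / 3
print(average_precision_at_k(actual, predicted, k=3))
```

A perfect ranking, in which every gold keyword appears at the top in any order, gives a score of 1.0; each incorrect prediction placed ahead of a gold keyword lowers the precision recorded at that keyword's rank.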

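This article presented a simple way to extract keywords from documents with TFIDF and Python, with the code explained step by step, and evaluated the method as a ranking task using the MAP metric. Although the approach is simple, it is effective and is regarded as one of the strong baselines in this field.

Appendix

The preprocess_text function used for text preprocessing.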

def preprocess_text(text):
    # 1. Clean the text and tokenize it
    text = remove_numbers(text)
    text = remove_http(text)
    text = remove_punctuation(text)
    text = convert_to_lower(text)
    text = remove_white_space(text)
    text = remove_short_words(text)
    tokens = toknizing(text)
    # 2. POS tagging
    pos_map = {'J': 'a', 'N': 'n', 'R': 'r', 'V': 'v'}
    pos_tags_list = pos_tag(tokens)
    # 3. Lowercase and lemmatize; tags not in pos_map default to verb
    lemmatiser = WordNetLemmatizer()
    tokens = [lemmatiser.lemmatize(w.lower(),
                                   pos=pos_map.get(p[0], 'v'))
              for w, p in pos_tags_list]
    return tokens

def convert_to_lower(text):
    # Convert to lowercase
    return text.lower()

def remove_numbers(text):
    # Remove digits
    text = re.sub(r'\d+', '', text)
    return text

def remove_http(text):
    # Remove URLs (this pattern only matches t.co short links)
    text = re.sub(r"https?://t\.co/[A-Za-z0-9]*", ' ', text)
    return text

def remove_short_words(text):
    # Remove words shorter than three letters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    return text

def remove_punctuation(text):
    # Remove symbols and punctuation
    punctuations = r'''!()[]{};?№?:'"\,`<>./?@=#$-(%^)+&[*_]~'''
    no_punct = ""

    for char in text:
        if char not in punctuations:
            no_punct = no_punct + char
    return no_punct

def remove_white_space(text):
    # Strip leading and trailing whitespace
    text = text.strip()
    return text

def toknizing(text):
    # The original code reads a custom list named my_stopwords that the
    # article never defines; NLTK's English stop-word list is assumed
    # here as a stand-in
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    # Remove stop words from the tokens
    result = [i for i in tokens if i not in stop_words]
    return result
