Extracting Keywords with the TextRank Algorithm in Python
TextRank is a PageRank-based algorithm commonly used for keyword extraction and text summarization. In this article, I will walk through a keyword-extraction example to help you understand how TextRank works, and show a Python implementation.
1. A Brief Introduction to PageRank
There are many articles about PageRank, so I will only introduce it briefly. It will help us understand TextRank later, since TextRank is based on it.
PageRank (PR) is an algorithm for computing the weight of web pages. We can view all web pages as one large directed graph. In this graph, each node is a web page. If page A has a link to page B, it can be represented as a directed edge from A to B.
After building the whole graph, we can assign a weight to each web page with the following formula:

$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{S(V_j)}{|Out(V_j)|}$

Here $S(V_i)$ is the weight of page $V_i$, $d$ is a damping factor (usually 0.85), $In(V_i)$ is the set of pages linking to $V_i$, and $Out(V_j)$ is the set of pages that $V_j$ links to.
Here is an example to better understand the notation above. We have a graph that represents how the web pages link to each other: a links to e, and b links to both e and f. Each node represents a web page, and each arrow represents a directed edge. We want to get the weight of web page e.
We can rewrite the summation part of the formula above in a simpler form:

$S(V_i) = (1 - d) + d \sum_{V_j} w_{ji} S(V_j)$

where the edge weight $w_{ji}$ equals $1/|Out(V_j)|$ if $V_j$ links to $V_i$, and 0 otherwise.
We can then get the weight of web page e with the function below:

$S(e) = (1 - d) + d \cdot (1 \cdot S(a) + 0.5 \cdot S(b))$

because e's only inbound links come from a (whose single outbound link points to e) and b (half of whose outbound links point to e).
We can see that the weight of web page e depends on the weights of its inbound pages. We need to run this iteration many times to get the final weights. At initialization, the importance of every web page is 1.
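To make the iteration concrete: with every initial weight set to 1 and $d = 0.85$, the first update for e is

$S(e) = 0.15 + 0.85 \cdot (1 \cdot 1 + 0.5 \cdot 1) = 0.15 + 0.85 \cdot 1.5 = 1.425$

which matches the first line of output from the code in the next section.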
2. Implementing PageRank
We can use a matrix to represent the inbound and outbound links among a, b, e, and f in the graph:

       a    b    e    f
  a    0    0    0    0
  b    0    0    0    0
  e    1    1    0    0
  f    0    1    0    0
Each row lists a node's inbound links from the other nodes. For example, the e row shows that nodes a and b have outbound links pointing to node e. This representation will simplify the computation when we update the weights.
According to the formula, each inbound page's weight is divided by that page's number of outbound links, so we should normalize each column of the matrix:

       a    b    e    f
  a    0    0    0    0
  b    0    0    0    0
  e    1   0.5   0    0
  f    0   0.5   0    0

Column b is divided by 2 because b has two outbound links.
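As a minimal sketch (my own illustration; the variable names here are mine, not from the original code), the column normalization can be done in NumPy by dividing each column by its sum:

import numpy as np

# Raw adjacency matrix: each row lists a node's inbound links (a, b, e, f)
g = np.array([[0, 0, 0, 0],
              [0, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

# Each column sums to that node's number of outbound links
norm = np.sum(g, axis=0)

# Divide each column by its sum; out=zeros leaves all-zero columns at 0
g_norm = np.divide(g, norm, out=np.zeros_like(g), where=norm != 0)

print(g_norm)
# [[0.  0.  0.  0. ]
#  [0.  0.  0.  0. ]
#  [1.  0.5 0.  0. ]
#  [0.  0.5 0.  0. ]]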
We multiply this matrix by the weight vector of all the nodes to update the weights. Note that this matrix-vector product alone is just one iteration, without the damping factor d.
We can use Python to run the iteration many times:
import numpy as np

g = [[0, 0,   0, 0],
     [0, 0,   0, 0],
     [1, 0.5, 0, 0],
     [0, 0.5, 0, 0]]
g = np.array(g)
pr = np.array([1, 1, 1, 1])  # initialization: the weight of a, b, e, f is 1
d = 0.85  # damping factor

for epoch in range(10):
    pr = (1 - d) + d * np.dot(g, pr)  # one PageRank iteration
    print(epoch)
    print(pr)
0
[0.15 0.15 1.425 0.575]
1
[0.15 0.15 0.34125 0.21375]
2
[0.15 0.15 0.34125 0.21375]
3
[0.15 0.15 0.34125 0.21375]
4
[0.15 0.15 0.34125 0.21375]
5
[0.15 0.15 0.34125 0.21375]
6
[0.15 0.15 0.34125 0.21375]
7
[0.15 0.15 0.34125 0.21375]
8
[0.15 0.15 0.34125 0.21375]
9
[0.15 0.15 0.34125 0.21375]
So the weight (PageRank value) of web page e is 0.34125.
If we turn the directed edges into undirected edges, we can change the matrix accordingly:

       a    b    e    f
  a    0    0    1    0
  b    0    0    1    1
  e    1    1    0    0
  f    0    1    0    0
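One way to derive this undirected matrix from the directed one is to mirror it across the diagonal. Here is a small sketch of that idea (the full TextRank4Keyword class later in this article uses the same trick in its symmetrize method):

import numpy as np

def symmetrize(a):
    # Mirror across the diagonal so that a[i][j] == a[j][i]
    return a + a.T - np.diag(a.diagonal())

# Directed adjacency matrix for a, b, e, f from the previous section
directed = np.array([[0, 0, 0, 0],
                     [0, 0, 0, 0],
                     [1, 1, 0, 0],
                     [0, 1, 0, 0]], dtype=float)

print(symmetrize(directed))
# [[0. 0. 1. 0.]
#  [0. 0. 1. 1.]
#  [1. 1. 0. 0.]
#  [0. 1. 0. 0.]]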
Normalize each column again (for example, e now has two undirected edges, so its column is divided by 2):

       a    b    e    f
  a    0    0   0.5   0
  b    0    0   0.5   1
  e    1   0.5   0    0
  f    0   0.5   0    0
We should change the code accordingly:
import numpy as np

g = [[0, 0,   0.5, 0],
     [0, 0,   0.5, 1],
     [1, 0.5, 0,   0],
     [0, 0.5, 0,   0]]
g = np.array(g)
pr = np.array([1, 1, 1, 1])  # initialization: the weight of a, b, e, f is 1
d = 0.85  # damping factor

for epoch in range(10):
    pr = (1 - d) + d * np.dot(g, pr)  # one PageRank iteration
    print(epoch)
    print(pr)
0
[0.575 1.425 1.425 0.575]
1
[0.755625 1.244375 1.244375 0.755625]
2
[0.67885937 1.32114062 1.32114062 0.67885937]
3
[0.71148477 1.28851523 1.28851523 0.71148477]
4
[0.69761897 1.30238103 1.30238103 0.69761897]
5
[0.70351194 1.29648806 1.29648806 0.70351194]
6
[0.70100743 1.29899257 1.29899257 0.70100743]
7
[0.70207184 1.29792816 1.29792816 0.70207184]
8
[0.70161947 1.29838053 1.29838053 0.70161947]
9
[0.70181173 1.29818827 1.29818827 0.70181173]
So the weight (PageRank value) of web page e is now 1.29818827.
3. How TextRank Works
So what is the difference between TextRank and PageRank?
In short, PageRank is for web page ranking, and TextRank is for text ranking. The web pages in PageRank correspond to text units (here, words) in TextRank, so the basic idea is the same.
We split a document into several sentences, and we keep only the words that have certain POS tags. We use spaCy for part-of-speech tagging.
import spacy

nlp = spacy.load('en_core_web_sm')

content = '''
The Wandering Earth, described as China's first big-budget science fiction thriller, quietly made it onto screens at AMC theaters in North America this weekend, and it shows a new side of Chinese filmmaking — one focused toward futuristic spectacles rather than China's traditionally grand, massive historical epics. At the same time, The Wandering Earth feels like a throwback to a few familiar eras of American filmmaking. While the film's cast, setting, and tone are all Chinese, longtime science fiction fans are going to see a lot on the screen that reminds them of other movies, for better or worse.
'''

doc = nlp(content)
for sents in doc.sents:
    print(sents.text)
spaCy splits the paragraph into three sentences:
The Wandering Earth, described as China's first big-budget science fiction thriller, quietly made it onto screens at AMC theaters in North America this weekend, and it shows a new side of Chinese filmmaking — one focused toward futuristic spectacles rather than China's traditionally grand, massive historical epics.
At the same time, The Wandering Earth feels like a throwback to a few familiar eras of American filmmaking.
While the film's cast, setting, and tone are all Chinese, longtime science fiction fans are going to see a lot on the screen that reminds them of other movies, for better or worse.
Because most words in a sentence are not useful for determining importance, we consider only the words with the NOUN, PROPN, or VERB POS tags. This step is optional; you can also use all of the words.
candidate_pos = ['NOUN', 'PROPN', 'VERB']
sentences = []

for sent in doc.sents:
    selected_words = []
    for token in sent:
        if token.pos_ in candidate_pos and token.is_stop is False:
            selected_words.append(token)
    sentences.append(selected_words)

print(sentences)
[[Wandering, Earth, described, China, budget, science, fiction, thriller, screens, AMC, theaters, North, America, weekend, shows, filmmaking, focused, spectacles, China, epics],
[time, Wandering, Earth, feels, throwback, eras, filmmaking],
[film, cast, setting, tone, science, fiction, fans, going, lot, screen, reminds, movies]]
Each word is a node in PageRank. We set the window size to $k$.
$[w_1, w_2, \dots, w_k]$, $[w_2, w_3, \dots, w_{k+1}]$, $[w_3, w_4, \dots, w_{k+2}]$, and so on, are the windows. Any two words inside the same window are considered to have an undirected edge between them.
Take the sentence [time, Wandering, Earth, feels, throwback, eras, filmmaking] as an example and set the window size to $k = 4$. We get four windows: [time, Wandering, Earth, feels], [Wandering, Earth, feels, throwback], [Earth, feels, throwback, eras], and [feels, throwback, eras, filmmaking].
For the window [time, Wandering, Earth, feels], any two words form a pair with an undirected edge. So we get (time, Wandering), (time, Earth), (time, feels), (Wandering, Earth), (Wandering, feels), and (Earth, feels).
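As a quick illustration, here is a small helper of my own (a sketch, not the article's implementation) that uses itertools to enumerate the undirected pairs in every sliding window; for this sentence it yields the same pairs as the get_token_pairs method of the class shown below:

from itertools import combinations

def window_pairs(words, k=4):
    # Slide a window of size k over the words and collect every
    # unordered pair that co-occurs inside some window
    pairs = set()
    for start in range(len(words) - k + 1):
        window = words[start:start + k]
        pairs.update(combinations(window, 2))
    return pairs

words = ['time', 'Wandering', 'Earth', 'feels', 'throwback', 'eras', 'filmmaking']
for pair in sorted(window_pairs(words)):
    print(pair)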
Based on this graph, we can calculate the weight of each node (word). The most important words can then be used as keywords.
4. Extracting Keywords with TextRank
Here is a complete example implemented in Python; we use spaCy to get the POS tags of the words.
from collections import OrderedDict
import numpy as np
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

class TextRank4Keyword():
    """Extract keywords from text"""

    def __init__(self):
        self.d = 0.85  # damping coefficient, usually 0.85
        self.min_diff = 1e-5  # convergence threshold
        self.steps = 10  # iteration steps
        self.node_weight = None  # saves the keywords and their weights

    def set_stopwords(self, stopwords):
        """Set stop words"""
        for word in STOP_WORDS.union(set(stopwords)):
            lexeme = nlp.vocab[word]
            lexeme.is_stop = True

    def sentence_segment(self, doc, candidate_pos, lower):
        """Store only those words whose POS tag is in candidate_pos"""
        sentences = []
        for sent in doc.sents:
            selected_words = []
            for token in sent:
                # Store words only with a candidate POS tag
                if token.pos_ in candidate_pos and token.is_stop is False:
                    if lower is True:
                        selected_words.append(token.text.lower())
                    else:
                        selected_words.append(token.text)
            sentences.append(selected_words)
        return sentences

    def get_vocab(self, sentences):
        """Get all tokens"""
        vocab = OrderedDict()
        i = 0
        for sentence in sentences:
            for word in sentence:
                if word not in vocab:
                    vocab[word] = i
                    i += 1
        return vocab

    def get_token_pairs(self, window_size, sentences):
        """Build token_pairs from the windows in sentences"""
        token_pairs = list()
        for sentence in sentences:
            for i, word in enumerate(sentence):
                for j in range(i + 1, i + window_size):
                    if j >= len(sentence):
                        break
                    pair = (word, sentence[j])
                    if pair not in token_pairs:
                        token_pairs.append(pair)
        return token_pairs

    def symmetrize(self, a):
        # Mirror the matrix across the diagonal to make edges undirected
        return a + a.T - np.diag(a.diagonal())

    def get_matrix(self, vocab, token_pairs):
        """Get the normalized matrix"""
        # Build the matrix
        vocab_size = len(vocab)
        g = np.zeros((vocab_size, vocab_size), dtype='float')
        for word1, word2 in token_pairs:
            i, j = vocab[word1], vocab[word2]
            g[i][j] = 1

        # Get a symmetric matrix
        g = self.symmetrize(g)

        # Normalize the matrix by column; out=zeros avoids division by
        # zero and leaves all-zero columns at 0
        norm = np.sum(g, axis=0)
        g_norm = np.divide(g, norm, out=np.zeros_like(g), where=norm != 0)

        return g_norm

    def get_keywords(self, number=10):
        """Print the top keywords"""
        node_weight = OrderedDict(sorted(self.node_weight.items(),
                                         key=lambda t: t[1], reverse=True))
        for i, (key, value) in enumerate(node_weight.items()):
            print(key + ' - ' + str(value))
            if i > number:
                break

    def analyze(self, text,
                candidate_pos=['NOUN', 'PROPN'],
                window_size=4, lower=False, stopwords=list()):
        """Main function to analyze text"""

        # Set stop words
        self.set_stopwords(stopwords)

        # Parse the text with spaCy
        doc = nlp(text)

        # Filter sentences
        sentences = self.sentence_segment(doc, candidate_pos, lower)  # list of lists of words

        # Build the vocabulary
        vocab = self.get_vocab(sentences)

        # Get token_pairs from the windows
        token_pairs = self.get_token_pairs(window_size, sentences)

        # Get the normalized matrix
        g = self.get_matrix(vocab, token_pairs)

        # Initialization of the weights (PageRank values)
        pr = np.array([1] * len(vocab))

        # Iteration
        previous_pr = 0
        for epoch in range(self.steps):
            pr = (1 - self.d) + self.d * np.dot(g, pr)
            if abs(previous_pr - sum(pr)) < self.min_diff:
                break
            else:
                previous_pr = sum(pr)

        # Save the weight of each node
        node_weight = dict()
        for word, index in vocab.items():
            node_weight[word] = pr[index]

        self.node_weight = node_weight
This TextRank4Keyword class implements everything described above. Let's look at its output for the same paragraph:
text = '''
The Wandering Earth, described as China's first big-budget science fiction thriller, quietly made it onto screens at AMC theaters in North America this weekend, and it shows a new side of Chinese filmmaking — one focused toward futuristic spectacles rather than China's traditionally grand, massive historical epics. At the same time, The Wandering Earth feels like a throwback to a few familiar eras of American filmmaking. While the film's cast, setting, and tone are all Chinese, longtime science fiction fans are going to see a lot on the screen that reminds them of other movies, for better or worse.
'''

tr4w = TextRank4Keyword()
tr4w.analyze(text, candidate_pos=['NOUN', 'PROPN'], window_size=4, lower=False)
tr4w.get_keywords(10)
science - 1.717603106506989
fiction - 1.6952610926181002
filmmaking - 1.4388798751402918
China - 1.4259793786986021
Earth - 1.3088154732297723
tone - 1.1145002295684114
Chinese - 1.0996896235078055
Wandering - 1.0071059904601571
weekend - 1.002449354657688
America - 0.9976329264870932
budget - 0.9857269586649321
North - 0.9711240881032547
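The analyze method exposes a few parameters worth experimenting with: candidate_pos, window_size, lower, and stopwords. As a hypothetical variation (the parameter values here are just for illustration), you could also include verbs, lowercase the tokens so that case variants such as "Wandering" and "wandering" merge into one node, and add custom stop words:

tr4w = TextRank4Keyword()
tr4w.analyze(text,
             candidate_pos=['NOUN', 'PROPN', 'VERB'],  # also rank verbs
             window_size=5,                            # a slightly wider co-occurrence window
             lower=True,                               # merge case variants into one node
             stopwords=['weekend'])                    # extra stop words to ignore
tr4w.get_keywords(10)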
That concludes this article on extracting keywords with the TextRank algorithm in Python.