Several Methods for Chinese and English Keyword Extraction in Python 3
This post introduces some simple algorithms for keyword extraction in Python 3. So far it only collects fairly simple methods; as I learn more advanced and more recent algorithms, I will keep updating this post.
1. TF-IDF-based Chinese keyword extraction: implemented with the jieba package
extracted_sentences="隨著企業(yè)持續(xù)產(chǎn)生的商品銷量,其數(shù)據(jù)對于自身營銷規(guī)劃、市場分析、物流規(guī)劃都有重要意義。但是銷量預(yù)測的影響因素繁多,傳統(tǒng)的基于統(tǒng)計(jì)的計(jì)量模型,比如時(shí)間序列模型等由于對現(xiàn)實(shí)的假設(shè)情況過多,導(dǎo)致預(yù)測結(jié)果較差。因此需要更加優(yōu)秀的智能AI算法,以提高預(yù)測的準(zhǔn)確性,從而助力企業(yè)降低庫存成本、縮短交貨周期、提高企業(yè)抗風(fēng)險(xiǎn)能力。" import jieba.analyse print(jieba.analyse.extract_tags(extracted_sentences, topK=20, withWeight=False, allowPOS=()))
Output:
```
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.457 seconds.
Prefix dict has been built successfully.
['預(yù)測', '模型', '銷量', '降低庫存', '企業(yè)', 'AI', '規(guī)劃', '提高', '準(zhǔn)確性', '助力', '交貨', '算法', '計(jì)量', '序列', '較差', '繁多', '過多', '假設(shè)', '縮短', '營銷']
```
函數(shù)入?yún)ⅲ?/p>
topK
:返回TF-IDF權(quán)重最大的關(guān)鍵詞的數(shù)目(默認(rèn)值為20)withWeight
是否一并返回關(guān)鍵詞權(quán)重值,默認(rèn)值為 FalseallowPOS
僅包括指定詞性的詞,默認(rèn)值為空,即不篩選
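For example, a minimal sketch of withWeight and allowPOS in action; the parameter values here are chosen purely for illustration:

```python
import jieba.analyse

# Illustrative values: keep the top 5 keywords, return their TF-IDF weights,
# and restrict results to nouns ('n') and verbal nouns ('vn')
for word, weight in jieba.analyse.extract_tags(
        extracted_sentences, topK=5, withWeight=True, allowPOS=('n', 'vn')):
    print(word, weight)
```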
The inverse document frequency (IDF) corpus used for keyword extraction can be switched to a custom corpus:
Usage: jieba.analyse.set_idf_path(file_name)
# file_name is the path to the custom corpus
Sample custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/idf.txt.big
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_idfpath.py
The stop-word corpus used for keyword extraction can likewise be switched to a custom corpus:
Usage: jieba.analyse.set_stop_words(file_name)
# file_name is the path to the custom corpus
Sample custom corpus: https://github.com/fxsjy/jieba/blob/master/extra_dict/stop_words.txt
Usage example: https://github.com/fxsjy/jieba/blob/master/test/extract_tags_stop_words.py
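Putting the two together, a minimal sketch; the file names my_idf.txt and my_stopwords.txt are placeholders for files in the formats of the samples linked above:

```python
import jieba.analyse

# Placeholder file names: substitute your own corpora
jieba.analyse.set_idf_path('my_idf.txt')          # each line: "word idf_value"
jieba.analyse.set_stop_words('my_stopwords.txt')  # each line: one stop word
print(jieba.analyse.extract_tags(extracted_sentences, topK=10))
```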
2. TextRank-based Chinese keyword extraction: implemented with the jieba package
extracted_sentences="隨著企業(yè)持續(xù)產(chǎn)生的商品銷量,其數(shù)據(jù)對于自身營銷規(guī)劃、市場分析、物流規(guī)劃都有重要意義。但是銷量預(yù)測的影響因素繁多,傳統(tǒng)的基于統(tǒng)計(jì)的計(jì)量模型,比如時(shí)間序列模型等由于對現(xiàn)實(shí)的假設(shè)情況過多,導(dǎo)致預(yù)測結(jié)果較差。因此需要更加優(yōu)秀的智能AI算法,以提高預(yù)測的準(zhǔn)確性,從而助力企業(yè)降低庫存成本、縮短交貨周期、提高企業(yè)抗風(fēng)險(xiǎn)能力。" import jieba.analyse print(jieba.analyse.textrank(extracted_sentences, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v')))
Output:
```
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.451 seconds.
Prefix dict has been built successfully.
['企業(yè)', '預(yù)測', '模型', '規(guī)劃', '提高', '銷量', '比如', '時(shí)間', '市場', '分析', '降低庫存', '成本', '縮短', '交貨', '影響', '因素', '情況', '計(jì)量', '現(xiàn)實(shí)', '數(shù)據(jù)']
```
The parameters are the same as in Section 1, except that allowPOS has a different default value.
TextRank builds an undirected weighted graph using a fixed window size (default 5, adjustable via the span parameter), with words as nodes and word co-occurrence within the window as edges. It then scores the nodes in the graph in a way similar to PageRank; a sketch of the idea follows the reference below.
For a deeper look at how PageRank is computed and why it works, see my earlier post: cs224w(圖機(jī)器學(xué)習(xí))2021冬季課程學(xué)習(xí)筆記4 Link Analysis: PageRank (Graph as Matrix)_諸神緘默不語的博客-CSDN博客.
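To make the graph construction concrete, here is a minimal TextRank-style sketch (my own illustration, not jieba's internal code), assuming networkx is installed; the token list is a toy stand-in for a segmented, POS-filtered document:

```python
import networkx as nx

# Toy token list standing in for a segmented, POS-filtered document
tokens = ['企業(yè)', '銷量', '預(yù)測', '模型', '企業(yè)', '規(guī)劃', '預(yù)測', '算法']
window = 5  # co-occurrence window, matching the default described above

# Build the undirected weighted co-occurrence graph
graph = nx.Graph()
for i, u in enumerate(tokens):
    for v in tokens[i + 1:i + window]:
        if u != v:
            # accumulate co-occurrence counts as edge weights
            old = graph.get_edge_data(u, v, default={'weight': 0})['weight']
            graph.add_edge(u, v, weight=old + 1)

# Score nodes PageRank-style and list words from highest to lowest score
scores = nx.pagerank(graph, weight='weight')
print(sorted(scores, key=scores.get, reverse=True))
```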
3. TextRank-based Chinese keyword extraction (implemented with the textrank_zh package)
To be added.
4. Chinese word importance scoring (underlying algorithm unspecified): implemented with LAC
The numeric values in the last list of the output are the importance scores of the corresponding words.
extracted_sentences="隨著企業(yè)持續(xù)產(chǎn)生的商品銷量,其數(shù)據(jù)對于自身營銷規(guī)劃、市場分析、物流規(guī)劃都有重要意義。但是銷量預(yù)測的影響因素繁多,傳統(tǒng)的基于統(tǒng)計(jì)的計(jì)量模型,比如時(shí)間序列模型等由于對現(xiàn)實(shí)的假設(shè)情況過多,導(dǎo)致預(yù)測結(jié)果較差。因此需要更加優(yōu)秀的智能AI算法,以提高預(yù)測的準(zhǔn)確性,從而助力企業(yè)降低庫存成本、縮短交貨周期、提高企業(yè)抗風(fēng)險(xiǎn)能力。" from LAC import LAC lac=LAC(mode='rank') seg_result=lac.run(extracted_sentences) #以Unicode字符串為入?yún)? print(seg_result)
Output:
```
W0625 20:13:22.369424 33363 init.cc:157] AVX is available, Please re-compile on local machine
(Paddle GLOG warnings and "Running analysis / Running IR pass" logs omitted)
[['隨著', '企業(yè)', '持續(xù)', '產(chǎn)生', '的', '商品', '銷量', ',', '其', '數(shù)據(jù)', '對于', '自身', '營銷', '規(guī)劃', '、', '市場分析', '、', '物流', '規(guī)劃', '都', '有', '重要', '意義', '。', '但是', '銷量', '預(yù)測', '的', '影響', '因素', '繁多', ',', '傳統(tǒng)', '的', '基于', '統(tǒng)計(jì)', '的', '計(jì)量', '模型', ',', '比如', '時(shí)間', '序列', '模型', '等', '由于', '對', '現(xiàn)實(shí)', '的', '假設(shè)', '情況', '過多', ',', '導(dǎo)致', '預(yù)測', '結(jié)果', '較差', '。', '因此', '需要', '更加', '優(yōu)秀', '的', '智能', 'AI算法', ',', '以', '提高', '預(yù)測', '的', '準(zhǔn)確性', ',', '從而', '助力', '企業(yè)', '降低', '庫存', '成本', '、', '縮短', '交貨', '周期', '、', '提高', '企業(yè)', '抗', '風(fēng)險(xiǎn)', '能力', '。'],
 ['p', 'n', 'vd', 'v', 'u', 'n', 'n', 'w', 'r', 'n', 'p', 'r', 'vn', 'n', 'w', 'n', 'w', 'n', 'n', 'd', 'v', 'a', 'n', 'w', 'c', 'n', 'vn', 'u', 'vn', 'n', 'a', 'w', 'a', 'u', 'p', 'v', 'u', 'vn', 'n', 'w', 'v', 'n', 'n', 'n', 'u', 'p', 'p', 'n', 'u', 'vn', 'n', 'a', 'w', 'v', 'vn', 'n', 'a', 'w', 'c', 'v', 'd', 'a', 'u', 'n', 'nz', 'w', 'p', 'v', 'vn', 'u', 'n', 'w', 'c', 'v', 'n', 'v', 'n', 'n', 'w', 'v', 'vn', 'n', 'w', 'v', 'n', 'v', 'n', 'n', 'w'],
 [0, 1, 1, 1, 0, 2, 2, 0, 1, 2, 0, 1, 2, 2, 0, 2, 0, 2, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 2, 1, 2, 0, 2, 0, 0, 2, 0, 2, 1, 0, 1, 2, 2, 1, 0, 0, 0, 2, 0, 2, 1, 2, 0, 1, 2, 2, 2, 0, 0, 1, 1, 2, 0, 2, 2, 0, 0, 2, 2, 0, 2, 0, 0, 2, 1, 1, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 0]]
```
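To read the result more easily, the three parallel lists (words, POS tags, importance scores) can be zipped together; this is a small convenience snippet of mine, not part of the LAC API:

```python
# seg_result is [words, pos_tags, importance_scores] in rank mode
words, tags, ranks = seg_result
# print the ten highest-scoring words with their POS tags
for word, tag, rank in sorted(zip(words, tags, ranks), key=lambda x: -x[2])[:10]:
    print(word, tag, rank)
```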
5. KeyBERT
```python
# pandas for reading tabular data
import pandas as pd
# Bag-of-words vectorizer. CountVectorizer can be swapped in for TfidfVectorizer
# (TF-IDF, term frequency-inverse document frequency); adjust the surrounding
# code accordingly. In my tests TF-IDF works better.
from sklearn.feature_extraction.text import TfidfVectorizer
# BERT sentence-embedding model
from sentence_transformers import SentenceTransformer
# To compare candidates with the document we use cosine similarity between
# vectors, since it behaves well in high dimensions
from sklearn.metrics.pairwise import cosine_similarity
# Silence convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

# Load the dataset
test = pd.read_csv('./基于論文摘要的文本分類與關(guān)鍵詞抽取挑戰(zhàn)賽公開數(shù)據(jù)/test.csv')
test['title'] = test['title'].fillna('')
test['abstract'] = test['abstract'].fillna('')
test['text'] = test['title'] + ' ' + test['abstract']

# Stop-word file: https://pan.baidu.com/s/1mQ50_gsKZHWERHzfiDnheg?pwd=qzuc
# Stop words filter out frequent words that say little about the article
stops = [i.strip() for i in open(r'stop.txt', encoding='utf-8').readlines()]

# A multilingual sentence-embedding model that performs well on similarity
# tasks, which is exactly what keyword/keyphrase extraction needs.
# Note: transformer models have a token-length limit, so very long documents
# may raise errors; in that case split the document into shorter passages and
# mean-pool the resulting vectors.
model = SentenceTransformer(r'xlm-r-distilroberta-base-paraphrase-v1')

# Extract keywords. The idea: embed the candidate phrases from the text and
# compare them with the embedding of the title, since an article's keywords
# usually resemble its title; cosine similarity ranks the candidates.
test_words = []
for row in test.iterrows():
    # n_gram_range controls the candidate phrase length; (3, 3) would produce
    # three-word phrases
    n_gram_range = (2, 2)
    # Use TF-IDF to generate candidate keywords
    count = TfidfVectorizer(ngram_range=n_gram_range, stop_words=stops).fit([row[1].text])
    candidates = count.get_feature_names_out()
    # Turn the title and the candidate keywords/keyphrases into numerical data
    # with BERT
    title_embedding = model.encode([row[1].title])
    candidate_embeddings = model.encode(candidates)
    # Change this parameter to adjust the number of keywords
    top_n = 15
    # Rank candidates by cosine similarity to the article title
    distances = cosine_similarity(title_embedding, candidate_embeddings)
    keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]
    if len(keywords) == 0:
        keywords = ['A', 'B']
    test_words.append('; '.join(keywords))

test['Keywords'] = test_words
test[['uuid', 'Keywords']].to_csv('submit_task2.csv', index=None)
```
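A closely related pipeline is packaged in the keybert library (it compares candidates against the whole document rather than the title). A minimal sketch, assuming keybert is installed (pip install keybert) and with doc_text as a placeholder for any input document:

```python
from keybert import KeyBERT

# doc_text is a placeholder; any input string works
kw_model = KeyBERT(model='xlm-r-distilroberta-base-paraphrase-v1')
keywords = kw_model.extract_keywords(doc_text, keyphrase_ngram_range=(2, 2), top_n=15)
print(keywords)  # list of (phrase, similarity) pairs
```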
6. Frequency-based keyword extraction
This covers unigrams and bigrams; the code below counts bigrams, and a unigram variant is sketched after it.
```python
import pandas as pd
# NLTK tokenizer and n-gram helper
from nltk import word_tokenize, ngrams

# Stop words: frequent words that carry little information about the article
stops = [
    'will', 'can', "couldn't", 'same', 'own', "needn't", 'between', "shan't", 'very',
    'so', 'over', 'in', 'have', 'the', 's', 'didn', 'few', 'should', 'of', 'that',
    'don', 'weren', 'into', "mustn't", 'other', 'from', "she's", 'hasn', "you're",
    'ain', 'ours', 'them', 'he', 'hers', 'up', 'below', 'won', 'out', 'through',
    'than', 'this', 'who', "you've", 'on', 'how', 'more', 'being', 'any', 'no',
    'mightn', 'for', 'again', 'nor', 'there', 'him', 'was', 'y', 'too', 'now',
    'whom', 'an', 've', 'or', 'itself', 'is', 'all', "hasn't", 'been', 'themselves',
    'wouldn', 'its', 'had', "should've", 'it', "you'll", 'are', 'be', 'when',
    "hadn't", "that'll", 'what', 'while', 'above', 'such', 'we', 't', 'my', 'd',
    'i', 'me', 'at', 'after', 'am', 'against', 'further', 'just', 'isn', 'haven',
    'down', "isn't", "wouldn't", 'some', "didn't", 'ourselves', 'their', 'theirs',
    'both', 're', 'her', 'ma', 'before', "don't", 'having', 'where', 'shouldn',
    'under', 'if', 'as', 'myself', 'needn', 'these', 'you', 'with', 'yourself',
    'those', 'each', 'herself', 'off', 'to', 'not', 'm', "it's", 'does', "weren't",
    "aren't", 'were', 'aren', 'by', 'doesn', 'himself', 'wasn', "you'd", 'once',
    'because', 'yours', 'has', "mightn't", 'they', 'll', "haven't", 'but', 'couldn',
    'a', 'do', 'hadn', "doesn't", 'your', 'she', 'yourselves', 'o', 'our', 'here',
    'and', 'his', 'most', 'about', 'shan', "wasn't", 'then', 'only', 'mustn',
    'doing', 'during', 'why', "won't", 'until', 'did', "shouldn't", 'which'
]

# Select keyword phrases by frequency
def extract_keywords_by_freq(title, abstract):
    # collect bigrams from the lowercased title and abstract
    ngrams_count = list(ngrams(word_tokenize(title.lower()), 2)) + list(ngrams(word_tokenize(abstract.lower()), 2))
    ngrams_count = pd.DataFrame(ngrams_count)
    # drop bigrams containing stop words or very short tokens
    ngrams_count = ngrams_count[~ngrams_count[0].isin(stops)]
    ngrams_count = ngrams_count[~ngrams_count[1].isin(stops)]
    ngrams_count = ngrams_count[ngrams_count[0].apply(len) > 3]
    ngrams_count = ngrams_count[ngrams_count[1].apply(len) > 3]
    # count each remaining phrase and keep those occurring more than once
    ngrams_count['phrase'] = ngrams_count[0] + ' ' + ngrams_count[1]
    ngrams_count = ngrams_count['phrase'].value_counts()
    ngrams_count = ngrams_count[ngrams_count > 1]
    return list(ngrams_count.index)[:5]

# Extract keywords for the test set (reuses the `test` DataFrame from Section 5)
test_words = []
for row in test.iterrows():
    # extract keywords from each row's title and abstract
    prediction_keywords = extract_keywords_by_freq(row[1].title, row[1].abstract)
    # title-case the keywords
    prediction_keywords = [x.title() for x in prediction_keywords]
    # fall back if no keywords were found
    if len(prediction_keywords) == 0:
        prediction_keywords = ['A', 'B']
    test_words.append('; '.join(prediction_keywords))

test['Keywords'] = test_words
test[['uuid', 'Keywords', 'label']].to_csv('submit_task2.csv', index=None)
```
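A unigram variant, sketched under the same assumptions (the stops list and the NLTK tokenizer from the code above):

```python
from collections import Counter
from nltk import word_tokenize

def extract_unigram_keywords(title, abstract, top_n=5):
    # tokenize and lowercase, then drop stop words and very short tokens
    tokens = word_tokenize(title.lower()) + word_tokenize(abstract.lower())
    tokens = [t for t in tokens if t not in stops and len(t) > 3]
    # keep tokens that occur more than once, most frequent first
    counts = Counter(tokens)
    return [t.title() for t, c in counts.most_common(top_n) if c > 1]
```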
Further reading
- I haven't read this one yet, but it is on the keyword-extraction topic: (2023 APSIT) A Comparative Study on Keyword Extraction and Generation of Synonyms in Natural Language Processing, which compares rule-based models, statistical models, and extreme learning machine (ELM) models.
- Term extraction, a task quite close to keyword extraction: 術(shù)語提取算法綜述 | 集智斑圖 (a survey of term-extraction algorithms).
The Termolator: an open-source term-extraction algorithm. It uses TF-IDF-like logic to find terms that appear frequently in the target domain but rarely in general documents, then reranks the terms with linguistic rules and stemming to produce the final result.