
Getting Started with word2vec in Python

Updated: 2020-06-09 11:24:09   Author: 小拳頭
This article walks through a first word2vec experiment in Python with gensim: preparing a segmented corpus, training a model, and querying it.

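I. Preface

At first the word2vec setup looked complicated: I spent a long time installing Cygwin without really understanding it. Then it occurred to me that I did not need the C version at all and should use a Python one, which is how I found gensim. Installing the gensim package is enough to use word2vec. gensim implements both word2vec architectures, CBOW and skip-gram, selected with the sg parameter. The default, sg=0, is CBOW, which is what the training log later in this article reports.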
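For reference, the two architectures differ mainly in how they turn a window of text into training examples: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. The sketch below is a simplified illustration, not gensim's implementation. The helper names and the sample tokens are made up, and it uses a fixed window, whereas word2vec implementations typically shrink the window at random for each position.

```python
# Simplified illustration of how skip-gram and CBOW form training examples.
# skipgram_pairs and cbow_examples are hypothetical helpers, not gensim APIs.

def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs: one pair per context word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=5):
    """Return (context_words, center) examples: one example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Made-up, already segmented sample sentence.
sample = [u"國家", u"加強", u"控煙", u"執法"]

for center, context in skipgram_pairs(sample, window=1):
    print(center, context)

for context, center in cbow_examples(sample, window=1):
    print(context, center)
```

With the same window, skip-gram produces more, smaller examples than CBOW, which is one reason it is usually slower to train.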

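II. Preparing the Corpus

With gensim installed, I found that most online tutorials simply pass in a txt file, but few blogs explain what that file looks like or what format it uses, and none offer a sample file for download. After digging further, I learned that the txt file is simply a very large body of text that has already been segmented into words. For my own corpus I took 7,000 news articles that I had scraped earlier with a crawler and segmented them. Note that the words must be separated by spaces.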
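To make the expected format concrete, here is a minimal sketch that writes a tiny pre-segmented file and reads it back as a flat list of whitespace-separated tokens. The file name sample_corpus.txt, the helper read_tokens, and the two hand-segmented sentences are made up for illustration; a real corpus would be far larger.

```python
import io
import os
import tempfile

# Two made-up sentences, already segmented: words are separated by single spaces.
sample_lines = [
    u"國家 加強 控煙 執法",
    u"這 本 教材 質量 不錯",
]

# Write the sample to a temporary file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "sample_corpus.txt")
with io.open(path, "w", encoding="utf-8") as f:
    for text in sample_lines:
        f.write(text + u"\n")


def read_tokens(corpus_path):
    """Read a segmented corpus as one flat list of tokens."""
    with io.open(corpus_path, encoding="utf-8") as f:
        # split() with no argument splits on any whitespace, newlines included
        return f.read().split()


tokens = read_tokens(path)
print(len(tokens))
print(u" ".join(tokens[:4]))
```

Because the file is consumed as a stream of whitespace-separated words, spaces and line breaks both work as separators; what matters is that no two words run together.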

Segmentation here is done with jieba (結巴分詞).

The code for this step:

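import io
import jieba

# io.open with an explicit encoding behaves the same on Python 2 and 3.
# UTF-8 is an assumption about how fenci.txt was saved.
with io.open("fenci.txt", encoding="utf-8") as f1, \
     io.open("fenci_result.txt", "a", encoding="utf-8") as f2:
  for line in f1:
    # str.replace returns a new string, so the result must be assigned back
    line = line.replace('\t', '').replace('\n', '').replace(' ', '')
    if not line:
      continue  # skip empty lines
    seg_list = jieba.cut(line, cut_all=False)  # precise mode
    # Join words with spaces and end each input line with a newline,
    # so the last word of one line never runs into the first word of the next
    f2.write(u" ".join(seg_list) + u"\n")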

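Another point to note is that the corpus must contain a lot of text; the corpora you find online are often several gigabytes. At first I used a single news article as the corpus and the results were poor: every output was 0. I then switched to 7,000 news articles. After segmentation, fenci_result.txt came to about 20 MB, which is still small but enough to get preliminary results.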
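One likely reason a single article yields only zeros: the training log in this article reports min_count=5, meaning words seen fewer than five times are dropped from the vocabulary. Looking up a dropped word raises KeyError, which the training script in the next section turns into 0. Below is a minimal sketch of that filtering step; build_vocab is a hypothetical helper written for illustration (not gensim's code), and the token list is made up.

```python
from collections import Counter


def build_vocab(tokens, min_count=5):
    """Keep only words that occur at least min_count times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


# A made-up "one article" corpus: no word appears five times.
tiny_corpus = u"國家 發布 政策 國家 支持 控煙".split()
tiny_vocab = build_vocab(tiny_corpus)
print(len(tiny_vocab))

# Repeating the text five times stands in for a larger corpus:
# every word now reaches the threshold.
larger_corpus = tiny_corpus * 5
larger_vocab = build_vocab(larger_corpus)
print(u"國家" in larger_vocab)
```

With the tiny corpus the vocabulary is empty, so any similarity query would fail; with more text the same words survive the cut.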

III. Training a Model with gensim's word2vec

The code is as follows:

# -*- coding: utf-8 -*-
# Written for Python 2.7 and the gensim release current in late 2016.
# In gensim 4.x, size is named vector_size and the query methods below live on model.wv.
from gensim.models import word2vec
import logging

# Main program
logging.basicConfig(format='%(asctime)s:%(levelname)s: %(message)s', level=logging.INFO)
sentences = word2vec.Text8Corpus(u"fenci_result.txt") # load the corpus
model = word2vec.Word2Vec(sentences, size=200) # train with the defaults: sg=0 (CBOW), window=5

print model
# Similarity between two words: 國家 ("country") and 國務院 ("State Council")
try:
  y1 = model.similarity(u"國家", u"國務院")
except KeyError:
  y1 = 0 # a word missing from the vocabulary falls back to 0
print u"【國家】和【國務院】的相似度為:", y1
print "-----\n"

# The 20 words most related to 控煙 ("tobacco control")
y2 = model.most_similar(u"控煙", topn=20)
print u"和【控煙】最相關的詞有:\n"
for item in y2:
  print item[0], item[1]
print "-----\n"

# Analogy: 書 ("book") is to 不錯 ("good") as 質量 ("quality") is to ?
print u"書-不錯,質量-"
y3 = model.most_similar([u'質量', u'不錯'], [u'書'], topn=3)
for item in y3:
  print item[0], item[1]
print "----\n"

# Find the word that does not belong with the others
y4 = model.doesnt_match(u"書 書籍 教材 很".split())
print u"不合群的詞:", y4
print "-----\n"

# Save the model for reuse
model.save(u"書評.model")
# Corresponding load call
# model_2 = word2vec.Word2Vec.load(u"書評.model")

# Store the word vectors in a format the C tool can parse
# model.save_word2vec_format(u"書評.model.bin", binary=True)
# Corresponding load call
# model_3 = word2vec.Word2Vec.load_word2vec_format(u"書評.model.bin", binary=True)
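The queries in the script all rest on cosine similarity between word vectors. The sketch below shows the idea on made-up 2-dimensional vectors (the real model uses 200 dimensions). The helpers cosine, nearest, and odd_one_out are hypothetical stand-ins for similarity, most_similar, and doesnt_match. They follow the general approach as I understand it and are not gensim's code.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


def odd_one_out(words, vectors):
    """Return the word least similar to the mean of the unit vectors."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


# Made-up vectors: three book-related words point one way, 很 ("very") another.
toy_vectors = {
    u"書": [1.0, 0.0],
    u"書籍": [0.9, 0.1],
    u"教材": [0.8, 0.3],
    u"很": [0.0, 1.0],
}

print(cosine(toy_vectors[u"書"], toy_vectors[u"很"]))
print([word for word, _ in nearest(u"書", toy_vectors)])
print(odd_one_out([u"書", u"書籍", u"教材", u"很"], toy_vectors))
```

Scores close to 1 mean the vectors point in nearly the same direction; the word that does not belong is the one pointing furthest from the group's average direction.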

The output is as follows:

"D:\program files\python2.7.0\python.exe" "D:/pycharm workspace/畢設/cluster_test/word2vec.py"
D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
 warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
 warnings.warn("Pattern library is not installed, lemmatization won't be available.")
2016-12-12 15:37:43,331: INFO: collecting all words and their counts
2016-12-12 15:37:43,332: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-12-12 15:37:45,236: INFO: collected 99865 word types from a corpus of 3561156 raw words and 357 sentences
2016-12-12 15:37:45,236: INFO: Loading a fresh vocabulary
2016-12-12 15:37:45,413: INFO: min_count=5 retains 29982 unique words (30% of original 99865, drops 69883)
2016-12-12 15:37:45,413: INFO: min_count=5 leaves 3444018 word corpus (96% of original 3561156, drops 117138)
2016-12-12 15:37:45,602: INFO: deleting the raw counts dictionary of 99865 items
2016-12-12 15:37:45,615: INFO: sample=0.001 downsamples 29 most-common words
2016-12-12 15:37:45,615: INFO: downsampling leaves estimated 2804247 word corpus (81.4% of prior 3444018)
2016-12-12 15:37:45,615: INFO: estimated required memory for 29982 words and 200 dimensions: 62962200 bytes
2016-12-12 15:37:45,746: INFO: resetting layer weights
2016-12-12 15:37:46,782: INFO: training model with 3 workers on 29982 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2016-12-12 15:37:46,782: INFO: expecting 357 sentences, matching count from corpus used for vocabulary survey
2016-12-12 15:37:47,818: INFO: PROGRESS: at 1.96% examples, 267531 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:48,844: INFO: PROGRESS: at 3.70% examples, 254229 words/s, in_qsize 3, out_qsize 1
2016-12-12 15:37:49,871: INFO: PROGRESS: at 5.99% examples, 273509 words/s, in_qsize 3, out_qsize 1
2016-12-12 15:37:50,867: INFO: PROGRESS: at 8.18% examples, 281557 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:51,872: INFO: PROGRESS: at 10.20% examples, 280918 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:37:52,898: INFO: PROGRESS: at 12.44% examples, 284750 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:53,911: INFO: PROGRESS: at 14.17% examples, 278948 words/s, in_qsize 0, out_qsize 0
2016-12-12 15:37:54,956: INFO: PROGRESS: at 16.47% examples, 284101 words/s, in_qsize 2, out_qsize 1
2016-12-12 15:37:55,934: INFO: PROGRESS: at 18.60% examples, 285781 words/s, in_qsize 6, out_qsize 1
2016-12-12 15:37:56,933: INFO: PROGRESS: at 20.84% examples, 288045 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:37:57,973: INFO: PROGRESS: at 23.03% examples, 289083 words/s, in_qsize 6, out_qsize 2
2016-12-12 15:37:58,993: INFO: PROGRESS: at 24.87% examples, 285990 words/s, in_qsize 6, out_qsize 1
2016-12-12 15:38:00,006: INFO: PROGRESS: at 27.17% examples, 288266 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:01,081: INFO: PROGRESS: at 29.52% examples, 290197 words/s, in_qsize 1, out_qsize 2
2016-12-12 15:38:02,065: INFO: PROGRESS: at 31.88% examples, 292344 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:03,188: INFO: PROGRESS: at 34.01% examples, 291356 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:04,161: INFO: PROGRESS: at 36.02% examples, 290805 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:05,174: INFO: PROGRESS: at 38.26% examples, 292174 words/s, in_qsize 3, out_qsize 0
2016-12-12 15:38:06,214: INFO: PROGRESS: at 40.56% examples, 293297 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:07,201: INFO: PROGRESS: at 42.69% examples, 293428 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:08,266: INFO: PROGRESS: at 44.65% examples, 292108 words/s, in_qsize 1, out_qsize 1
2016-12-12 15:38:09,295: INFO: PROGRESS: at 46.83% examples, 292097 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:10,315: INFO: PROGRESS: at 49.13% examples, 292968 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:11,326: INFO: PROGRESS: at 51.37% examples, 293621 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:38:12,367: INFO: PROGRESS: at 53.39% examples, 292777 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:13,348: INFO: PROGRESS: at 55.35% examples, 292187 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:38:14,349: INFO: PROGRESS: at 57.31% examples, 291656 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:15,374: INFO: PROGRESS: at 59.50% examples, 292019 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:16,403: INFO: PROGRESS: at 61.68% examples, 292318 words/s, in_qsize 4, out_qsize 2
2016-12-12 15:38:17,401: INFO: PROGRESS: at 63.81% examples, 292275 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:18,410: INFO: PROGRESS: at 65.71% examples, 291495 words/s, in_qsize 4, out_qsize 1
2016-12-12 15:38:19,433: INFO: PROGRESS: at 67.62% examples, 290443 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:20,473: INFO: PROGRESS: at 69.58% examples, 289655 words/s, in_qsize 6, out_qsize 2
2016-12-12 15:38:21,589: INFO: PROGRESS: at 71.71% examples, 289388 words/s, in_qsize 2, out_qsize 2
2016-12-12 15:38:22,533: INFO: PROGRESS: at 73.78% examples, 289366 words/s, in_qsize 0, out_qsize 1
2016-12-12 15:38:23,611: INFO: PROGRESS: at 75.46% examples, 287542 words/s, in_qsize 5, out_qsize 1
2016-12-12 15:38:24,614: INFO: PROGRESS: at 77.25% examples, 286609 words/s, in_qsize 3, out_qsize 0
2016-12-12 15:38:25,609: INFO: PROGRESS: at 79.33% examples, 286732 words/s, in_qsize 5, out_qsize 1
2016-12-12 15:38:26,621: INFO: PROGRESS: at 81.40% examples, 286595 words/s, in_qsize 2, out_qsize 0
2016-12-12 15:38:27,625: INFO: PROGRESS: at 83.53% examples, 286807 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:28,683: INFO: PROGRESS: at 85.32% examples, 285651 words/s, in_qsize 5, out_qsize 3
2016-12-12 15:38:29,729: INFO: PROGRESS: at 87.56% examples, 286175 words/s, in_qsize 6, out_qsize 1
2016-12-12 15:38:30,706: INFO: PROGRESS: at 89.86% examples, 286920 words/s, in_qsize 5, out_qsize 0
2016-12-12 15:38:31,714: INFO: PROGRESS: at 92.10% examples, 287368 words/s, in_qsize 6, out_qsize 0
2016-12-12 15:38:32,756: INFO: PROGRESS: at 94.40% examples, 288070 words/s, in_qsize 4, out_qsize 2
2016-12-12 15:38:33,755: INFO: PROGRESS: at 96.30% examples, 287543 words/s, in_qsize 1, out_qsize 0
2016-12-12 15:38:34,802: INFO: PROGRESS: at 98.71% examples, 288375 words/s, in_qsize 4, out_qsize 0
2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 2 more threads
2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 1 more threads
Word2Vec(vocab=29982, size=200, alpha=0.025)
【國家】和【國務院】的相似度為: 0.387535493256
-----
2016-12-12 15:38:35,293: INFO: worker thread finished; awaiting finish of 0 more threads
2016-12-12 15:38:35,293: INFO: training on 17805780 raw words (14021191 effective words) took 48.5s, 289037 effective words/s
2016-12-12 15:38:35,293: INFO: precomputing L2-norms of word weight vectors
和【控煙】最相關的詞有:
禁煙 0.6038454175
防煙 0.585186183453
執(zhí)行 0.530897378922
煙控 0.516572892666
廣而告之 0.508533298969
履約 0.507428050041
執(zhí)法 0.494115233421
禁煙令 0.471616715193
修法 0.465247869492
該項 0.457907706499
落實 0.457776963711
控制 0.455987215042
這方面 0.450040221214
立法 0.44820779562
控煙辦 0.436062157154
執行力 0.432559013367
控煙會 0.430508673191
進展 0.430286765099
監(jiān)管 0.429748386145
懲罰 0.429243773222
-----
書-不錯,質量-
生存 0.613928854465
穩(wěn)定 0.595371186733
整體 0.592055797577
----
不合群的詞: 很
-----
2016-12-12 15:38:35,515: INFO: saving Word2Vec object under 書評.model, separately None
2016-12-12 15:38:35,515: INFO: not storing attribute syn0norm
2016-12-12 15:38:35,515: INFO: not storing attribute cum_table
2016-12-12 15:38:36,490: INFO: saved 書評.model
Process finished with exit code 0
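
Several numbers in the log can be cross-checked by hand. The sketch below assumes that Text8Corpus cuts the token stream into chunks of at most 10,000 words and that training ran gensim's default of 5 passes over the corpus. Neither value is stated in the article, but both match the log.

```python
import math

# Figures copied from the log above.
raw_words = 3561156          # "corpus of 3561156 raw words"
word_types = 99865           # "collected 99865 word types"
kept_types = 29982           # "min_count=5 retains 29982 unique words"
kept_words = 3444018         # "min_count=5 leaves 3444018 word corpus"
downsampled_words = 2804247  # "downsampling leaves estimated 2804247 word corpus"

# Assumptions, not stated in the article.
chunk_size = 10000  # assumed maximum words per Text8Corpus "sentence"
epochs = 5          # assumed default number of training passes

sentences = int(math.ceil(raw_words / float(chunk_size)))
trained_raw_words = raw_words * epochs
dropped_types = word_types - kept_types
dropped_words = raw_words - kept_words
expected_effective = downsampled_words * epochs

print(sentences)           # 357, matching "357 sentences"
print(trained_raw_words)   # 17805780, matching "training on 17805780 raw words"
print(dropped_types)       # 69883, matching "drops 69883"
print(dropped_words)       # 117138, matching "drops 117138"
print(expected_effective)  # 14021235, close to the reported 14021191
```

The small gap in the last figure is expected, since downsampling of frequent words is randomized and the log's 2804247 is only an estimate.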

