
Visualizing Word2Vec

Updated: 2022-11-02 09:30:15  Author: Eureka丶

This article shows how to visualize Word2Vec word vectors. I hope it is a useful reference; corrections and pointers to anything I have overlooked are welcome.

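Introduction to Word2Vec

A core problem in natural language processing is how to quantify words and expressions so that they can be used inside a model. This mapping from language elements to numerical representations is called word embedding.

Word2Vec is a word embedding procedure. The idea is fairly simple: loop through the corpus sentence by sentence and fit a model that predicts the current word from its neighboring words within a predefined window.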

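To do this it uses a neural network, but in the end we do not actually use the predictions. Once the model is trained, we keep only the weights of the hidden layer. In the pretrained model used later in this article, the hidden layer has 300 units, so each word is represented by a 300-dimensional vector.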
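To make the role of the hidden layer concrete, here is a minimal NumPy sketch. It assumes a tiny four-word vocabulary and a randomly initialized weight matrix (both invented for illustration, not taken from a trained model). It shows that feeding a one-hot input through the hidden layer simply selects one row of the weight matrix, which is why those weights can serve directly as word vectors.

```python
import numpy as np

# Hypothetical toy vocabulary; a real model has thousands or millions of words.
vocab = ["apple", "orange", "banana", "delicious"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# 300 in the Google News model; 3 here keeps the example small.
embedding_dim = 3
rng = np.random.default_rng(0)
# Stand-in for trained hidden-layer weights: one row per vocabulary word.
hidden_weights = rng.normal(size=(len(vocab), embedding_dim))


def one_hot(word):
    """Return the one-hot input vector for a word."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec


def embed(word):
    """Look up the word vector directly as a row of the weight matrix."""
    return hidden_weights[word_to_index[word]]


# Multiplying the one-hot input by the weights gives the same row.
hidden_activation = one_hot("orange") @ hidden_weights
print(np.allclose(hidden_activation, embed("orange")))
```

In practice libraries skip the matrix product and index the row directly, as `embed` does.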

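Note that two words do not have to appear near each other to be considered similar. If two words never occur in the same sentence but are usually surrounded by the same words, they very likely have similar meanings.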
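The idea that shared context implies similarity can be illustrated without any neural network. The sketch below is a count-based stand-in, not Word2Vec itself: it assumes a made-up three-sentence corpus, counts which words appear near each word, and compares the resulting count vectors with cosine similarity.

```python
import numpy as np

# Made-up corpus: "apple" and "orange" never share a sentence.
sentences = [
    ["apple", "is", "delicious"],
    ["orange", "is", "delicious"],
    ["car", "goes", "fast"],
]
vocab = sorted({word for sentence in sentences for word in sentence})
index = {word: i for i, word in enumerate(vocab)}


def context_counts(sentences, window=2):
    """Count how often each word appears within `window` positions of another."""
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[sentence[j]]] += 1
    return counts


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


counts = context_counts(sentences)
sim_fruit = cosine(counts[index["apple"]], counts[index["orange"]])
sim_mixed = cosine(counts[index["apple"]], counts[index["car"]])
print(sim_fruit, sim_mixed)
```

Here "apple" and "orange" have identical context counts even though they never co-occur, while "apple" and "car" share no context at all.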

Word2Vec offers two modeling approaches, skip-gram and continuous bag of words (CBOW). Each has its own strengths and its own sensitivity to certain hyperparameters.
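The difference between the two approaches is easiest to see in the training examples they generate. The following pure-Python sketch uses hypothetical helper names and a toy sentence, and it only builds the examples; it does not train anything. CBOW predicts the center word from all its context words, while skip-gram predicts each context word from the center word. In gensim the choice is made with the `sg` parameter of `Word2Vec` (`sg=0` for CBOW, which is the default, and `sg=1` for skip-gram).

```python
def context_window(tokens, position, window):
    """Return the words within `window` positions of `position`, excluding it."""
    start = max(0, position - window)
    return tokens[start:position] + tokens[position + 1:position + 1 + window]


def cbow_examples(tokens, window=2):
    """CBOW: (list of context words) -> target word."""
    return [(context_window(tokens, i, window), target)
            for i, target in enumerate(tokens)]


def skipgram_examples(tokens, window=2):
    """Skip-gram: center word -> one context word per example."""
    return [(center, context)
            for i, center in enumerate(tokens)
            for context in context_window(tokens, i, window)]


tokens = ["apple", "is", "delicious"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
print(cbow)
print(skipgram)
```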

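Of course, the word vectors you get depend on the corpus the model was trained on. In general you need a very large corpus; pretrained versions exist that were trained on Wikipedia or on news articles from various sources. The vectors we will use were trained on Google News.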

Simple visualization

Let us define a very small corpus and try a simple visualization of Word2Vec:

# Jupyter-only magic; remove this line when running as a plain script
%matplotlib inline
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
 
# Training corpus: a handful of tokenized sentences
sentences = [['this', 'is', 'the', 'an', 'apple', 'for', 'you'],
             ['this', 'is', 'the', 'an', 'orange', 'for', 'you'],
             ['this', 'is', 'the', 'an', 'banana', 'for', 'you'],
             ['apple','is','delicious'],
             ['apple','is','sad'],
             ['orange','is','delicious'],
             ['orange','is','sad'],
             ['apple','tests','delicious'],
             ['orange','tests','delicious']]
 
# Train the model on the corpus (min_count=1 keeps words that occur only once)
model = Word2Vec(sentences, window=5, min_count=1)
 
# Fit a 2-D PCA projection of the word vectors
# (gensim 3.x spelled this X = model[model.wv.vocab])
words = model.wv.index_to_key
X = model.wv[words]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
 
# Plot each word at its projected position and label it
pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

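Because the corpus is made up arbitrarily and is tiny, the trained word vectors show only weak relationships between words. The point here is mainly to show what kind of output the Word2Vec model produces for a given set of input words.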
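The PCA step in the code above reduces each word vector to two coordinates so it can be plotted. As a rough sketch of what that projection does, the function below (a hypothetical helper, applied to random stand-in data rather than trained vectors) centers the data and projects it onto its top two principal directions using NumPy's SVD. The signs of the axes may differ from scikit-learn's output, which applies its own sign convention.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their top two principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)           # PCA works on mean-centered data
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T              # coordinates along the first two directions


rng = np.random.default_rng(0)
# Stand-in for word vectors: 10 "words" with 100 dimensions each.
toy_vectors = rng.normal(size=(10, 100))
projected = pca_2d(toy_vectors)
print(projected.shape)
```

Each row of `projected` corresponds to one point in a scatter plot like the one drawn above.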

Hands-on practice

Now let us use a model already trained on the Google News corpus to see what the resulting word vectors can be used for.

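First, download pretrained Word2Vec vectors; versions trained on a variety of domains are available. The model trained on the Google News corpus can be found by searching for "Google News vectors negative 300". The file is 1.53 GB and contains 300-dimensional vectors for 3 million words and phrases.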

As in the simple visualization above, we need the gensim library. Assume the downloaded file is stored in the "wordpretrain" folder on drive E:.

from gensim.models.keyedvectors import KeyedVectors
 
# limit=1000000 loads only the first million vectors to save memory and time
word_vectors = KeyedVectors.load_word2vec_format(
    'E:/wordpretrain/GoogleNews-vectors-negative300.bin.gz',
    binary=True, limit=1000000)

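We now have a ready-made word vector model in which each word is represented by its own 300-dimensional vector. Let us look at a few simple ways to use it.

1. Look up the vector of any word: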

word_vectors['dog']

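However, it is hard to say what each individual dimension of this vector means.

2. Use the most_similar function to find words with similar meanings; the topn parameter sets how many words to list: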

word_vectors.most_similar(positive = ['nice'], topn = 5)

Each result is a (word, similarity) pair; the number is the cosine similarity between that word and the query word.
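To show what such a similarity score measures, here is a small NumPy sketch that ranks words by cosine similarity. The 2-D vectors and the helper names are invented for illustration; real Word2Vec vectors have 300 dimensions and gensim's own implementation is more efficient.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def rank_by_similarity(query, vectors, topn=5):
    """Return the `topn` words most similar to `query`, excluding the query itself."""
    scores = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, chosen by hand.
toy_vectors = {
    "nice": [1.0, 1.0],
    "good": [1.0, 0.8],
    "bad": [-1.0, -0.9],
    "car": [1.0, -1.0],
}
ranking = rank_by_similarity("nice", toy_vectors, topn=2)
print(ranking)
```

With these toy values "good" ranks first because it points in almost the same direction as "nice", while "car" is orthogonal to it and "bad" points the opposite way.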

3. If we add the vectors of father and woman and subtract the vector of man, we get:

word_vectors.most_similar(
    positive=['father', 'woman'], negative=['man'], topn=1)

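This is easy to see intuitively. Suppose there are only two dimensions, parenthood and gender, and the vector of "woman" is (0, 1), "man" is (0, -1), "father" is (1, -1) and "mother" is (1, 1). Then "father" + "woman" - "man" = (1, -1) + (0, 1) - (0, -1) = (1, 1) = "mother". The real vectors have 300 dimensions, but the principle is the same.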
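The two-dimensional example above can be checked directly in code. The sketch below uses the toy vectors from the text plus one hypothetical distractor word, adds and subtracts the vectors, and then picks the remaining word closest to the result by cosine similarity. It is a simplification: gensim works with length-normalized vectors, so its scores will not match this arithmetic exactly.

```python
import numpy as np

# Toy 2-D vectors from the text: (parenthood, gender).
toy_vectors = {
    "woman": np.array([0.0, 1.0]),
    "man": np.array([0.0, -1.0]),
    "father": np.array([1.0, -1.0]),
    "mother": np.array([1.0, 1.0]),
    "parent": np.array([1.0, 0.0]),  # hypothetical distractor, not in the text
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(positive, negative, vectors):
    """Add the positive vectors, subtract the negative ones, return the nearest word."""
    target = sum(vectors[w] for w in positive) - sum(vectors[w] for w in negative)
    excluded = set(positive) | set(negative)  # input words are not valid answers
    scores = {w: cosine(target, v) for w, v in vectors.items() if w not in excluded}
    return max(scores, key=scores.get), target


best_word, target = solve_analogy(["father", "woman"], ["man"], toy_vectors)
print(best_word, target)
```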

4. Visualization:

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
import adjustText

from jupyterthemes import jtplot
jtplot.style(theme='onedork')  # choose a plotting theme

def plot_2d_representation_of_words(
    word_list, 
    word_vectors, 
    flip_x_axis = False,
    flip_y_axis = False,
    label_x_axis = "x",
    label_y_axis = "y", 
    label_label = "fruit"):
    
    pca = PCA(n_components = 2)
    
    word_plus_coordinates=[]
    
    for word in word_list: 
        current_row = []
        current_row.append(word)
        current_row.extend(word_vectors[word])
        word_plus_coordinates.append(current_row)
    
    word_plus_coordinates = pd.DataFrame(word_plus_coordinates)
        
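    # Use every vector column; column 0 holds the word itself
    # (the earlier slice 1:300 dropped the last of the 300 dimensions)
    coordinates_2d = pca.fit_transform(
        word_plus_coordinates.iloc[:, 1:])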
    coordinates_2d = pd.DataFrame(
        coordinates_2d, columns=[label_x_axis, label_y_axis])
    coordinates_2d[label_label] = word_plus_coordinates.iloc[:,0]
    if flip_x_axis:
        coordinates_2d[label_x_axis] = \
        coordinates_2d[label_x_axis] * (-1)
    if flip_y_axis:
        coordinates_2d[label_y_axis] = \
        coordinates_2d[label_y_axis] * (-1)
            
    plt.figure(figsize = (15,10))
    p1=sns.scatterplot(
        data=coordinates_2d, x=label_x_axis, y=label_y_axis)
    
    x = coordinates_2d[label_x_axis]
    y = coordinates_2d[label_y_axis]
    label = coordinates_2d[label_label]
    
    texts = [plt.text(x[i], y[i], label[i]) for i in range(len(x))]
    adjustText.adjust_text(texts)

fruits = ['apple','orange','banana','lemon','car','tram','boat','bicycle',
          'cherry','mango','grape','durian','watermelon','train','motorbike','ship',
          'peach','pear','pomegranate','strawberry','bike','bus','truck','subway','airplane']

plot_2d_representation_of_words(
    word_list = fruits, 
    word_vectors = word_vectors, 
    flip_y_axis = True)

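Here I mixed a few vehicle words into the list of fruit words. The result is quite good: related words sit close together, and the two groups separate into clusters on their own.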
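One practical caveat: because the vectors were loaded with `limit=1000000`, a word that falls outside the first million entries is not available, and looking it up raises a `KeyError`. A small guard like the following can help; the helper name is hypothetical, and a plain dict stands in for the loaded vectors here (gensim's `KeyedVectors` supports the same `in` check).

```python
def split_known_words(words, vectors):
    """Separate words that have a vector from words that do not."""
    known = [word for word in words if word in vectors]
    missing = [word for word in words if word not in vectors]
    return known, missing


# A dict stands in for the loaded KeyedVectors object in this sketch.
toy_vectors = {"apple": [0.1, 0.2], "bus": [0.3, 0.4]}
known, missing = split_known_words(["apple", "durian", "bus"], toy_vectors)
print(known, missing)
```

Passing only `known` to the plotting function avoids the lookup error, and `missing` tells you which words were skipped.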

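Of course, these are only simple operations and applications of the Word2Vec model. It can be used for word-level tasks and also as input to many other models, including but not limited to: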

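· Computing similarity: finding similar words, information retrieval

· Input to models such as SVM/LSTM: Chinese word segmentation, named entity recognition

· Sentence representation: sentiment analysis

· Document representation: document topic classification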
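As one example of the sentence-representation use listed above, a common baseline is to average the vectors of the words in a sentence. The sketch below assumes hand-made 2-D vectors and a hypothetical helper name. It is only one simple option among many, and it ignores word order.

```python
import numpy as np


def sentence_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [np.asarray(vectors[token], dtype=float)
             for token in tokens if token in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical toy vectors; a real model would supply 300-dimensional ones.
toy_vectors = {"apple": [1.0, 0.0], "is": [0.0, 0.0], "delicious": [0.0, 1.0]}
vec = sentence_vector(["apple", "is", "delicious", "indeed"], toy_vectors, dim=2)
print(vec)
```

The resulting fixed-length vector can then be fed to a classifier such as an SVM, for instance for sentiment analysis.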

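Summary

From the hands-on examples and simple applications above, we can see the core idea behind training Word2Vec word vectors: if two words occur in similar contexts, their vectors are similar too.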

The above reflects my personal experience; I hope it serves as a useful reference.
