How to Embed Feature Vectors with Embedding in PyTorch

Updated: 2024-02-27 09:30:57  Author: Vic·Tory

This article explains how to use embedding in PyTorch to embed feature vectors. Corrections and comments on anything missed are welcome.

Word Embeddings

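In NLP, one of the most basic problems is how to represent a word in a computer. The usual starting point is a vocabulary of N words in which each word is assigned an index. For example, the vocabulary {"hello": 0, "world": 1, "nice": 2, "to": 3, "see": 4, "you": 5} contains only 6 words, so "nice" is encoded as 2. A real text, however, may contain hundreds, thousands, or even more distinct words. This is where the embedding operation comes in: it compresses the word representation, using far fewer dimensions to cover a large vocabulary space.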
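To make the index encoding concrete, here is a minimal sketch in plain Python. The `one_hot` helper is a hypothetical name introduced only for illustration; it shows the sparse vector that an index implicitly stands for, whose length grows with the vocabulary size N.

```python
# Vocabulary from the text: each word maps to an integer index.
word_to_ix = {"hello": 0, "world": 1, "nice": 2, "to": 3, "see": 4, "you": 5}


def one_hot(word, vocab):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


nice_index = word_to_ix["nice"]
nice_one_hot = one_hot("nice", word_to_ix)

print(nice_index)    # 2
print(nice_one_hot)  # [0, 0, 1, 0, 0, 0]
```

With 6 words the one-hot vector has 6 entries; with a vocabulary of tens of thousands of words it would have tens of thousands, which is the growth that a dense embedding avoids.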

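According to the distributional hypothesis in linguistics, words that appear in similar contexts are semantically related. If each word is represented by a single number and treated in isolation, there is no way to discover the relationships between words or to express the similarity between their vectors, that is, their semantic similarity. An embedding, by contrast, can capture the similarity between word vectors and can be used to make predictions about new sentences based on the words and sentences already seen.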

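So how does an embedding encode a word? It encodes the word in terms of semantic attributes. Take the two words mathematicians and physicists, and suppose we score each of them on attributes such as ability to run (can run), fondness for coffee (likes coffee), and having studied physics (majored in physics). This gives two word vectors, m = [2.3, 9.4, -5.5] and p = [2.5, 9.1, 6.4]. The similarity of the two vectors can then be expressed through the angle between them: cos α = (m · p) / (|m| |p|). The more similar the two vectors are, the closer cos α is to 1; the more they point in opposite directions, the closer it is to -1. In this way each word is encoded as a three-dimensional vector, and the semantic similarity between words can be measured.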

mathematician = { "can run": 2.3, "likes coffee": 9.4, "majored in Physics": -5.5, … }
physicist = { "can run": 2.5, "likes coffee": 9.1, "majored in Physics": 6.4, … }
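The cosine formula can be checked directly on the two example vectors. The following is a minimal sketch in plain Python; the function name `cosine_similarity` is chosen here for illustration and is not part of the original article.

```python
import math


def cosine_similarity(a, b):
    """cos(alpha) = (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


m = [2.3, 9.4, -5.5]  # mathematician
p = [2.5, 9.1, 6.4]   # physicist

similarity = cosine_similarity(m, p)
print(round(similarity, 2))  # 0.44
```

The two vectors agree on the first two attributes but disagree strongly on the third, so the similarity is positive but well below 1.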

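But mathematicians and physicists could be described in thousands of different ways. Which attributes should we choose, and what values should they take?

Compared with hand-defined attributes, deep neural networks are good at feature learning. We can therefore let the embedding be trained and updated automatically during training until it finds suitable attributes for representing the word vectors.

The vectors obtained after training do not correspond to nameable semantic attributes. Looking directly at m = [2.3, 9.4, -5.5], for instance, it is just a vector of three numbers with no meaning we can interpret.

To sum up, PyTorch's embedding has three characteristics: it compresses high-dimensional vectors, it uncovers relationships between vectors, and it is learned and updated automatically during training.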

The Embedding Layer

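An embedding model is built with Embedding() from the torch.nn library. It takes two required arguments: the number of distinct input values (num_embeddings) and the dimension of the output vector (embedding_dim). In other words, how many different values the input can take, and how many numbers should be used to represent each of them.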
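As a small sketch of what these two arguments produce, the snippet below builds an embedding and inspects its weight table. The toy loss (the sum of one embedding vector) and the learning rate are assumptions made purely to show that the table is a trainable parameter; they are not part of the original article.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 6 possible input indices, each mapped to a 2-dimensional vector.
embeds = nn.Embedding(num_embeddings=6, embedding_dim=2)
print(embeds.weight.shape)          # torch.Size([6, 2])
print(embeds.weight.requires_grad)  # True

# One SGD step on a toy loss that only involves index 2.
before = embeds.weight.detach().clone()
optimizer = torch.optim.SGD(embeds.parameters(), lr=0.1)
loss = embeds(torch.tensor([2])).sum()  # hypothetical loss, for illustration only
loss.backward()
optimizer.step()

# Only the row that was looked up receives a gradient and changes.
changed_rows = (embeds.weight.detach() != before).any(dim=1).tolist()
print(changed_rows)  # [False, False, True, False, False, False]
```

The weight table has one row per input value, and only rows that take part in the forward pass are updated by that training step.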

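Note that the input to an embedding must be a tensor of integer indices; torch.long is the usual choice.

In the example that follows, the vocabulary word_to_ix contains 6 different words and each word should be represented by a 2-dimensional vector, so the model is built as Embedding(6, 2).

Suppose we want to encode "nice". We first look up its index in the vocabulary, which is 2, to get the raw encoding lookup_tensor. Passing it through the embedding gives the encoded vector nice_embed.

We can also feed in 10 words at once, that is, a tensor of shape (10,). The embedding is applied to each index and returns a tensor of shape (10, 2).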

import torch
import torch.nn as nn

word_to_ix = {"hello": 0, "world": 1, "nice": 2, "to": 3, "see": 4, "you": 5}
embeds = nn.Embedding(6, 2)  # build the embedding: 6 words, 2 dimensions each
lookup_tensor = torch.tensor([word_to_ix["nice"]], dtype=torch.long)
print(lookup_tensor)
nice_embed = embeds(lookup_tensor)     # look up the embedding vector
print(nice_embed)
'''
Example output (the weights are randomly initialized, so the values differ between runs):
tensor([2])
tensor([[ 0.7950, -0.3999]], grad_fn=<EmbeddingBackward>)
'''
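The batched case mentioned above can be sketched as follows. The 10-word sentence is made up from the same vocabulary for illustration. The snippet also checks two equivalent views of the operation: an embedding lookup is the same as indexing rows of the weight table, and the same as multiplying a one-hot matrix by that table.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

word_to_ix = {"hello": 0, "world": 1, "nice": 2, "to": 3, "see": 4, "you": 5}
embeds = nn.Embedding(len(word_to_ix), 2)

# A hypothetical 10-word input built from the vocabulary.
sentence = "hello world nice to see you nice to see you".split()
indices = torch.tensor([word_to_ix[w] for w in sentence], dtype=torch.long)

batch_embed = embeds(indices)
print(indices.shape)      # torch.Size([10])
print(batch_embed.shape)  # torch.Size([10, 2])

# The lookup is plain row indexing into the weight table ...
same_as_indexing = torch.equal(batch_embed, embeds.weight[indices])

# ... and equals a one-hot matrix (10 x 6) times the weight table (6 x 2).
one_hot = F.one_hot(indices, num_classes=len(word_to_ix)).float()
same_as_matmul = torch.allclose(batch_embed, one_hot @ embeds.weight)

print(same_as_indexing, same_as_matmul)  # True True
```

Repeated words such as "nice" map to the same row, so they receive identical vectors.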

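A Concrete Example

Below is an example of training a model that uses an embedding. The training material is a passage of Shakespeare's poetry, and the model is trained to predict the third word from the two words that precede it.

The model embeds the input words, passes the result through two fully connected layers, and outputs the prediction.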

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10  # dimension of each embedding vector

test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# Build the training data: ([first word, second word], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# Build the vocabulary and the word-to-index mapping
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

# Define the model
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))  # embed the context words and flatten to one row
        out = F.relu(self.linear1(embeds))  # first fully connected layer
        out = self.linear2(out)  # second fully connected layer
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

# Train the model
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:
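        # Prepare the model input: the context words as a tensor of indices
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        model.zero_grad()  # clear the gradients left over from the previous step

        # Forward pass: log-probabilities over the vocabulary for the next word
        log_probs = model(context_idxs)

        # Compute the loss against the index of the true next word
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Backpropagate to compute gradients, then update the parameters
        loss.backward()
        optimizer.step()

        total_loss += loss.item()  # accumulate the loss for this epoch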
    losses.append(total_loss)
print(losses)
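The model above handles one context at a time because of view((1, -1)). As a hedged variation, and not the author's code, the sketch below reshapes by batch size so that all trigrams are processed in a single forward pass, then uses the trained model to predict a next word and to read out a learned word vector. The toy corpus, the class name TinyNGram, the hidden size, the learning rate, and the epoch count are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy corpus, used only to keep the example small.
words = "the cat sat on the mat and the dog sat on the rug".split()
vocab = sorted(set(words))
word_to_ix = {w: i for i, w in enumerate(vocab)}
ix_to_word = {i: w for w, i in word_to_ix.items()}

# All (two-word context, target) pairs as index tensors.
n = len(words) - 2
contexts = torch.tensor(
    [[word_to_ix[words[i]], word_to_ix[words[i + 1]]] for i in range(n)],
    dtype=torch.long)                                    # shape (n, 2)
targets = torch.tensor(
    [word_to_ix[words[i + 2]] for i in range(n)], dtype=torch.long)  # shape (n,)


class TinyNGram(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 32)
        self.linear2 = nn.Linear(32, vocab_size)

    def forward(self, inputs):
        # (batch, context) -> (batch, context, dim) -> (batch, context * dim)
        embeds = self.embeddings(inputs).view(inputs.shape[0], -1)
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)


model = TinyNGram(len(vocab), embedding_dim=8, context_size=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_function = nn.NLLLoss()

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_function(model(contexts), targets)  # whole data set as one batch
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    query = torch.tensor([[word_to_ix["the"], word_to_ix["cat"]]])
    predicted = ix_to_word[model(query).argmax(dim=1).item()]
    cat_vector = model.embeddings.weight[word_to_ix["cat"]]

print(losses[0], losses[-1])  # the loss should fall over training
print(predicted)              # the model's guess for the word after "the cat"
print(cat_vector.shape)       # torch.Size([8])
```

Batching changes only how the data is fed to the model. The embedding table is still learned through the same backward pass, and each row of model.embeddings.weight is the learned vector for one word.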