
Using BertModel in PyTorch

Updated: 2021-03-26 14:40:19  Author: 無聊的人生事無聊

This article explains how to use BertModel in PyTorch and is meant as a practical reference.

Overview

Environment: Python 3.5+, PyTorch 0.4.1/1.0.0

Installation:

pip install pytorch-pretrained-bert

Required parameters:

--data_dir: "str": root directory of the data. It holds three data files: train.xxx, dev.xxx, and test.xxx.

--vocab_dir: "str": path to the vocabulary file.

--bert_model: "str": the pre-trained BERT model. It must be a gz archive, e.g. "..x/xx/bert-base-chinese.tar.gz ", containing a bert_config.json file and a pytorch_model.bin file.

--task_name: "str": selects the dataset to use, e.g. "cola" selects the matching dataset.

--output_dir: "str": directory where the model's predictions and parameters are saved.

A simple example:

Import the required packages

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

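Create the tokenizer

tokenizer = BertTokenizer.from_pretrained(vocab_dir)  # vocab_dir: the value passed as --vocab_dir

Required parameter: --vocab_dir (the vocabulary file).

Available methods:

tokenize: takes a sentence and splits it greedily (longest match first) against the vocabulary in --vocab_dir. Returns a list of tokens.

convert_tokens_to_ids: converts the token list into the list of matching vocabulary ids.

convert_ids_to_tokens: converts a list of ids back into a list of tokens.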
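To make the "greedy" splitting concrete, here is a minimal pure-Python sketch of longest-match-first sub-word lookup. It is an illustration only, not the library's implementation: the function name wordpiece_tokenize and the toy vocabulary are hypothetical, and a real tokenizer also normalizes the text and splits on whitespace and punctuation first.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-tokens by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a "##" prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: treat the whole word as unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary (hypothetical); a real vocabulary file lists one token per line.
toy_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "un", "##aff", "##able", "cl", "##s"]
token_to_id = {tok: i for i, tok in enumerate(toy_vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

pieces = wordpiece_tokenize("unaffable", token_to_id)
ids = [token_to_id[t] for t in pieces]      # analogous to convert_tokens_to_ids
round_trip = [id_to_token[i] for i in ids]  # analogous to convert_ids_to_tokens
print(pieces, ids, round_trip)
```

With this toy vocabulary the word "cls" comes out as 'cl' and '##s', which is the same kind of split that appears when a marker such as [CLS] is not protected from tokenization.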

text = '[CLS] 武松打老虎 [SEP] 你在哪 [SEP]'
tokenized_text = tokenizer.tokenize(text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# 18 entries to match the 18 tokens printed below: 12 for sentence A (through its [SEP]), 6 for sentence B
segments_ids = [0] * 12 + [1] * 6
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])

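The marker tokens ([CLS]/[SEP]) appear to be split incorrectly here: they are lowercased and broken into pieces. In addition, Chinese BERT encodes at the character level, so the text comes out as individual Chinese characters: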

['[', 'cl', '##s', ']', '武', '松', '打', '老', '虎', '[', 'sep', ']', '你', '在', '哪', '[', 'sep', ']']
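One way to sidestep the broken markers is to tokenize each sentence on its own and add [CLS] and [SEP] to the token list afterwards, which also makes the segment ids easy to derive. The sketch below assumes character-level splitting, so list() stands in for tokenizer.tokenize(); the helper name build_pair_inputs is hypothetical.

```python
def build_pair_inputs(sentence_a, sentence_b):
    """Return (tokens, segment_ids) for a sentence pair, one token per character."""
    # Assumption: character-level tokens; in real use call tokenizer.tokenize() per sentence.
    tokens_a = list(sentence_a)
    tokens_b = list(sentence_b)
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_pair_inputs("武松打老虎", "你在哪")
print(tokens)
print(segment_ids)
```

The resulting token list can then be passed to convert_tokens_to_ids, and both lists always have the same length by construction.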

Create the BERT model and load the pre-trained weights:

model = BertModel.from_pretrained(bert_model)  # bert_model: the value passed as --bert_model

Move everything to the GPU:

tokens_tensor = tokens_tensor.cuda()
segments_tensors = segments_tensors.cuda()
model.cuda()

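Forward pass:

encoded_layers, pooled_output = model(tokens_tensor, segments_tensors)

Parameters:

input_ids: (batch_size, seq_len) tensor holding the token ids of the input instances.

token_type_ids=None: (batch_size, seq_len) an instance may contain two sentences; this tensor marks which sentence each token belongs to (0 or 1).

attention_mask=None: (batch_size, seq_len) holds 1 for real tokens and 0 for padding, so that attention ignores the padded positions when instances in a batch differ in length.

output_all_encoded_layers=True: controls whether the outputs of all encoder layers are returned, or only the last one.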
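In pytorch-pretrained-bert, attention_mask has the same shape as input_ids, with 1 at real tokens and 0 at padding. The following pure-Python sketch shows how a batch of id lists of different lengths could be padded and masked before being wrapped in torch.tensor(...). The helper name pad_batch is hypothetical, and pad_id=0 assumes [PAD] has id 0 in the vocabulary.

```python
def pad_batch(batch_ids, pad_id=0):
    """Pad every id list to the longest one and build the matching 0/1 mask."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


batch = [
    [101, 2769, 3221, 102],
    [101, 2769, 3221, 704, 1744, 782, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```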

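Return values:

encoded_layers: a list of length num_hidden_layers containing tensors of shape (batch_size, sequence_length, hidden_size). With output_all_encoded_layers=False, only the last layer's tensor is returned.

pooled_output: (batch_size, hidden_size) tensor obtained by passing the last encoder layer's hidden state for the first token, [CLS], through a Linear layer and a Tanh() activation. It represents the sentence as a whole.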
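The pooling step described for pooled_output can be written out by hand. The sketch below works on plain Python lists for a single instance; the function name pool_cls and the toy weights are hypothetical, and the real model uses learned weights of size hidden_size x hidden_size.

```python
import math


def pool_cls(last_layer, weight, bias):
    """Apply tanh(W @ h_cls + b) to the first token's hidden vector."""
    h_cls = last_layer[0]  # hidden state of the first token, [CLS]
    return [
        math.tanh(sum(w * h for w, h in zip(row, h_cls)) + b)
        for row, b in zip(weight, bias)
    ]


# Toy last-layer output: 3 tokens, hidden_size = 2 (made-up numbers).
last_layer = [[0.5, -1.0], [0.2, 0.3], [0.9, 0.1]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, for readability only
bias = [0.0, 0.0]

pooled = pool_cls(last_layer, weight, bias)
print(pooled)
```

Only the [CLS] row of the last layer contributes; the other token vectors are ignored by this step.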

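Supplement: using BERT in PyTorch

The main steps are:

Download the model and put it in a directory.

Load the model and tokenizer with BertModel and BertTokenizer from transformers.

Encode and decode with the tokenizer's encode and decode functions; note the add_special_tokens and skip_special_tokens parameters.

The input to forward is a tensor of shape [batch_size, seq_length]; also pay attention to the attention_mask parameter.

The output is a tuple whose first value is the hidden_state of BERT's last transformer layer, with size [batch_size, seq_length, hidden_size]. This is BERT's final output, which is then used for downstream tasks.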

# -*- encoding: utf-8 -*-
import os
import warnings
from os.path import dirname, abspath

import torch
from transformers import BertModel, BertTokenizer

warnings.filterwarnings('ignore')
root_dir = dirname(dirname(dirname(abspath(__file__))))
# Download the pre-trained model from the official site and put it in this directory
pretrained_path = os.path.join(root_dir, 'pretrained/bert_zh')
# Load the BERT model from the directory
model = BertModel.from_pretrained(pretrained_path)
# Load the vocabulary from the same directory
tokenizer = BertTokenizer.from_pretrained(pretrained_path)
print(f'vocab size :{tokenizer.vocab_size}')
# Encode '[PAD]' and '[SEP]'
print(tokenizer.encode('[PAD]'))
print(tokenizer.encode('[SEP]'))
# Encode a Chinese sentence; special tokens are added by default:
# [CLS] at the start of the sentence and [SEP] at the end
ids = tokenizer.encode("我是中国人", add_special_tokens=True)
# In the result, 101 is the id of [CLS] and 2769 is the id of "我"
# [101, 2769, 3221, 704, 1744, 782, 102]
print(ids)
# Decode the ids back to Chinese; special tokens are not skipped by default
print(tokenizer.decode([101, 2769, 3221, 704, 1744, 782, 102], skip_special_tokens=False))
# print(model)
inputs = torch.tensor(ids).unsqueeze(0)  # add the batch dimension: shape [1, 7]
# forward: result can be indexed like a tuple; its first tensor is the last hidden state
result = model(inputs)
# [1, 7, 768]: 7 tokens, counting [CLS] and [SEP]
print(result[0].size())
# [1, 768]
print(result[1].size())
for name, parameter in model.named_parameters():
  # Print the name of each parameter
  print(name)
  # Every parameter has requires_grad=True by default, i.e. it is trainable
  print(parameter.requires_grad)
  # To train only the parameters of transformer layer 11, freeze everything else
  parameter.requires_grad = 'encoder.layer.11.' in name
print([p.requires_grad for name, p in model.named_parameters()])
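Selecting layers by substring works for a single layer, but it gets awkward when several top layers should stay trainable. A possible alternative, sketched here on plain strings, is to parse the layer number out of the parameter name. The helper names layer_index and trainable_flags are hypothetical, and the sample names only imitate the usual "encoder.layer.N." pattern of BertModel parameters.

```python
import re


def layer_index(param_name):
    """Return the encoder layer number found in a parameter name, or None."""
    match = re.search(r"encoder\.layer\.(\d+)\.", param_name)
    return int(match.group(1)) if match else None


def trainable_flags(param_names, first_trainable_layer):
    """Map each name to True if it belongs to an encoder layer >= first_trainable_layer."""
    flags = {}
    for name in param_names:
        idx = layer_index(name)
        flags[name] = idx is not None and idx >= first_trainable_layer
    return flags


# Sample names for illustration; in practice take them from model.named_parameters().
sample_names = [
    "embeddings.word_embeddings.weight",
    "encoder.layer.1.attention.self.query.weight",
    "encoder.layer.10.output.dense.weight",
    "encoder.layer.11.output.dense.bias",
    "pooler.dense.weight",
]
flags = trainable_flags(sample_names, first_trainable_layer=10)
for name, flag in flags.items():
    print(flag, name)
```

Each flag would then be assigned to the requires_grad attribute of the parameter with the same name.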

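Building an attention mask (atten_mask):

Here 101 is [CLS], 102 is [SEP], and 0 is [PAD]. The notpad tensor on its own is the usual padding mask; the final mask additionally excludes [CLS] and [SEP].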

>>> a
tensor([[101,  3,  4, 23, 11,  1, 102,  0,  0,  0]])
>>> notpad = a!=0
>>> notpad
tensor([[ True, True, True, True, True, True, True, False, False, False]])
>>> notcls = a!=101
>>> notcls
tensor([[False, True, True, True, True, True, True, True, True, True]])
>>> notsep = a!=102
>>> notsep
tensor([[ True, True, True, True, True, True, False, True, True, True]])
>>> mask = notpad & notcls & notsep
>>> mask
tensor([[False, True, True, True, True, True, False, False, False, False]])
>>>
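The article does not say what this mask is used for afterwards. One possible use, shown here as an assumption, is averaging the hidden vectors of the content tokens only. The sketch uses plain Python lists rather than tensors, and the names content_mask and masked_mean are hypothetical.

```python
def content_mask(ids, special_ids=(0, 101, 102)):
    """True where the id is a content token, False at [PAD], [CLS] and [SEP]."""
    return [tok not in special_ids for tok in ids]


def masked_mean(vectors, mask):
    """Average the vectors at positions where mask is True."""
    kept = [vec for vec, keep in zip(vectors, mask) if keep]
    if not kept:
        raise ValueError("mask removes every position")
    return [sum(column) / len(kept) for column in zip(*kept)]


ids = [101, 3, 4, 23, 11, 1, 102, 0, 0, 0]
mask = content_mask(ids)

# Made-up hidden states with hidden_size = 2: position i holds [i, 1.0].
hidden = [[float(i), 1.0] for i in range(len(ids))]
sentence_vector = masked_mean(hidden, mask)
print(mask)
print(sentence_vector)
```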

The above is based on personal experience; I hope it serves as a useful reference. If anything is wrong or incomplete, corrections are welcome.
