Python PaddleNLP: Automatically Generating Year-of-the-Tiger Acrostic Poems
I. Data Processing
This project uses a classical Chinese poetry dataset as the training set. The encoder receives the first character of each couplet, and the decoder generates the full couplet from the encoder's output. To keep successive couplets coherent, the preceding couplets are prepended to the head character in the encoder input. For example:
"白日依山盡,黃河入海流,欲窮千里目,更上一層樓。" yields two samples:
Sample 1: encoder input "白"; decoder input "白日依山盡,黃河入海流"
Sample 2: encoder input "白日依山盡,黃河入海流。欲"; decoder input "欲窮千里目,更上一層樓。"
1. Upgrading paddlenlp
!pip install -U paddlenlp
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting paddlenlp
  Downloading https://pypi.tuna.tsinghua.edu.cn/packages/17/9b/4535ccf0e96c302a3066bd2e4d0f44b6b1a73487c6793024475b48466c32/paddlenlp-2.2.3-py3-none-any.whl (1.2MB)
Requirement already satisfied, skipping upgrade: h5py, colorlog, colorama, seqeval, jieba, multiprocess (and their dependencies)
Installing collected packages: paddlenlp
  Found existing installation: paddlenlp 2.1.1
    Uninstalling paddlenlp-2.1.1:
      Successfully uninstalled paddlenlp-2.1.1
Successfully installed paddlenlp-2.2.3
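Before moving on, it is worth confirming the upgrade took effect (a quick optional check; if the old version still shows, restart the notebook kernel so the new package is picked up):

import paddlenlp
print(paddlenlp.__version__)  # expect 2.2.3 after the upgrade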
2. Extracting the head characters
import re

poems_file = open("./data/data70759/poems_zh.txt", encoding="utf8")

# For each line of poems read in, collect the head character of every couplet
poems_samples = []
poems_prefix = []
poems_heads = []
for line in poems_file.readlines():
    line_ = re.sub('。', ' ', line)
    line_ = line_.split()
    # Build the training samples
    for i, p in enumerate(line_):
        poems_heads.append(p[0])
        poems_prefix.append('。'.join(line_[:i]))
        poems_samples.append(p + '。')

# Print a few samples to inspect the data
for i in range(20):
    print("poems heads:{}, poems_prefix: {}, poems:{}".format(poems_heads[i], poems_prefix[i], poems_samples[i]))
poems heads:欲, poems_prefix: , poems:欲出未出光辣達,千山萬山如火發。
poems heads:須, poems_prefix: 欲出未出光辣達,千山萬山如火發, poems:須臾走向天上來,逐卻殘星趕卻月。
poems heads:未, poems_prefix: , poems:未離海底千山黑,才到天中萬國明。
poems heads:滿, poems_prefix: , poems:滿目江山四望幽,白云高卷嶂煙收。
poems heads:日, poems_prefix: 滿目江山四望幽,白云高卷嶂煙收, poems:日回禽影穿疏木,風遞猿聲入小樓。
poems heads:遠, poems_prefix: 滿目江山四望幽,白云高卷嶂煙收。日回禽影穿疏木,風遞猿聲入小樓, poems:遠岫似屏橫碧落,斷帆如葉截中流。
poems heads:片, poems_prefix: , poems:片片飛來靜又閑,樓頭江上復山前。
poems heads:飄, poems_prefix: 片片飛來靜又閑,樓頭江上復山前, poems:飄零盡日不歸去,帖破清光萬里天。
poems heads:因, poems_prefix: , poems:因登巨石知來處,勃勃元生綠蘚痕。
poems heads:靜, poems_prefix: 因登巨石知來處,勃勃元生綠蘚痕, poems:靜即等閑藏草木,動時頃刻徧乾坤。
poems heads:橫, poems_prefix: 因登巨石知來處,勃勃元生綠蘚痕。靜即等閑藏草木,動時頃刻徧乾坤, poems:橫天未必朋元惡,捧日還曾瑞至尊。
poems heads:不, poems_prefix: 因登巨石知來處,勃勃元生綠蘚痕。靜即等閑藏草木,動時頃刻徧乾坤。橫天未必朋元惡,捧日還曾瑞至尊, poems:不獨朝朝在巫峽,楚王何事謾勞魂。
poems heads:若, poems_prefix: , poems:若教作鎮居中國,爭得泥金在泰山。
poems heads:才, poems_prefix: , poems:才聞暖律先偷眼,既待和風始展眉。
poems heads:嚼, poems_prefix: , poems:嚼處春冰敲齒冷,咽時雪液沃心寒。
poems heads:蒙, poems_prefix: , poems:蒙君知重惠瓊實,薄起金刀釘玉深。
poems heads:深, poems_prefix: , poems:深妝玉瓦平無垅,亂拂蘆花細有聲。
poems heads:片, poems_prefix: , poems:片逐銀蟾落醉觥。
poems heads:巧, poems_prefix: , poems:巧剪銀花亂,輕飛玉葉狂。
poems heads:寒, poems_prefix: , poems:寒艷芳姿色盡明。
3. Building the vocabulary
# Build the vocabulary file with PaddleNLP. Since the poem lines are short,
# we use single characters as the token unit.
from paddlenlp.data import Vocab

vocab = Vocab.build_vocab(poems_samples, unk_token="<unk>", pad_token="<pad>", bos_token="<", eos_token=">")
vocab_size = len(vocab)
print("vocab size", vocab_size)
print("word to idx:", vocab.token_to_idx)
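As a quick illustration of how the vocabulary is used later, you can round-trip a string through it (a minimal sketch; to_indices/to_tokens are paddlenlp.data.Vocab helpers, and the example string is arbitrary):

# Encode a string character by character, then decode it back
ids = vocab.to_indices(list("白日依山盡"))
print(ids)                   # per-character token ids
print(vocab.to_tokens(ids))  # back to the original characters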
4. Defining the dataset
# Define the data reader
from paddle.io import Dataset, BatchSampler, DataLoader
import numpy as np

class PoemDataset(Dataset):
    def __init__(self, poems_data, poems_heads, poems_prefix, vocab, encoder_max_len=128, decoder_max_len=32):
        super(PoemDataset, self).__init__()
        self.poems_data = poems_data
        self.poems_heads = poems_heads
        self.poems_prefix = poems_prefix
        self.vocab = vocab
        self.tokenizer = lambda x: [vocab.token_to_idx[x_] for x_ in x]
        self.encoder_max_len = encoder_max_len
        self.decoder_max_len = decoder_max_len

    def __getitem__(self, idx):
        eos_id = self.vocab.token_to_idx[self.vocab.eos_token]
        bos_id = self.vocab.token_to_idx[self.vocab.bos_token]
        pad_id = self.vocab.token_to_idx[self.vocab.pad_token]
        # Make sure both encoder and decoder inputs stay within the max lengths
        poet = self.poems_data[idx][:self.decoder_max_len - 2]  # -2 leaves room for bos_id and eos_id
        prefix = self.poems_prefix[idx][-(self.encoder_max_len - 3):]  # -3 leaves room for bos_id, eos_id and the head token
        # Encode the inputs and outputs
        sample = [bos_id] + self.tokenizer(poet) + [eos_id]
        prefix = self.tokenizer(prefix) if prefix else []
        heads = prefix + [bos_id] + self.tokenizer(self.poems_heads[idx]) + [eos_id]
        sample_len = len(sample)
        heads_len = len(heads)
        sample = sample + [pad_id] * (self.decoder_max_len - sample_len)
        heads = heads + [pad_id] * (self.encoder_max_len - heads_len)
        mask = [1] * (sample_len - 1) + [0] * (self.decoder_max_len - sample_len)  # sample_len - 1 so the mask length matches out[2]
        out = [np.array(d, "int64") for d in [heads, heads_len, sample, sample, mask]]
        out[2] = out[2][:-1]
        out[3] = out[3][1:, np.newaxis]
        return out

    def shape(self):
        return [([None, self.encoder_max_len], 'int64', 'src'),
                ([None, 1], 'int64', 'src_length'),
                ([None, self.decoder_max_len - 1], 'int64', 'trg')], \
               [([None, self.decoder_max_len - 1, 1], 'int64', 'label'),
                ([None, self.decoder_max_len - 1], 'int64', 'trg_mask')]

    def __len__(self):
        return len(self.poems_data)

dataset = PoemDataset(poems_samples, poems_heads, poems_prefix, vocab)
batch_sampler = BatchSampler(dataset, batch_size=2048)
data_loader = DataLoader(dataset, batch_sampler=batch_sampler)
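It can help to inspect one sample and confirm the arrays match what shape() declares (a quick check using the defaults encoder_max_len=128 and decoder_max_len=32 from above):

src, src_length, trg, label, trg_mask = dataset[0]
print(src.shape)       # (128,)  padded encoder input
print(src_length)      # actual (unpadded) encoder length
print(trg.shape)       # (31,)   decoder input, shifted right
print(label.shape)     # (31, 1) decoder target, shifted left
print(trg_mask.shape)  # (31,)   mask of valid target positions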
II. Defining and Training the Model
1. Model definition
from Seq2Seq.models import Seq2SeqModel
from paddlenlp.metrics import Perplexity
from Seq2Seq.loss import CrossEntropyCriterion
import paddle
from paddle.static import InputSpec

# Hyperparameters
lr = 1e-6
max_epoch = 20
models_save_path = "./checkpoints"
encoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4,
                 "dropout": .2, "direction": "bidirectional", "mode": "GRU"}
decoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4,
                 "direction": "forward", "dropout": .2, "mode": "GRU", "use_attention": True}

# Input and label shapes
inputs_shape, labels_shape = dataset.shape()
inputs_list = [InputSpec(input_shape[0], input_shape[1], input_shape[2]) for input_shape in inputs_shape]
labels_list = [InputSpec(label_shape[0], label_shape[1], label_shape[2]) for label_shape in labels_shape]

net = Seq2SeqModel(encoder_attrs, decoder_attrs)
model = paddle.Model(net, inputs_list, labels_list)
model.load("./final_models/model")

opt = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters())
model.prepare(opt, CrossEntropyCriterion(), Perplexity())
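Note that Seq2SeqModel and CrossEntropyCriterion appear to come from the project's local Seq2Seq package rather than from paddlenlp itself. For a rough sense of model size, you can count the trainable parameters (a minimal sketch, assuming net is a standard paddle.nn.Layer):

import numpy as np
num_params = sum(int(np.prod(p.shape)) for p in net.parameters())
print("trainable parameters: {:,}".format(num_params))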
W0122 21:03:30.616776   166 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0122 21:03:30.620450   166 device_context.cc:465] device: 0, cuDNN Version: 7.6.
2. Model training
# Training. This takes quite a while; a trained model is already provided (./final_models/model)
model.fit(train_data=data_loader,
          epochs=max_epoch,
          eval_freq=1,
          save_freq=5,
          save_dir=models_save_path,
          shuffle=True)
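If you want to watch the loss and perplexity curves during the long run, the high-level fit API accepts callbacks; a sketch using VisualDL (assuming the paddle.callbacks.VisualDL callback available in Paddle 2.x, with a hypothetical log directory):

from paddle.callbacks import VisualDL

model.fit(train_data=data_loader,
          epochs=max_epoch,
          eval_freq=1,
          save_freq=5,
          save_dir=models_save_path,
          shuffle=True,
          callbacks=[VisualDL(log_dir="./logs")])  # view with: visualdl --logdir ./logs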
3. Saving the model
# Save the final model
model.save("./final_models/model")
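With the default training=True, paddle.Model.save should write both the weights and the optimizer state; listing the directory is an easy way to confirm (the exact file names below are an assumption based on Paddle 2.x conventions):

import os
print(os.listdir("./final_models"))
# expected something like: ['model.pdparams', 'model.pdopt']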
III. Generating the Acrostic Poems
import warnings

def post_process_seq(seq, bos_idx, eos_idx, output_bos=False, output_eos=False):
    """
    Post-process the decoded sequence: cut everything after the first eos
    and optionally strip the bos/eos tokens themselves.
    """
    eos_pos = len(seq) - 1
    for i, idx in enumerate(seq):
        if idx == eos_idx:
            eos_pos = i
            break
    seq = [idx for idx in seq[:eos_pos + 1]
           if (output_bos or idx != bos_idx) and (output_eos or idx != eos_idx)]
    return seq

# Define the class that turns a blessing phrase into an acrostic poem
class GenPoems():
    # content (str): the string to build the poem from, e.g. "恭喜發財"
    # vocab: an instance of paddlenlp.data.vocab.Vocab
    # model: the inference model
    def __init__(self, vocab, model):
        self.bos_id = vocab.token_to_idx[vocab.bos_token]
        self.eos_id = vocab.token_to_idx[vocab.eos_token]
        self.pad_id = vocab.token_to_idx[vocab.pad_token]
        self.tokenizer = lambda x: [vocab.token_to_idx[x_] for x_ in x]
        self.model = model
        self.vocab = vocab

    def gen(self, content, max_len=128):
        # max_len is the encoder_max_len of the Seq2Seq model
        out = []
        vocab_list = list(self.vocab.token_to_idx.keys())
        for w in content:
            if w in vocab_list:
                content = re.sub("([。,])", '', content)
                # Encoder input: previously generated lines + bos + head character + eos
                heads = out[-(max_len - 3):] + [self.bos_id] + self.tokenizer(w) + [self.eos_id]
                len_heads = len(heads)
                heads = heads + [self.pad_id] * (max_len - len_heads)
                x = paddle.to_tensor([heads], dtype="int64")
                len_x = paddle.to_tensor([len_heads], dtype="int64")
                pred = self.model.predict_batch(inputs=[x, len_x])[0]
                out += self._get_results(pred)[0]  # keep the top beam
            else:
                warnings.warn("{} is not in the vocab list, so it is skipped.".format(w))
        out = ''.join([self.vocab.idx_to_token[id] for id in out])
        return out

    def _get_results(self, pred):
        pred = pred[:, :, np.newaxis] if len(pred.shape) == 2 else pred
        pred = np.transpose(pred, [0, 2, 1])
        outs = []
        for beam in pred[0]:
            id_list = post_process_seq(beam, self.bos_id, self.eos_id)
            outs.append(id_list)
        return outs
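A quick sanity check of post_process_seq with made-up token ids (the id values here are hypothetical, not taken from the real vocab):

# Toy example: bos=2, eos=3, pad=0
seq = [2, 15, 37, 3, 0, 0]
print(post_process_seq(seq, bos_idx=2, eos_idx=3))  # -> [15, 37]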
# Load the inference model
from Seq2Seq.models import Seq2SeqInferModel
import paddle

encoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4,
                 "dropout": .2, "direction": "bidirectional", "mode": "GRU"}
decoder_attrs = {"vocab_size": vocab_size, "embed_dim": 200, "hidden_size": 128, "num_layers": 4,
                 "direction": "forward", "dropout": .2, "mode": "GRU", "use_attention": True}

infer_model = paddle.Model(Seq2SeqInferModel(encoder_attrs,
                                             decoder_attrs,
                                             bos_id=vocab.token_to_idx[vocab.bos_token],
                                             eos_id=vocab.token_to_idx[vocab.eos_token],
                                             beam_size=10,
                                             max_out_len=256))
infer_model.load("./final_models/model")
# Send a New Year blessing
# (of course, a love confession works too)
generator = GenPoems(vocab, infer_model)
content = "生龍活虎"
poet = generator.gen(content)
for line in poet.strip().split('。'):
    try:
        print("{}\t{}。".format(line[0], line))
    except IndexError:  # skip empty segments produced by the trailing '。'
        pass
Output:
生 生涯不可見,何處不相逢。
龍 龍虎不知何處,人間不見人間。
活 活人不是人間事,不覺人間不可識。
虎 虎豹相逢不可尋,不知何處不相識。
Summary
This project walked through training a model that generates acrostic poems, and the results show it has picked up a basic ability to compose verse. Constrained by the size of the training set and the training time, however, the generated lines still leave considerable room for improvement, and further optimization of the model is planned.