Keras: a simple LSTM example (based on one-hot encoding)

Updated: 2020-07-02 10:19:59  Author: 趕圩歸來阿理理

This article walks through a simple Keras LSTM example that uses one-hot encoding.

A simple LSTM task: predict the next word (here, a single letter) of a sentence.

Sentences have a fixed length: each sentence has 3 words.

Words are represented with one-hot encoding.
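To make the encoding concrete before the Keras code, here is a minimal NumPy-only sketch of one-hot encoding a short string. The names `alphabet`, `char_to_index` and `one_hot` are illustrative and do not appear in the article; the alphabetical index order is an assumption.

```python
import numpy as np

# Hypothetical vocabulary: the 26 lowercase letters in alphabetical order
alphabet = "abcdefghijklmnopqrstuvwxyz"
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

def one_hot(text, num_classes=len(alphabet)):
    """Return an array of shape (len(text), num_classes) with a single 1 per row."""
    indices = [char_to_index[ch] for ch in text]
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes, dtype=np.float32)[indices]

encoded = one_hot("abc")
print(encoded.shape)           # one row per character, one column per letter
print(encoded.argmax(axis=1))  # position of the 1 in each row
```

Each 3-letter sentence therefore becomes a 3 x 26 matrix, which is the per-sample input shape the LSTM is given later.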

Imports

import keras
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
import numpy as np

Data preprocessing

data = 'abcdefghijklmnopqrstuvwxyz'
# Sort the characters: a bare set has no stable order, so the mapping could change between runs
data_set = sorted(set(data))

word_2_int = {b: a for a, b in enumerate(data_set)}
int_2_word = {a: b for a, b in enumerate(data_set)}

word_len = len(data_set)
print(word_2_int)
print(int_2_word)

Helper functions

def words_2_ints(words):
    # Map each character to its integer index
    ints = []
    for itmp in words:
        ints.append(word_2_int[itmp])
    return ints

print(words_2_ints('ab'))

def words_2_one_hot(words, num_classes=word_len):
    # One row per character, shape (len(words), num_classes)
    return keras.utils.to_categorical(words_2_ints(words), num_classes=num_classes)

print(words_2_one_hot('a'))

def get_one_hot_max_idx(one_hot):
    # Index of the largest entry; np.argmax replaces the manual loop
    return int(np.argmax(one_hot))

def one_hot_2_words(one_hot):
    # Decode each row back to its character
    tmp = []
    for itmp in one_hot:
        tmp.append(int_2_word[get_one_hot_max_idx(itmp)])
    return "".join(tmp)

print(one_hot_2_words(words_2_one_hot('adhjlkw')))

Building the samples

time_step = 3  # each sentence has 3 words

def genarate_data(batch_size=5, genarate_num=100):
    # genarate_num=-1 loops forever; genarate_num=1 yields a single batch, and so on.
    # The generator does not track how much data there is; it just keeps cycling over it.
    # batch_size is the number of samples in each yielded batch.
    '''
    For example, with batch_size=5 the first batch holds these samples:
    abc->d
    bcd->e
    cde->f
    def->g
    efg->h
    All samples are then converted to one-hot form. The inputs x are laid out as:

    [batch 1]
    [batch 2]
    ...
    [batch genarate_num]

    Each batch looks like:

    [sentence 1 (e.g. abc)]
    [sentence 2 (e.g. bcd)]
    ...
    Each sentence looks like:

    [one-hot vector of word 1]
    [one-hot vector of word 2]
    ...
    '''
    cnt = 0
    batch_x = []
    batch_y = []
    sample_num = 0
    while True:
        for i in range(len(data) - time_step):
            batch_x.append(words_2_one_hot(data[i : i+time_step]))
            # [0] drops the extra axis: words_2_one_hot returns shape (1, word_len) for a
            # single character, but Keras expects each target to have shape (word_len,).
            # Try removing [0] and look at what the test below prints.
            batch_y.append(words_2_one_hot(data[i+time_step])[0])
            sample_num += 1
            #print('sample num is :', sample_num)
            if len(batch_x) == batch_size:
                yield (np.array(batch_x), np.array(batch_y))
                batch_x = []
                batch_y = []
                if genarate_num != -1:
                    cnt += 1

                if cnt == genarate_num:
                    return

for test in genarate_data(batch_size=3, genarate_num=1):
    print('--------x:')
    print(test[0])
    print('--------y:')
    print(test[1])
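The sliding-window construction above can also be checked without Keras. The following is a small NumPy-only sketch, not the article's generator: it builds every window at once so the resulting shapes are easy to inspect. The names `alphabet`, `window`, `x_all` and `y_all` are illustrative, and letters are assumed to be indexed alphabetically.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz"
window = 3  # plays the same role as time_step in the article

eye = np.eye(len(alphabet), dtype=np.float32)       # row i is the one-hot vector of letter i
indices = [alphabet.index(ch) for ch in alphabet]   # integer index of every letter in the data

starts = range(len(alphabet) - window)              # 26 - 3 = 23 windows per pass over the data
x_all = np.stack([eye[indices[i:i + window]] for i in starts])  # inputs: 3 letters each
y_all = np.stack([eye[indices[i + window]] for i in starts])    # targets: the following letter

print(x_all.shape, y_all.shape)
print(x_all[0].argmax(axis=1), y_all[0].argmax())   # first sample: abc -> d
```

One pass over the alphabet yields 23 samples, which is not a multiple of the batch size 5, so the generator's batches straddle the boundary between passes.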

Build and train the model

model = Sequential()

# The LSTM layer has 128 output units
# input_shape sets the shape of each input sample
# time_step is the number of words in a sentence
# word_len is the number of dimensions used to represent a word, 26 here

model.add(LSTM(128, input_shape=(time_step, word_len)))
model.add(Dense(word_len, activation='softmax'))  # softmax over 26 classes: predicts which letter comes next

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

# Note: fit_generator is deprecated in recent Keras/TensorFlow releases, where model.fit accepts a generator directly
model.fit_generator(generator=genarate_data(batch_size=5, genarate_num=-1), epochs=50, steps_per_epoch=10)
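# steps_per_epoch is the number of batches run in one epoch.
# batch_size is the number of samples in one batch.
# So batch_size * steps_per_epoch should equal the number of samples trained in one epoch.
# (The author notes that this does not match the observed counter. A likely reason is that
# fit_generator prefetches batches into a queue, so the generator runs ahead of training.)
# Set epochs to 1 or 2 and print the sample counter inside genarate_data to observe the total number of samples generated.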
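The batch arithmetic in the comments above can be simulated in plain Python. This is a sketch under stated assumptions, not the Keras internals: `index_batches` is a hypothetical stand-in for `genarate_data` that yields window indices instead of one-hot arrays, and exactly `steps_per_epoch` batches are drawn from it.

```python
from itertools import islice

def index_batches(num_windows=23, batch_size=5):
    """Cycle over window indices forever, yielding lists of batch_size indices."""
    batch = []
    while True:
        for i in range(num_windows):
            batch.append(i)
            if len(batch) == batch_size:
                yield batch
                batch = []

steps_per_epoch = 10
epoch = list(islice(index_batches(), steps_per_epoch))  # the batches one epoch consumes
samples_per_epoch = sum(len(b) for b in epoch)

print(samples_per_epoch)  # batch_size * steps_per_epoch
print(epoch[4])           # a batch that spans two passes over the data
```

Under these assumptions one epoch consumes 5 * 10 = 50 samples, a little over two passes through the 23 windows. Any extra samples seen by a counter inside the generator would come from batches requested ahead of time, not from training itself.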

Predict with the trained model:

result = model.predict(np.array([words_2_one_hot('bcd')]))

print(one_hot_2_words(result))

The predicted result is:

e
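The prediction is a 26-way softmax vector, and decoding it means taking the index with the highest probability. The sketch below shows that step, plus a top-k variant, on a hand-made probability vector rather than real model output. The numbers in `probs` are invented for illustration, `decode_top_k` is a hypothetical helper, and letters are assumed to be indexed alphabetically.

```python
import numpy as np

alphabet = "abcdefghijklmnopqrstuvwxyz"

def decode_top_k(probabilities, k=3):
    """Return the k most probable letters with their probabilities, best first."""
    order = np.argsort(probabilities)[::-1][:k]  # indices sorted by descending probability
    return [(alphabet[i], float(probabilities[i])) for i in order]

# Made-up softmax output: most of the mass on 'e', a little on 'f'
probs = np.full(len(alphabet), 0.01)
probs[alphabet.index("e")] = 0.70
probs[alphabet.index("f")] = 0.06

print(alphabet[int(np.argmax(probs))])  # single best guess
print(decode_top_k(probs, k=2))         # best two guesses
```

Looking at the top few candidates instead of only the arg max is a quick way to see how confident the model is about its answer.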

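Supplement: reproducing the training set's one-hot features on the test and prediction sets

Data processing sometimes calls for one-hot encoding. With pandas' get_dummies, the one-hot columns produced for the training set can differ from those produced for the test or prediction set, because the columns depend on the categories present in each DataFrame. A more reliable approach is scikit-learn's OneHotEncoder, which is fitted once on the training set and then reused.

Code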

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore')  # unseen categories are encoded as all zeros
data_train = pd.DataFrame({'occupation': ['data mining engineer', 'database developer', 'data analyst', 'data analyst'],
                           'hometown': ['Fuzhou', 'Xiamen', 'Quanzhou', 'Longyan']})
ohe.fit(data_train)  # learn the categories from the training set
# get_feature_names_out requires scikit-learn >= 1.0; older versions used get_feature_names
feature_names = ohe.get_feature_names_out(data_train.columns)  # names of the encoded features
data_train_onehot = pd.DataFrame(ohe.transform(data_train).toarray(), columns=feature_names)  # apply the rules to the training set

data_new = pd.DataFrame({'occupation': ['data mining engineer', 'Java engineer'],
                         'hometown': ['Fuzhou', 'Putian']})
data_new_onehot = pd.DataFrame(ohe.transform(data_new).toarray(), columns=feature_names)  # apply the rules to the prediction set
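For comparison, the get_dummies mismatch mentioned at the start of this section is easy to reproduce. The sketch below uses small made-up frames (`train` and `new` are illustrative, not the article's data) and also shows one common workaround, reindexing to the training columns, as an alternative to OneHotEncoder.

```python
import pandas as pd

train = pd.DataFrame({"city": ["Fuzhou", "Xiamen", "Quanzhou"]})
new = pd.DataFrame({"city": ["Fuzhou", "Putian"]})  # 'Putian' never appears in train

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

# Each call only sees its own categories, so the column sets differ
print(list(train_dummies.columns))
print(list(new_dummies.columns))

# Workaround: force the new frame onto the training columns.
# Unseen categories are dropped; missing ones are filled with 0.
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0).astype(int)
print(new_aligned)
```

After reindexing, the row for the unseen category is all zeros, which matches what `handle_unknown='ignore'` produces in OneHotEncoder.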
