Python Machine Learning: NLP Basics, Classifying JD.com (京東) Reviews

Updated: 2021-10-18 15:18:26  Author: 我是小白呀

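Natural language processing (NLP) is an important area of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language.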

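Overview

Starting today, we begin a journey through natural language processing (NLP). NLP lets us process, understand, and use human language, building a bridge between machine language and human language.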

RNN

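RNN stands for Recurrent Neural Network. Compared with a CNN, an RNN is better suited to sequential data and to capturing the dependencies between earlier and later elements of a sequence. In NLP tasks, how likely a word is depends strongly on its context. For example, the sentence "明天天氣真好" ("the weather will be really nice tomorrow") is far more probable than "明天天氣籃球" ("tomorrow weather basketball").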

Weight sharing

Traditional neural network:

[Figure: traditional neural network]

RNN:

[Figure: RNN]

Weight sharing in an RNN is similar to weight sharing in a CNN: the same weights are reused at every time step, which greatly reduces the number of parameters.
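To make the weight-sharing idea concrete, here is a minimal pure-Python sketch of a recurrent layer. The names (`rnn_forward`, `count_params`, `W_x`, `W_h`, `b`) and the tiny sizes are illustrative assumptions, not taken from the article. The point is only that the same three parameter arrays are reused at every time step, so the parameter count does not grow with the sequence length.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_forward(xs, W_x, W_h, b):
    """Run a simple tanh RNN over the sequence xs and return all hidden states.

    The same W_x, W_h and b are reused at every time step (weight sharing).
    """
    h = [0.0] * len(b)              # initial hidden state
    states = []
    for x in xs:                    # one iteration per time step
        pre = [wx + wh + bi
               for wx, wh, bi in zip(matvec(W_x, x), matvec(W_h, h), b)]
        h = [math.tanh(v) for v in pre]
        states.append(h)
    return states

def count_params(W_x, W_h, b):
    """Total number of scalar parameters in the layer."""
    return sum(len(r) for r in W_x) + sum(len(r) for r in W_h) + len(b)

# Hypothetical toy sizes: input dimension 3, hidden dimension 2
W_x = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]]
W_h = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.1]

short_seq = [[1.0, 0.0, 0.0]] * 2
long_seq = [[1.0, 0.0, 0.0]] * 50

print(len(rnn_forward(short_seq, W_x, W_h, b)))  # one hidden state per step: 2
print(len(rnn_forward(long_seq, W_x, W_h, b)))   # one hidden state per step: 50
print(count_params(W_x, W_h, b))                 # 12 in both cases
```

A fully connected network that consumed the whole sequence at once would instead need a separate block of weights for every position.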

Computation

[Figure: RNN computation]

Computing the state:

[Figure: computing the state]

Computing the output:

[Figure: computing the output]
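The original figures for these two steps are not reproduced in this text. As a reference point, a common textbook formulation of a simple (Elman) RNN is given below. The symbols $W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$ and $b_y$ are assumed notation and may differ from the notation in the article's figures.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h\right) \\
y_t &= \operatorname{softmax}\left(W_{hy}\,h_t + b_y\right)
\end{aligned}
```

Here $x_t$ is the input at time step $t$, $h_t$ is the state, and $y_t$ is the output. The same weight matrices appear at every $t$.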

LSTM

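LSTM stands for Long Short-Term Memory. An LSTM is a special kind of RNN designed to mitigate the vanishing-gradient and exploding-gradient problems that arise when training on long sequences. Compared with a plain RNN, an LSTM performs better on longer sequences. Whereas an RNN passes along a single state h_t, an LSTM passes along two: c_t (cell state) and h_t (hidden state).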

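Stages

An LSTM uses gates to control how state is passed along.

Each LSTM step has three stages:

Forget stage: selectively forgets parts of the state passed in from the previous step
Selective-memory stage: selectively memorizes the input of the current step. Important information is recorded in detail, less important information is recorded less
Output stage: decides what is emitted as the output of the current state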
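
The three stages can be sketched as a single LSTM step. This is a minimal sketch that uses the standard textbook gate equations with a scalar (one-unit) state for readability. The function name `lstm_step`, the layout of the parameter dictionary, and the parameter values are assumptions for illustration, not code from the article.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a one-unit LSTM. p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))       # forget stage: how much of c_prev to keep
    i = sigmoid(pre("input"))        # selective-memory stage: how much to write
    g = math.tanh(pre("candidate"))  # candidate content to write
    o = sigmoid(pre("output"))       # output stage: how much of the cell to expose

    c = f * c_prev + i * g           # new cell state c_t
    h = o * math.tanh(c)             # new hidden state h_t
    return h, c

# Hypothetical parameters: forget gate biased open, input gate biased shut
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.8, p=params)
print(c)  # stays very close to the previous cell state 0.8
print(h)
```

With the forget gate almost fully open and the input gate almost closed, the cell state is carried through nearly unchanged, which is how an LSTM can keep information over many steps.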

Data

About 30,000 reviews, labeled as positive or negative.

Positive reviews:

0 做父母一定要有劉墉這樣的心態，不斷地學習，不斷地進步，不斷地給自己補充新鮮血液，讓自己保持一...
1 作者真有英國人嚴謹的風格，提出觀點、進行論述論證，盡管本人對物理學了解不深，但是仍然能感受到...
2 作者長篇大論借用詳細報告數據處理工作和計算結果支持其新觀點。為什么荷蘭曾經縣有歐洲最高的生產...
3 作者在戰幾時之前用了＂擁抱＂令人叫絕．日本如果沒有戰敗，就有會有美軍的占領，沒胡官僚主義的延...
4 作者在少年時即喜閱讀，能看出他精讀了無數經典，因而他有一個龐大的內心世界。他的作品最難能可貴...
5 作者有一種專業的謹慎，若能有幸學習原版也許會更好，簡體版的書中的印刷錯誤比較多，影響學者理解...
6 作者用詩一樣的語言把如水般清澈透明的思想娓娓道來，像一個經驗豐富的智慧老人為我們解開一個又一...
7 作者提出了一種工作和生活的方式，作為咨詢界的元老，不僅能提出理念，而且能夠身體力行地實踐，并...
8 作者妙語連珠，將整個60-70年代用層出不窮的搖滾巨星與自身故事緊緊相連什么是鄉愁？什么是搖...
9 作者邏輯嚴密，一氣呵成。沒有一句廢話，深入淺出，循循善誘，環環相扣。讓平日里看到指標圖釋就頭...

Negative reviews:

0 做為一本聲名在外的流行書，說的還是廣州的外企，按道理應該和我的生存環境差不多啊。但是一看之下...
1 作者有明顯的自戀傾向，只有有老公養不上班的太太們才能像她那樣生活。很多方法都不實用，還有抄襲...
2 作者完全是以一個過來的自認為是成功者的角度去寫這個問題，感覺很不客觀。雖然不是很喜歡，但是，...
3 作者提倡內調，不信任化妝品，這點贊同。但是所列舉的方法太麻煩，配料也不好找。不是太實用。
4 作者的文筆一般，觀點也是和市面上的同類書大同小異，不推薦讀者購買。
5 作者的文筆還行，但通篇感覺太瑣碎，有點文人的無病呻吟。自由主義者。作者的品性不敢茍同，無民族...
6 作者倒是個很小資的人,但有點自戀的感覺,書并沒有什么大幫助
7 作為一本描寫過去年代感情生活的小說，作者明顯生活經驗不足，并且文字功底極其一般，看后感覺浪費...
8 作為個人經驗在網上談談可以，但拿來出書就有點過了，書中還有些明顯的謬誤。不過文筆還不錯，建議...
9 昨天剛興奮地寫了評論,今天便遇一鬧心事,因把此套書推薦給很多朋友,朋友就拖我在網上購,結果前...

Code

Preprocessing

import numpy as np
import pandas as pd
import jieba


# Load the stop-word list (a set makes the membership test in pre_process fast)
stop_words = pd.read_csv("stopwords.txt", index_col=None, names=["stop_word"])
stop_words = set(stop_words["stop_word"].values.tolist())

def load_data():

    # Read the raw data
    neg = pd.read_excel("neg.xls", header=None)
    pos = pd.read_excel("pos.xls", header=None)

    # Debug output
    print(neg.head(10))
    print(pos.head(10))

    # Combine: positive reviews get label 1, negative reviews get label 0
    x = np.concatenate((pos[0], neg[0]))
    y = np.concatenate((np.ones(len(pos), dtype=int), np.zeros(len(neg), dtype=int)))

    # Build the DataFrame
    data = pd.DataFrame({"content": x, "label": y})
    print(data.head())

    # Save (the row index is written too and later shows up as the "Unnamed: 0" column)
    data.to_csv("data.csv")

def pre_process(text):

    # Word segmentation with jieba
    text = jieba.lcut(text)

    # Drop purely numeric tokens
    text = [w for w in text if not str(w).isdigit()]

    # Drop empty and whitespace-only tokens
    text = list(filter(lambda w: w.strip(), text))

    # # Drop single-character tokens
    # text = list(filter(lambda w: len(w) > 1, text))

    # Drop stop words
    text = list(filter(lambda w: w not in stop_words, text))

    return " ".join(text)

if __name__ == '__main__':

    # Build data.csv from the raw Excel files
    load_data()

    # Read the data
    data = pd.read_csv("data.csv")

    # Preprocess
    data["content"] = data["content"].apply(pre_process)

    # Save
    data.to_csv("processed.csv", index=False)
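
To see what the filtering steps in `pre_process` do without installing jieba, the sketch below applies the same three filters to a token list that has already been segmented. The token list and the stop-word set are made-up examples, and `filter_tokens` is a hypothetical helper, not part of the original script.

```python
def filter_tokens(tokens, stop_words):
    """Apply the same filters as pre_process to a pre-segmented token list."""
    # Drop purely numeric tokens
    tokens = [w for w in tokens if not str(w).isdigit()]
    # Drop empty and whitespace-only tokens
    tokens = [w for w in tokens if w.strip()]
    # Drop stop words
    tokens = [w for w in tokens if w not in stop_words]
    # Join with spaces so a whitespace tokenizer can split the text later
    return " ".join(tokens)

# Hypothetical segmentation result and stop-word set
tokens = ["這", "本", "書", " ", "2021", "的", "內容", "很", "好"]
stop_words = {"這", "的", "很"}

print(filter_tokens(tokens, stop_words))  # 本 書 內容 好
```

Joining the surviving tokens with spaces matters because the Keras `Tokenizer` used in the main program splits text on whitespace by default.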

Main program

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split


def tokenizer():

    # Read the preprocessed data
    data = pd.read_csv("processed.csv", index_col=False)
    print(data.head())

    # Convert to a tuple of texts
    X = tuple(data["content"])

    # Instantiate the tokenizer
    tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=30000)

    # Fit on the texts
    tokenizer.fit_on_texts(X)

    # Vocabulary (word -> index)
    word_index = tokenizer.word_index
    # print(word_index)
    print(len(word_index))

    # Convert texts to integer sequences
    sequence = tokenizer.texts_to_sequences(X)

    # Pad or truncate every sequence to length 100
    characters = tf.keras.preprocessing.sequence.pad_sequences(sequence, maxlen=100)

    # One-hot encode the labels
    labels = tf.keras.utils.to_categorical(data["label"])

    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(characters, labels, test_size=0.2,
                                                        random_state=0)

    return X_train, X_test, y_train, y_test


def main():

    # Load the tokenized data
    X_train, X_test, y_train, y_test = tokenizer()
    print(X_train[:5])
    print(y_train[:5])

    # Hyperparameters
    EMBEDDING_DIM = 200  # embedding dimension
    optimizer = tf.keras.optimizers.RMSprop()  # optimizer
    # Loss: the last layer already applies softmax, so the outputs are probabilities,
    # not logits (from_logits=True triggers the UserWarning seen in the training log)
    loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)

    # Model
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(30001, EMBEDDING_DIM),
        tf.keras.layers.LSTM(200, dropout=0.2, recurrent_dropout=0.2),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(2, activation="softmax")
    ])
    model.build(input_shape=[None, 100])  # sequence length matches maxlen=100 above
    print(model.summary())

    # Compile
    model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

    # Checkpoint: keep only the model with the best validation accuracy
    checkpoint = tf.keras.callbacks.ModelCheckpoint("model/jindong.h5py", monitor='val_accuracy', verbose=1,
                                                    save_best_only=True,
                                                    mode='max')

    # Train
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=32, callbacks=[checkpoint])


if __name__ == '__main__':
    main()
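
The integer matrix that the script prints comes from three steps: build a word index, map each text to a sequence of indices, and left-pad to a fixed length. The pure-Python sketch below mimics that behavior in simplified form: index 0 is reserved for padding, more frequent words get smaller indices, and padding and truncation happen on the left, which are the Keras defaults. It is an illustration, not the Keras implementation, and `build_word_index`, `texts_to_sequences`, `pad_left` and `one_hot` are hypothetical helpers.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer; frequent words get smaller indices (1-based)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its index; unknown words are skipped."""
    return [[word_index[w] for w in text.split() if w in word_index]
            for text in texts]

def pad_left(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

def one_hot(label, num_classes=2):
    """One-hot encode an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

texts = ["書 很 好", "書 不 好 不 推薦"]  # made-up, already segmented reviews
word_index = build_word_index(texts)
padded = [pad_left(s, maxlen=6) for s in texts_to_sequences(texts, word_index)]

print(word_index)
print(padded)      # [[0, 0, 0, 1, 4, 2], [0, 1, 3, 2, 3, 5]]
print(one_hot(1))  # [0.0, 1.0]
```

This is why short reviews appear as rows that start with a long run of zeros, and why a positive review (label 1) is encoded as `[0. 1.]`.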

Output:

Unnamed: 0 content label
0 0 做父母一定要有劉墉這樣的心態 不斷地學習不斷地進步不斷地給 ... 1
1 1 作者真有英國人嚴謹的風格提出觀點進行論述論證盡管本人對物理學了... 1
2 2 作者長篇大論借用詳細報告數據處理工作和計算結果支持其新觀點為什么荷... 1
3 3 作者在戰 幾時之前用了＂擁抱＂令人叫絕．日本如果沒有戰敗就 ... 1
4 4 作者在少年時即喜閱讀能看出他精讀了無數 經典因而他有一個龐大... 1
49366
[[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 205 1808 119 40 56 2139 1246 434 3594 1321 1715
9 165 15 22]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 1157 8 3018 1 62 851 34 4 23 455 365
46 239 1157 3903]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 1579 53 388 958 294 1146 18 1 49 1146 305
2365 1 496 235]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 213 4719 509
730 21403 524 42]
[ 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 105 159 1 5 16 11
24 2 299 294 8 39 306 16796 11 1778 29 2674
640 2 543 1820]]
[[0. 1.]
[0. 1.]
[1. 0.]
[1. 0.]
[1. 0.]]
2021-09-20 18:59:07.031583: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2021-09-20 18:59:07.031928: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-09-20 18:59:07.037546: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: DESKTOP-VVCH1JQ
2021-09-20 18:59:07.037757: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: DESKTOP-VVCH1JQ
2021-09-20 18:59:07.043925: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, None, 200) 6000200
_________________________________________________________________
lstm (LSTM) (None, 200) 320800
_________________________________________________________________
dropout (Dropout) (None, 200) 0
_________________________________________________________________
dense (Dense) (None, 64) 12864
_________________________________________________________________
dense_1 (Dense) (None, 2) 130
=================================================================
Total params: 6,333,994
Trainable params: 6,333,994
Non-trainable params: 0
_________________________________________________________________
None
2021-09-20 18:59:07.470578: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
Epoch 1/2
C:\Users\Windows\Anaconda3\lib\site-packages\tensorflow\python\keras\backend.py:4870: UserWarning: "`categorical_crossentropy` received `from_logits=True`, but the `output` argument was produced by a sigmoid or softmax activation and thus does not represent logits. Was this intended?"
'"`categorical_crossentropy` received `from_logits=True`, but '
528/528 [==============================] - 272s 509ms/step - loss: 0.3762 - accuracy: 0.8476 - val_loss: 0.2835 - val_accuracy: 0.8839

Epoch 00001: val_accuracy improved from -inf to 0.88391, saving model to model\jindong.h5py
2021-09-20 19:03:40.563733: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
Epoch 2/2
528/528 [==============================] - 299s 566ms/step - loss: 0.2069 - accuracy: 0.9266 - val_loss: 0.2649 - val_accuracy: 0.9005

Epoch 00002: val_accuracy improved from 0.88391 to 0.90050, saving model to model\jindong.h5py
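
The parameter counts in the model summary can be checked by hand. The sketch below recomputes them from the layer sizes used in the script, using the standard formulas for embedding, LSTM and dense layers. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four weight sets (forget, input, candidate, output), each with
    # input weights, recurrent weights and a bias
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

counts = {
    "embedding": embedding_params(30001, 200),
    "lstm": lstm_params(200, 200),
    "dense": dense_params(200, 64),
    "dense_1": dense_params(64, 2),
}

for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", sum(counts.values()))  # total: 6333994
```

The total matches the 6,333,994 reported in the summary, and almost all of it (about 6 million) sits in the embedding layer.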