A Detailed Guide to Chinese Word Segmentation Tools in Python

Updated: 2024-10-24 10:30:11  Author: matrixlzp

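This article walks through how to use several Chinese word segmentation tools in Python, with complete sample code for each. It may be a useful reference if you need to segment Chinese text.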

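I. Scenario

When scraping map POI data, we often end up with a large amount of Chinese address text, for example 【廈門大學附屬中山醫院】 (Zhongshan Hospital Affiliated to Xiamen University). Before such text can be analyzed further, it needs to be segmented into words.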

II. Trying Out Chinese Word Segmentation Libraries

1. jieba ("Jieba" Chinese word segmentation)

pip install jieba

test1.py is as follows:

import jieba

text = "廈門大學附屬中山醫院"
# jieba.cut returns a generator, so convert it to a list before printing
words = jieba.cut(text)
print(list(words))

Run it:

py test1.py

2. SnowNLP

pip install snownlp

test2.py is as follows:

from snownlp import SnowNLP

text = "廈門大學附屬中山醫院"
s = SnowNLP(text)
# The words attribute holds the segmented words
words = s.words
print(words)

Run it:

py test2.py

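3. thulac (a Chinese lexical analysis toolkit developed by the Natural Language Processing and Computational Social Science Lab at Tsinghua University)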

pip install thulac

test3.py is as follows:

import thulac

# The default mode performs both segmentation and part-of-speech tagging
thu = thulac.thulac()
text = "廈門大學附屬中山醫院"
result = thu.cut(text)
print(result)

Run it:

py test3.py

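III. Summary

In this trial, all three libraries segmented the sample entry accurately.

thulac's output is more complex than the others because it also includes part-of-speech tags.

jieba's output is the simplest and reads closest to natural language.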
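Because the three libraries return results in different shapes, it can help to normalize them before comparing. The following is a minimal sketch, not part of the original scripts: `to_word_list` is a hypothetical helper, and it assumes that thulac's default `cut()` returns `[word, tag]` pairs while jieba and SnowNLP yield plain strings. The sample inputs are hand-written placeholders in those shapes, not real library output.

```python
from typing import Iterable, List, Sequence, Union

# A token is either a plain string or a [word, tag] pair (assumed thulac shape).
Token = Union[str, Sequence[str]]


def to_word_list(result: Iterable[Token]) -> List[str]:
    """Flatten a segmenter result into a plain list of non-empty words."""
    words = []
    for token in result:
        # For [word, tag] pairs keep only the word; strings pass through.
        word = token if isinstance(token, str) else token[0]
        word = word.strip()
        if word:
            words.append(word)
    return words


# Hand-written samples in each output shape (not real library output).
plain_tokens = iter(["中文", "分詞"])            # generator-like, as from jieba.cut
tagged_tokens = [["中文", "n"], ["分詞", "v"]]   # [word, tag] pairs

print(to_word_list(plain_tokens))
print(to_word_list(tagged_tokens))
```

With the real libraries, the same helper could be given `jieba.cut(text)`, `SnowNLP(text).words`, or `thu.cut(text)` directly.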

IV. A Practical Example

Read a batch of Chinese entries from a txt file, segment each one, and write the resulting words to an Excel file.

test.py is as follows:

import jieba
from openpyxl import Workbook

# Create a new workbook
wb = Workbook()
# Select the default active worksheet
ws = wb.active

# Write the header cell
ws['A1'] = '分詞'

# Read the input file, one entry per line
input_path = r"C:\Users\Administrator\Desktop\py\split words\demo\address.txt"
with open(input_path, 'r', encoding='utf-8') as input_file:
    for line in input_file:
        word = line.strip()
        print("---------" + word)
        for item in jieba.cut(word):
            token = item.strip()
            print(token)
            # Each segmented word goes into its own row in column A
            ws.append([token])

# The with block closes the file, so no explicit close() call is needed
# Save the workbook
wb.save('output.xlsx')
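The script above mixes reading, segmenting, and writing in one loop. As a rough sketch of one way to separate those steps, the segmentation can be pulled into a generator that takes the segmenter as a parameter. The names `segment_lines` and `whitespace_cut` are hypothetical and not from the original script; `whitespace_cut` is only a stand-in so the example runs without jieba installed.

```python
from typing import Callable, Iterable, Iterator, Tuple


def segment_lines(
    lines: Iterable[str],
    cut: Callable[[str], Iterable[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (entry, token) pairs for every non-empty line."""
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines in the input file
        for token in cut(entry):
            token = token.strip()
            if token:
                yield entry, token


def whitespace_cut(text: str) -> Iterable[str]:
    """Stand-in segmenter for this demo; splits on whitespace only."""
    return text.split()


rows = list(segment_lines(["alpha beta\n", "\n", "gamma\n"], whitespace_cut))
print(rows)
```

In the real script, `jieba.cut` would be passed as `cut`, the open file object as `lines`, and each yielded pair could be written with `ws.append(row)`, which also keeps the source entry next to each word in the spreadsheet.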

address.txt is as follows: