Python統(tǒng)計純文本文件中英文單詞出現(xiàn)個數(shù)的方法總結(jié)【測試可用】
本文實例講述了Python統(tǒng)計純文本文件中英文單詞出現(xiàn)個數(shù)的方法。分享給大家供大家參考,具體如下:
第一版: 效率低
# -*- coding:utf-8 -*- #!python3 path = 'test.txt' with open(path,encoding='utf-8',newline='') as f: word = [] words_dict= {} for letter in f.read(): if letter.isalnum(): word.append(letter) elif letter.isspace(): #空白字符 空格 \t \n if word: word = ''.join(word).lower() #轉(zhuǎn)小寫 if word not in words_dict: words_dict[word] = 1 else: words_dict[word] += 1 word = [] #處理最后一個單詞 if word: word = ''.join(word).lower() # 轉(zhuǎn)小寫 if word not in words_dict: words_dict[word] = 1 else: words_dict[word] += 1 word = [] for k,v in words_dict.items(): print(k,v)
運行結(jié)果:
we 4
are 1
busy 1
all 1
day 1
like 1
swarms 1
of 6
flies 1
without 1
souls 1
noisy 1
restless 1
unable 1
to 1
hear 1
the 7
voices 1
soul 1
as 1
time 1
goes 1
by 1
childhood 1
away 2
grew 1
up 1
years 1
a 1
lot 1
memories 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence 1
regardless 1
shackles 1
mind 1
indulge 1
in 1
world 1
buckish 1
focus 1
on 1
beneficial 1
principle 1
lost 1
themselves 1
第二版:
缺點:遇到大文件要一次讀入內(nèi)存,性能不好
# -*- coding:utf-8 -*- #!python3 import re path = 'test.txt' with open(path,'r',encoding='utf-8') as f: data = f.read() word_reg = re.compile(r'\w+') #word_reg = re.compile(r'\w+\b') word_list = word_reg.findall(data) word_list = [word.lower() for word in word_list] #轉(zhuǎn)小寫 word_set = set(word_list) #避免重復(fù)查詢 # words_dict = {} # for word in word_set: # words_dict[word] = word_list.count(word) # 簡潔寫法 words_dict = {word: word_list.count(word) for word in word_set} for k,v in words_dict.items(): print(k,v)
運行結(jié)果:
on 1
also 1
souls 1
focus 1
soul 1
time 1
noisy 1
grew 1
lot 1
childish 1
like 1
voices 1
indulge 1
swarms 1
buckish 1
restless 1
we 4
hear 1
childhood 1
as 1
world 1
themselves 1
are 1
bottom 1
memories 1
the 7
of 6
flies 1
without 1
have 2
day 1
busy 1
to 1
eroded 1
regardless 1
unable 1
innocence 1
up 1
a 1
in 1
mind 1
goes 1
by 1
lost 1
principle 1
once 1
away 2
years 1
beneficial 1
all 1
shackles 1
第三版:
# -*- coding:utf-8 -*- #!python3 import re path = 'test.txt' with open(path, 'r', encoding='utf-8') as f: word_list = [] word_reg = re.compile(r'\w+') for line in f: #line_words = word_reg.findall(line) #比上面的正則更加簡單 line_words = line.split() word_list.extend(line_words) word_set = set(word_list) # 避免重復(fù)查詢 words_dict = {word: word_list.count(word) for word in word_set} for k, v in words_dict.items(): print(k, v)
運行結(jié)果:
childhood 1
innocence, 1
are 1
of 6
also 1
lost 1
We 1
regardless 1
noisy, 1
by, 1
on 1
themselves. 1
grew 1
lot 1
bottom 1
buckish, 1
time 1
childish 1
voices 1
once 1
restless, 1
shackles 1
world 1
eroded 1
As 1
all 1
day, 1
swarms 1
we 3
soul. 1
memories, 1
in 1
without 1
like 1
beneficial 1
up, 1
unable 1
away 1
flies 1
goes 1
a 1
have 2
away, 1
mind, 1
focus 1
principle, 1
hear 1
to 1
the 7
years 1
busy 1
souls, 1
indulge 1
第四版:使用Counter
統(tǒng)計
# -*- coding:utf-8 -*- #!python3 import collections import re path = 'test.txt' with open(path, 'r', encoding='utf-8') as f: word_list = [] word_reg = re.compile(r'\w+') for line in f: line_words = line.split() word_list.extend(line_words) words_dict = dict(collections.Counter(word_list)) #使用Counter統(tǒng)計 for k, v in words_dict.items(): print(k, v)
運行結(jié)果:
We 1
are 1
busy 1
all 1
day, 1
like 1
swarms 1
of 6
flies 1
without 1
souls, 1
noisy, 1
restless, 1
unable 1
to 1
hear 1
the 7
voices 1
soul. 1
As 1
time 1
goes 1
by, 1
childhood 1
away, 1
we 3
grew 1
up, 1
years 1
away 1
a 1
lot 1
memories, 1
once 1
have 2
also 1
eroded 1
bottom 1
childish 1
innocence, 1
regardless 1
shackles 1
mind, 1
indulge 1
in 1
world 1
buckish, 1
focus 1
on 1
beneficial 1
principle, 1
lost 1
themselves. 1
注:這里使用的測試文本test.txt如下:
We are busy all day, like swarms of flies without souls, noisy, restless, unable to hear the voices of the soul. As time goes by, childhood away, we grew up, years away a lot of memories, once have also eroded the bottom of the childish innocence, we regardless of the shackles of mind, indulge in the world buckish, focus on the beneficial principle, we have lost themselves.
PS:這里再為大家推薦2款相關(guān)統(tǒng)計工具供大家參考:
在線字數(shù)統(tǒng)計工具:
http://tools.jb51.net/code/zishutongji
在線字符統(tǒng)計與編輯工具:
http://tools.jb51.net/code/char_tongji
更多關(guān)于Python相關(guān)內(nèi)容感興趣的讀者可查看本站專題:《Python文件與目錄操作技巧匯總》、《Python文本文件操作技巧匯總》、《Python數(shù)據(jù)結(jié)構(gòu)與算法教程》、《Python函數(shù)使用技巧總結(jié)》、《Python字符串操作技巧匯總》及《Python入門與進階經(jīng)典教程》
希望本文所述對大家Python程序設(shè)計有所幫助。
相關(guān)文章
Python使用base64模塊進行二進制數(shù)據(jù)編碼詳解
這篇文章主要介紹了Python使用base64模塊進行二進制數(shù)據(jù)編碼詳解,具有一定借鑒價值,需要的朋友可以參考下2018-01-01Python 2.x如何設(shè)置命令執(zhí)行的超時時間實例
這篇文章主要給大家介紹了關(guān)于Python 2.x如何設(shè)置命令執(zhí)行超時時間的相關(guān)資料,文中通過示例代碼介紹的非常詳細,對大家的學(xué)習或者工作具有一定的參考學(xué)習價值,需要的朋友可以參考借鑒,下面來一起看看吧。2017-10-10caffe的python接口繪制loss和accuracy曲線
這篇文章主要為大家介紹了caffe的python接口繪制loss和accuracy曲線示例詳解,有需要的朋友可以借鑒參考下,希望能夠有所幫助,祝大家多多進步,早日升職加薪2022-06-06Python入門教程(十八)Python的For循環(huán)
這篇文章主要介紹了Python入門教程(十八)Python的For循環(huán),Python是一門非常強大好用的語言,也有著易上手的特性,本文為入門教程,需要的朋友可以參考下2023-04-04