Batch-processing PDF documents with Python to count occurrences of custom keywords

Updated: 2023-04-11 11:54:12   Author: Ryo_Yuki
This article shows how to batch-process PDF documents with Python and print the number of occurrences of custom keywords. It includes detailed code examples for reference.

Function overview

The complete code is in the "Full code" section; this part only covers the approach and the function for each step.

Batch-renaming the files

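The file names are in Chinese and have no bearing on the final result, so the files are renamed in bulk to numbers.
Note: if this is not the first run, i.e. the files have already been renamed, comment out the call to this function in the main function.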

def rename():
    path='dealPdf'
    # Keep only regular files, so that subdirectories do not use up a number
    filelist=[f for f in os.listdir(path) if os.path.isfile(os.path.join(path,f))]
    for i,files in enumerate(filelist):
        Olddir=os.path.join(path,files)
        Newdir=os.path.join(path,str(i+1)+'.pdf')
        os.rename(Olddir,Newdir)
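
A variant of the renaming step that can be run more than once might look like the sketch below. It is only an illustration under stated assumptions: the name `rename_pdfs`, the `.tmp_` prefix and the decision to sort the names are hypothetical and do not come from the original code. Renaming in two passes keeps a file from being moved onto a numeric name that another file still holds.

```python
import os

def rename_pdfs(path):
    """Rename every .pdf in `path` to 1.pdf, 2.pdf, ... and return the count.

    Hypothetical helper (not part of the original script).
    """
    # Consider only regular files ending in .pdf; sort for a stable order
    names = sorted(
        name for name in os.listdir(path)
        if name.lower().endswith('.pdf')
        and os.path.isfile(os.path.join(path, name))
    )
    # Pass 1: move every file to a temporary name that cannot clash with N.pdf
    temp_names = []
    for i, name in enumerate(names, start=1):
        temp = '.tmp_{}.pdf'.format(i)
        os.rename(os.path.join(path, name), os.path.join(path, temp))
        temp_names.append(temp)
    # Pass 2: move the temporary names to their final numeric names
    for i, temp in enumerate(temp_names, start=1):
        os.rename(os.path.join(path, temp),
                  os.path.join(path, '{}.pdf'.format(i)))
    return len(temp_names)
```

The return value is the number of PDFs found, which could stand in for a hard-coded file count.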

Converting PDF to txt

A PDF cannot be analyzed as text directly, so its text first has to be written to a txt file (text inside images in the PDF cannot be extracted).

# Convert a PDF file to a txt file
def pdf_to_txt(dealPdf,index):
    # Suppress warnings
    logging.getLogger().setLevel(logging.ERROR)
    pdf_filename = dealPdf
    # Share one resource manager between the device and the interpreter
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    txt_filename=os.path.join('dealTxt',str(index)+'.txt')

    # Open the PDF in a with block so the file handle is closed afterwards
    with open(pdf_filename, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        # Check whether the document allows text extraction; raise if it does not
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        with open(txt_filename, 'w', encoding="utf-8") as fw:
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
                # Fetch the LTPage object for this page
                layout = device.get_result()
                # layout is an LTPage holding the objects parsed from the page,
                # typically LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
                # Text is read by calling get_text() on the text boxes
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):
                        results = x.get_text()
                        fw.write(results)

Removing line breaks from the txt

The txt exported from a PDF is broken into lines with newline characters. To keep words from being split across lines, all newline characters are removed.

# Remove the newline characters from a txt file
def delete_huanhangfu(dealTxt,index):
    outPutTxt=os.path.join('outPutTxt',str(index)+'.txt')
    with open(dealTxt,'r',encoding="utf-8") as f:
        # Strip the trailing \n from each line and join the lines into one string
        outPutString=''.join(line.rstrip('\n') for line in f)
    with open(outPutTxt,'w',encoding="utf-8") as fw:
        fw.write(outPutString)
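
Why this step matters can be shown on a small string. The sketch below is illustrative only: `remove_line_breaks` and the sample sentence are hypothetical and not taken from the original code. A keyword that straddles a line break is not found until the break is removed.

```python
def remove_line_breaks(text):
    """Join all lines of `text` into one string, dropping the line breaks."""
    # splitlines() also handles \r\n, which a plain [:-1] slice would leave as \r
    return ''.join(text.splitlines())

# The keyword is split across two lines, as can happen in text exported from a PDF
sample = '上市公司應當履行社會\n責任'
print('社會責任' in sample)                      # False
print('社會責任' in remove_line_breaks(sample))  # True
```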

Adding custom words

The words can be customized as needed; wordsByMyself, which supplies them, is a global variable.

Word segmentation and word-frequency counting

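Segment the text with jieba, load a general stop-word list and remove the stop words (this step can be skipped, since it has little effect on the final result), store each word with its number of occurrences as a key-value pair, and print the count for each keyword.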

# Segment the text and count word frequencies
def cut_and_count(outPutTxt):
    with open(outPutTxt,encoding='utf-8') as f:
        # step1: read the document and segment it with jieba
        text=f.read()
    words=jieba.lcut(text)
    # step2: load the stop-word list and drop the stop words
    with open('stopwords.txt',encoding='utf-8') as sf:
        stopwords = {line.rstrip() for line in sf}
    finalwords = []
    for word in words:
        if word not in stopwords:
            if (word != "。" and word != ",") :
                finalwords.append(word)

    # step3: count the occurrences of the custom keywords
    counts=dict.fromkeys(wordsByMyself,0)
    for word in finalwords:
        if len(word) == 1: # single-character tokens are not counted
            continue
        else:
            counts[word]=counts.get(word,0)+1 # add 1 each time a word occurs
    # Every keyword starts at 0, so each one is guaranteed to be in counts
    for keyword in wordsByMyself:
        print(keyword+':'+str(counts[keyword]))
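
The counting step on its own can also be written with `collections.Counter` from the standard library. This is a sketch, not the author's method: `count_keywords` is a hypothetical helper, and the token list stands in for what `jieba.lcut` would return after stop-word filtering.

```python
from collections import Counter

def count_keywords(tokens, keywords):
    """Return {keyword: occurrences} for each keyword, in the given order."""
    totals = Counter(tokens)  # counts every token in a single pass
    # A Counter returns 0 for missing keys, so absent keywords need no special case
    return {keyword: totals[keyword] for keyword in keywords}

# Stand-in for an already segmented and filtered document
tokens = ['上市', '公司', '履行', '社會責任', '公司']
print(count_keywords(tokens, ['社會責任', '義務', '上市', '公司']))
# {'社會責任': 1, '義務': 0, '上市': 1, '公司': 2}
```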

Main function

The batch processing is done with a for loop.

if __name__ == "__main__":
    #rename()
    word_by_myself() # register the custom words with jieba; once is enough
    for i in range(1,fileNum+1):
        pdf_to_txt(os.path.join('dealPdf',str(i)+'.pdf'),i) # convert the PDF to a txt file; pass in the file path
        delete_huanhangfu(os.path.join('dealTxt',str(i)+'.txt'),i) # remove newlines so words are not split across lines
        print(f'----------result {i}----------')
        cut_and_count(os.path.join('outPutTxt',str(i)+'.txt')) # segment and count word frequencies; pass in the file path
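
The loop above relies on a hard-coded file count. One way to avoid keeping that number in sync by hand is to read the numbers from the folder itself. The sketch below assumes the files are already named `N.pdf`; `numbered_pdfs` is a hypothetical helper that does not appear in the original code.

```python
import os
import re

def numbered_pdfs(path):
    """Return the sorted numbers N for which `path`/N.pdf exists."""
    numbers = []
    for name in os.listdir(path):
        # Accept only names made of digits followed by .pdf
        match = re.fullmatch(r'(\d+)\.pdf', name)
        if match and os.path.isfile(os.path.join(path, name)):
            numbers.append(int(match.group(1)))
    # Numeric sort, so that 10.pdf comes after 2.pdf
    return sorted(numbers)
```

With this helper the loop header could become `for i in numbered_pdfs('dealPdf'):`, leaving the loop body unchanged.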

Local file structure
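
Judging from the paths used in the code, the script expects three folders next to it (`dealPdf` for the source PDFs, `dealTxt` for the extracted text, `outPutTxt` for the text without line breaks) plus a `stopwords.txt` file. The sketch below prepares that layout; `prepare_folders` and its `base` parameter are hypothetical and not part of the original code.

```python
import os

def prepare_folders(base='.'):
    """Create the working folders if missing; report whether stopwords.txt exists."""
    for folder in ('dealPdf', 'dealTxt', 'outPutTxt'):
        # exist_ok=True makes the call harmless when the folder is already there
        os.makedirs(os.path.join(base, folder), exist_ok=True)
    return os.path.isfile(os.path.join(base, 'stopwords.txt'))
```

Calling it once before the main loop would avoid a missing-folder error when the first txt file is written.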

Full code

import jieba
import jieba.analyse
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfpage import PDFPage,PDFTextExtractionNotAllowed
import logging
import os

wordsByMyself=['社會責任','義務','上市','公司'] # custom keywords, global variable
fileNum=16 # total number of files to process

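# Rename all files in the folder to numbers, as the later steps expect
def rename():
    path='dealPdf'
    # Keep only regular files, so that subdirectories do not use up a number
    filelist=[f for f in os.listdir(path) if os.path.isfile(os.path.join(path,f))]
    for i,files in enumerate(filelist):
        Olddir=os.path.join(path,files)
        Newdir=os.path.join(path,str(i+1)+'.pdf')
        os.rename(Olddir,Newdir)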

# Convert a PDF file to a txt file
def pdf_to_txt(dealPdf,index):
    # Suppress warnings
    logging.getLogger().setLevel(logging.ERROR)
    pdf_filename = dealPdf
    # Share one resource manager between the device and the interpreter
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    txt_filename=os.path.join('dealTxt',str(index)+'.txt')

    # Open the PDF in a with block so the file handle is closed afterwards
    with open(pdf_filename, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        # Check whether the document allows text extraction; raise if it does not
        if not doc.is_extractable:
            raise PDFTextExtractionNotAllowed
        with open(txt_filename, 'w', encoding="utf-8") as fw:
            for page in PDFPage.create_pages(doc):
                interpreter.process_page(page)
                # Fetch the LTPage object for this page
                layout = device.get_result()
                # layout is an LTPage holding the objects parsed from the page,
                # typically LTTextBox, LTFigure, LTImage, LTTextBoxHorizontal, etc.
                # Text is read by calling get_text() on the text boxes
                for x in layout:
                    if isinstance(x, LTTextBoxHorizontal):
                        results = x.get_text()
                        fw.write(results)

# Remove the newline characters from a txt file
def delete_huanhangfu(dealTxt,index):
    outPutTxt=os.path.join('outPutTxt',str(index)+'.txt')
    with open(dealTxt,'r',encoding="utf-8") as f:
        # Strip the trailing \n from each line and join the lines into one string
        outPutString=''.join(line.rstrip('\n') for line in f)
    with open(outPutTxt,'w',encoding="utf-8") as fw:
        fw.write(outPutString)
            
# Add the custom words to jieba's dictionary
def word_by_myself():
    for word in wordsByMyself:
        jieba.add_word(word)

# Segment the text and count word frequencies
def cut_and_count(outPutTxt):
    with open(outPutTxt,encoding='utf-8') as f:
        # step1: read the document and segment it with jieba
        text=f.read()
    words=jieba.lcut(text)
    # step2: load the stop-word list and drop the stop words
    with open('stopwords.txt',encoding='utf-8') as sf:
        stopwords = {line.rstrip() for line in sf}
    finalwords = []
    for word in words:
        if word not in stopwords:
            if (word != "。" and word != ",") :
                finalwords.append(word)

    # step3: count the occurrences of the custom keywords
    counts=dict.fromkeys(wordsByMyself,0)
    for word in finalwords:
        if len(word) == 1: # single-character tokens are not counted
            continue
        else:
            counts[word]=counts.get(word,0)+1 # add 1 each time a word occurs
    # Every keyword starts at 0, so each one is guaranteed to be in counts
    for keyword in wordsByMyself:
        print(keyword+':'+str(counts[keyword]))

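# Main function
if __name__ == "__main__":
    rename() # comment this out after the first run, once the files are renamed
    word_by_myself() # register the custom words with jieba; once is enough
    for i in range(1,fileNum+1):
        pdf_to_txt(os.path.join('dealPdf',str(i)+'.pdf'),i) # convert the PDF to a txt file; pass in the file path
        delete_huanhangfu(os.path.join('dealTxt',str(i)+'.txt'),i) # remove newlines so words are not split across lines
        print(f'----------result {i}----------')
        cut_and_count(os.path.join('outPutTxt',str(i)+'.txt')) # segment and count word frequencies; pass in the file path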

Result preview
