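Extracting PDF Invoice Information with Python, Saving It to Excel, and Building an EXE Program: The Complete Process

Updated: 2022-11-17 09:55:26 Author: PromiseTo

I had used Python for data processing here and there before, and this time I ran into another small data-processing task. This article explains how to extract information from PDF invoices with Python, save it to an Excel file, and package the script as an EXE program, with example code for each step.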

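Preface

This article covers extracting information from PDF invoices, saving it to Excel, using Gooey (a tool that turns command-line programs into GUIs), and packaging a Python file as an exe program.

Background

Electronic invoices are increasingly common, and companies now issue invoices almost entirely without paper. At my current company, everyone files their own expense claims. The invoice information (invoice number and amount) has to be entered into the K3 system to start the approval workflow. The electronic invoice also has to be printed, pasted onto a claim sheet, and taken to a manager, who signs it and approves the K3 workflow. The paper documents then go to finance for approval. If finance finds that anything was filled in incorrectly, the workflow and the documents are sent back, and the manager's signature and approval have to be obtained all over again, which is quite a hassle.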

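Analysis

To reduce the number of rejected claims, let's first look at the problems in the expense process.

Downloaded invoice files have non-standard names that are hard to identify
For example: 04500160011131347550.pdf

Filling in K3 requires copying the invoice number and amount and selecting the invoice type
Each time, the PDF invoice has to be opened to copy the invoice number and amount

The invoice itself may contain errors, such as a wrong company name or taxpayer identification number
Each invoice has to be checked by hand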

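Optimization

The initial idea was to write a program that recognizes the invoice information and provides copy buttons, making it easy to paste the values into K3. It would automatically check whether the invoice information is complete and whether the company name and taxpayer identification number are correct, and report an error if not. It would also rename the file automatically and offer one-click printing.

However, since I am not yet very familiar with Python GUI programming, I changed the approach: extract the invoice information into Excel, copy the values and check the invoice for completeness in Excel myself, and rename each file to the form "issuing company + amount".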
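
The renaming step can be sketched as follows. This is a minimal illustration, not the author's code: the helper names `build_invoice_name` and `rename_invoice`, the underscore separator, and the two-decimal amount format are all assumptions made for the example.

```python
import os
import re
import tempfile


def build_invoice_name(company, amount):
    """Return a file name of the form '<company>_<amount>.pdf' (format is an assumption)."""
    # Drop characters that Windows does not allow in file names.
    safe_company = re.sub(r'[\\/:*?"<>|]', '', company).strip()
    return f'{safe_company}_{amount:.2f}.pdf'


def rename_invoice(path, company, amount):
    """Rename the PDF at `path` inside its own folder and return the new path."""
    new_path = os.path.join(os.path.dirname(path), build_invoice_name(company, amount))
    if os.path.exists(new_path):
        # Refuse to overwrite an existing file with the same company and amount.
        raise FileExistsError(new_path)
    os.rename(path, new_path)
    return new_path


if __name__ == '__main__':
    # Demonstrate on an empty placeholder file in a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, '04500160011131347550.pdf')
        open(src, 'wb').close()
        print(os.path.basename(rename_invoice(src, 'Example Trading Co', 128.5)))
```

Refusing to overwrite an existing target is a deliberate choice here: two invoices from the same company for the same amount would otherwise silently replace one another.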

Final result

The exe program

Renamed files

Implementation

Reading PDF invoices

Use pdfplumber, installed with pip install pdfplumber.

import os
import re

import pdfplumber


def re_text(bt, text):
    """Return the first match of pattern bt in text, cleaned up, or None."""
    m1 = re.search(bt, text)
    if m1 is not None:
        return re_block(m1[0])
    return None


def re_block(text):
    """Strip spaces and closing brackets, and normalize full-width colons."""
    return text.replace(' ', '').replace('　', '').replace('）', '').replace(')', '').replace('：', ':')


def get_pdf(dir_path):
    """Collect the paths of all .pdf files under dir_path, including subfolders."""
    pdf_file = []
    for root, sub_dirs, file_names in os.walk(dir_path):
        for name in file_names:
            if name.lower().endswith('.pdf'):
                filepath = os.path.join(root, name)
                pdf_file.append(filepath)
    return pdf_file


def read():
    # Raw string so the backslashes are not read as escapes; change this to your own folder
    filenames = get_pdf(r'C:\Users\Administrator\Desktop\a')
    for filename in filenames:
        print(filename)
        with pdfplumber.open(filename) as pdf:
            first_page = pdf.pages[0]
            # Guard against pages with no extractable text (e.g. scanned images)
            pdf_text = first_page.extract_text() or ''
            # The patterns are in Simplified Chinese, as printed on mainland China invoices
            if '发票' not in pdf_text:
                continue
            # print(pdf_text)
            print('--------------------------------------------------------')
            # Invoice type: ordinary electronic invoice or special invoice
            print(re_text(re.compile(r'[\u4e00-\u9fa5]+电子普通发票.*?'), pdf_text))
            t2 = re_text(re.compile(r'[\u4e00-\u9fa5]+专用发票.*?'), pdf_text)
            if t2:
                print(t2)
            # print(re_text(re.compile(r'发票代码(.*\d+)'), pdf_text))
            # Invoice number, issue date, buyer name, buyer taxpayer ID
            print(re_text(re.compile(r'发票号码(.*\d+)'), pdf_text))
            print(re_text(re.compile(r'开票日期(.*)'), pdf_text))
            print(re_text(re.compile(r'名\s*称\s*[:：]\s*([\u4e00-\u9fa5]+)'), pdf_text))
            print(re_text(re.compile(r'纳税人识别号\s*[:：]\s*([a-zA-Z0-9]+)'), pdf_text))
            # Total amount in figures
            price = re_text(re.compile(r'小写.*(.*[0-9.]+)'), pdf_text)

            print(price)
            # The last "name" field on the page is the issuing company
            company = re.findall(re.compile(r'名.*称\s*[:：]\s*([\u4e00-\u9fa5]+)'), pdf_text)
            if company:
                print(re_block(company[-1]))
            print('--------------------------------------------------------')


if __name__ == '__main__':
    read()

The code above recognizes the contents of PDF invoices and prints them. The complete program is left for you to build yourself using the rest of this article.
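
As a step toward the complete program, the printed fields can be collected into a dictionary and the buyer details checked automatically. The sketch below is an illustration under stated assumptions: `SAMPLE_TEXT` is made-up text in roughly the shape that `extract_text()` might return (real layouts vary), the regular expressions are simplified variants of the ones above, and the function names and dictionary keys are hypothetical.

```python
import re

# Made-up sample text; a real invoice page will be laid out differently.
SAMPLE_TEXT = (
    '某某增值税电子普通发票\n'
    '发票号码: 12345678\n'
    '开票日期: 2022年11月01日\n'
    '名 称: 示例科技有限公司\n'
    '纳税人识别号: 91440300TEST00001X\n'
    '(小写) ¥128.50\n'
    '名 称: 示例商贸有限公司\n'
    '纳税人识别号: 91440300TEST00002Y\n'
)

NAME_RE = r'名\s*称\s*[:：]\s*([\u4e00-\u9fa5]+)'
TAX_ID_RE = r'纳税人识别号\s*[:：]\s*([A-Za-z0-9]+)'


def first_group(pattern, text):
    """Return group 1 of the first match, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


def parse_invoice(text):
    """Collect the fields of one invoice page into a dict."""
    names = re.findall(NAME_RE, text)
    tax_ids = re.findall(TAX_ID_RE, text)
    amount = first_group(r'小写\D*([0-9]+(?:\.[0-9]+)?)', text)
    return {
        'number': first_group(r'发票号码\s*[:：]\s*(\d+)', text),
        'date': first_group(r'开票日期\s*[:：]\s*(\S+)', text),
        # Assumption: the buyer appears first on the page and the seller last.
        'buyer': names[0] if names else None,
        'buyer_tax_id': tax_ids[0] if tax_ids else None,
        'seller': names[-1] if names else None,
        'amount': float(amount) if amount else None,
    }


def buyer_problems(info, expected_name, expected_tax_id):
    """List the ways the buyer details differ from the expected ones."""
    problems = []
    if info['buyer'] != expected_name:
        problems.append('buyer name mismatch')
    if info['buyer_tax_id'] != expected_tax_id:
        problems.append('taxpayer ID mismatch')
    return problems


if __name__ == '__main__':
    info = parse_invoice(SAMPLE_TEXT)
    print(info)
    print(buyer_problems(info, '示例科技有限公司', '91440300TEST00001X'))
```

Returning a list of problems rather than a single flag makes it easy to write the result into an extra spreadsheet column later.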

Writing to Excel

Use xlwt to write Excel files, installed with pip install xlwt. A simple example:

import xlwt

# Create a workbook
wb = xlwt.Workbook()
# Create a sheet
sh = wb.add_sheet('sheet 1')
# Write a value at row 0, column 1 ('姓名' means "name")
sh.write(0, 1, '姓名')
# Save the file
wb.save('test.xls')
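
Extending the simple example, several invoices can be written as a header row followed by one row per invoice. This is a sketch only: it assumes each invoice has already been collected into a dict, and the keys, column headers, and the helper names `build_rows` and `save_rows` are hypothetical choices for the example.

```python
HEADERS = ['Invoice number', 'Date', 'Seller', 'Amount']
KEYS = ['number', 'date', 'seller', 'amount']


def build_rows(invoices):
    """Turn a list of invoice dicts into a header row plus one row per invoice."""
    rows = [HEADERS]
    for inv in invoices:
        # Missing fields become empty cells instead of raising KeyError.
        rows.append([inv.get(key, '') for key in KEYS])
    return rows


def save_rows(rows, path):
    """Write the rows to an .xls file with xlwt (pip install xlwt)."""
    import xlwt  # imported here so build_rows works even without xlwt installed

    wb = xlwt.Workbook()
    sh = wb.add_sheet('invoices')
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            sh.write(r, c, value)
    wb.save(path)


if __name__ == '__main__':
    invoices = [
        {'number': '12345678', 'date': '2022-11-01',
         'seller': 'Example Trading Co', 'amount': 128.5},
    ]
    save_rows(build_rows(invoices), 'invoices.xls')
```

Keeping the row-building step separate from the file-writing step means the table contents can be inspected or tested without touching the disk.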

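Creating a graphical interface

Use Gooey to create the GUI, installed with pip install Gooey

Project page: https://github.com/chriskiehl/Gooey (15.4k stars at the time of writing)

A note on where Gooey fits. Gooey is meant for giving command-line tools a graphical front end, that is, programs that only take input (through various input and chooser widgets) and produce output. It is not suited to building display-oriented interfaces, and custom controls such as buttons cannot be added. Anything written with print is shown in the GUI window.

A simple example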

from gooey import Gooey, GooeyParser

@Gooey(program_name="简单的实例")  # window title: "A simple example"
def main():
    parser = GooeyParser(description="第一个示例!")  # "The first example!"
    parser.add_argument('文件路径', widget="FileChooser")  # file chooser ("file path")
    parser.add_argument('日期', widget="DateChooser")  # date chooser ("date")
    args = parser.parse_args()  # receive the values entered in the GUI
    print(args)

if __name__ == '__main__':
    main()

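Packaging as an exe file

Use pyinstaller to package the code as an exe file

Install it with pip install pyinstaller

Package with pyinstaller -F xxxxx.py -w (replace xxxxx.py with the actual .py file name)

Wait for packaging to finish. A dist folder is generated in the code directory, and the exe program is inside it.

Packaging complete

Note: if your program prints Chinese text, read the following appendix so that the packaged program does not fail at run time.

Appendix: fixing UnicodeDecodeError: 'utf-8' codec can't decode when a Gooey program packaged as an exe prints Chinese

Problem

When using Gooey to generate a GUI, everything worked in testing before packaging. After packaging as an exe, however, double-clicking the exe, filling in the required options, and running it raised UnicodeDecodeError: 'utf-8' codec can't decode.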

PS C:\Users\faces\Desktop\gooey demo\dist> .\auto.exe
Exception in thread Thread-1:
Traceback (most recent call last):
File "threading.py", line 926, in _bootstrap_inner
File "threading.py", line 870, in run
File "site-packages\gooey\gui\processor.py", line 71, in _forward_stdout
File "site-packages\gooey\gui\processor.py", line 84, in _extract_progress
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 13: invalid start byte

Searching the issues filed on GitHub showed that the problem is caused by the encoding of the packaged environment differing from the encoding used at packaging time.
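
The mismatch can be reproduced outside Gooey. The sketch below assumes the packaged program emits its output as cp936 bytes (the usual code page on Chinese-language Windows) while the reader decodes them as UTF-8; the helper name `try_decode` is made up for the example.

```python
text = '发票'

# Bytes as a program running under a cp936 console would typically emit them.
raw = text.encode('cp936')


def try_decode(data, encoding):
    """Decode data with the given encoding, reporting failure instead of raising."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f'failed: {exc.reason}'


if __name__ == '__main__':
    print(try_decode(raw, 'utf-8'))   # the reader guesses the wrong encoding
    print(try_decode(raw, 'cp936'))   # the reader uses the writer's encoding
```

Decoding succeeds only when the reader uses the same encoding as the writer, which is why telling Gooey the actual encoding resolves the error.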

Solution

Add the keyword argument encoding='cp936' to the Gooey decorator

from gooey import Gooey, GooeyParser
@Gooey(encoding='cp936')
def main():
    parser = GooeyParser(description="Export music lite") 
    parser.add_argument('exe文件', widget="FileChooser")
    parser.add_argument('flac文件夹', widget="DirChooser")
    parser.add_argument('MP3导出文件夹', widget="DirChooser")
    args = parser.parse_args()
    print(args)
if __name__ == "__main__":
    main()

Problem

Next I wanted to switch the GUI generated by Gooey to Chinese, so the code became

from gooey import Gooey, GooeyParser
@Gooey(encoding='cp936', language='chinese')
def main():
    parser = GooeyParser(description="Export music lite") 
    ...

Running it now raises the following error

PS C:\Users\faces\Desktop\gooey demo\dist> .\auto.exe
Traceback (most recent call last):
File "auto.py", line 17, in <module>
File "site-packages\gooey-1.0.3-py3.7.egg\gooey\python_bindings\gooey_decorator.py", line 87, in inner2
File "auto.py", line 12, in main
File "site-packages\gooey-1.0.3-py3.7.egg\gooey\python_bindings\gooey_parser.py", line 114, in parse_args
File "site-packages\gooey-1.0.3-py3.7.egg\gooey\python_bindings\gooey_decorator.py", line 82, in run_gooey
File "site-packages\gooey-1.0.3-py3.7.egg\gooey\gui\application.py", line 21, in run
File "site-packages\gooey-1.0.3-py3.7.egg\gooey\gui\application.py", line 28, in build_app
File "site-packages\gooey-1.0.3-py3.7.egg\gooey\gui\lang\i18n.py", line 24, in load
File "json\__init__.py", line 293, in load
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa7 in position 20: illegal multibyte sequence

Solution

The traceback shows that the language file is being loaded with encoding='cp936', which is what causes the error. A quick and dirty fix is to edit the file site-packages\gooey-1.0.3-py3.7.egg\gooey\gui\lang\i18n.py

Find

    with io.open(os.path.join(language_dir, json_file), 'r', encoding=encoding) as f:
      _DICTIONARY = json.load(f)

Change it to

    with io.open(os.path.join(language_dir, json_file), 'r', encoding='utf-8') as f:
      _DICTIONARY = json.load(f)
