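Extracting PDF E-Invoice Information into an Excel Spreadsheet with Python

Updated: 2025-05-28 10:55:18  Author: 平安喜樂-開開心心

This article walks through how to use Python to extract information from PDF e-invoices and save it to an Excel spreadsheet. The sample code is explained in detail, and interested readers are welcome to follow along.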

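Application scenarios

An e-invoice information extraction system is mainly useful in the following scenarios:

Corporate finance departments: large volumes of e-invoices have to be processed, with key fields (invoice code, invoice number, amounts, and so on) extracted and entered into the finance system.

Accounting firms: during audits or bookkeeping, information has to be pulled from many invoices for analysis.

Expense reimbursement: when employees submit e-invoices for reimbursement, the system extracts the information automatically, which reduces manual data-entry errors.

Archive management: when e-invoices are classified, archived, and searched, the extracted fields can serve as an index.

Data analysis: data extracted from many invoices can feed spending analysis, tax planning, and similar work.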

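Interface design

The system uses a graphical interface made up of the following parts:

File selection area: "Select files" and "Select folder" buttons let the user pick e-invoice files in bulk.

File list area: shows the selected files and supports multiple selection.

Processing options area: the user can set the path and name of the output Excel file.

Progress area: a progress bar and a status line show the processing progress in real time.

Action buttons area: "Start processing", "Clear list", and "Exit" buttons.

The interface is kept simple, follows common usage habits, and gives the necessary prompts and feedback.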

Detailed code

import os
import re
import fitz  # PyMuPDF
import pandas as pd
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
import xml.etree.ElementTree as ET
import tkinter as tk
from tkinter import filedialog, messagebox, ttk
import threading
import logging
from datetime import datetime
 
# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    filename='invoice_extractor.log'
)
logger = logging.getLogger(__name__)
 
class InvoiceExtractor:
    def __init__(self):
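        # Initial configuration: extraction rules for each file type.
        # The regex literals stay in Chinese because they must match the labels
        # printed on the invoice (invoice code, invoice number, date, amount,
        # tax, total including tax).
        self.config = {
            'pdf': {
                'invoice_code_pattern': r'发票代码：(\d+)',
                'invoice_number_pattern': r'发票号码：(\d+)',
                'date_pattern': r'日期：(\d{4}年\d{1,2}月\d{1,2}日)',
                'amount_pattern': r'金额：￥(\d+\.\d{2})',
                'tax_pattern': r'税额：￥(\d+\.\d{2})',
                'total_pattern': r'价税合计：￥(\d+\.\d{2})'
            },
            # Only referenced by the commented-out XML sketch in extract_ofd_info.
            # Note: starts-with() and following-sibling:: are not part of the XPath
            # subset supported by xml.etree.ElementTree; a full XPath engine such
            # as lxml would be needed to evaluate these expressions.
            'ofd': {
                'invoice_code_xpath': './/TextObject[starts-with(text(), "发票代码")]/following-sibling::TextObject[1]',
                'invoice_number_xpath': './/TextObject[starts-with(text(), "发票号码")]/following-sibling::TextObject[1]',
                'date_xpath': './/TextObject[starts-with(text(), "日期")]/following-sibling::TextObject[1]',
                'amount_xpath': './/TextObject[starts-with(text(), "金额")]/following-sibling::TextObject[1]',
                'tax_xpath': './/TextObject[starts-with(text(), "税额")]/following-sibling::TextObject[1]',
                'total_xpath': './/TextObject[starts-with(text(), "价税合计")]/following-sibling::TextObject[1]'
            }
        }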
    
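    # Maps each config key (without its suffix) to the column name of the result row.
    # The original code wrote matches under keys such as 'invoice_code', which never
    # filled the real columns and raised KeyError in the OCR branch.
    FIELD_NAMES = {
        'invoice_code': 'Invoice Code',
        'invoice_number': 'Invoice Number',
        'date': 'Date',
        'amount': 'Amount',
        'tax': 'Tax',
        'total': 'Total (incl. tax)',
    }

    def _new_info(self, file_path):
        """Return an empty result row for one file."""
        info = {'File Path': file_path}
        info.update({name: '' for name in self.FIELD_NAMES.values()})
        return info

    def _fill_from_text(self, info, text):
        """Fill the still-empty fields of info using the PDF regex patterns."""
        for key, pattern in self.config['pdf'].items():
            field = self.FIELD_NAMES[key.replace('_pattern', '')]
            if not info[field]:
                match = re.search(pattern, text)
                if match:
                    info[field] = match.group(1)

    def extract_pdf_info(self, pdf_path):
        """Extract the fields of a PDF e-invoice."""
        try:
            info = self._new_info(pdf_path)

            # Read the embedded text layer with PyMuPDF
            with fitz.open(pdf_path) as doc:
                text = ""
                for page in doc:
                    text += page.get_text()

            # Extract the fields with the regular expressions
            self._fill_from_text(info, text)

            # Fall back to OCR if any field is still empty
            if not all(info.values()):
                images = convert_from_path(pdf_path)
                ocr_text = ""
                for image in images:
                    ocr_text += pytesseract.image_to_string(image, lang='chi_sim')
                self._fill_from_text(info, ocr_text)

            return info
        except Exception as e:
            logger.error(f"Failed to extract information from PDF {pdf_path}: {str(e)}")
            return None

    def extract_ofd_info(self, ofd_path):
        """Extract the fields of an OFD e-invoice."""
        try:
            info = self._new_info(ofd_path)

            # An OFD file is actually a ZIP archive.
            # Handling is simplified here: a real application has to unpack the
            # archive and parse the XML inside it. The commented lines below are
            # illustrative only and assume the XML has already been extracted.
            # tree = ET.parse(ofd_xml_path)
            # root = tree.getroot()
            # for key, xpath in self.config['ofd'].items():
            #     element = root.find(xpath)
            #     if element is not None:
            #         info[self.FIELD_NAMES[key.replace('_xpath', '')]] = element.text

            # Because the OFD format is complex, OCR is used as a substitute.
            # Caveat: pdf2image's convert_from_path relies on Poppler, which reads
            # PDF input, so an .ofd file normally has to be converted to PDF or
            # images first; otherwise this call raises and the error is logged below.
            images = convert_from_path(ofd_path)
            ocr_text = ""
            for image in images:
                ocr_text += pytesseract.image_to_string(image, lang='chi_sim')

            # The PDF patterns are reused on the OCR text
            self._fill_from_text(info, ocr_text)

            return info
        except Exception as e:
            logger.error(f"Failed to extract information from OFD {ofd_path}: {str(e)}")
            return None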
    
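    def batch_process_files(self, files, output_path):
        """Process files in a batch and export the results to Excel.

        This is a generator: it yields (processed, total) after each file and
        writes the Excel file once the loop has finished, so the caller must
        iterate it to completion.
        """
        results = []
        total = len(files)
        processed = 0

        for file_path in files:
            try:
                file_ext = os.path.splitext(file_path)[1].lower()

                if file_ext == '.pdf':
                    info = self.extract_pdf_info(file_path)
                elif file_ext == '.ofd':
                    info = self.extract_ofd_info(file_path)
                else:
                    # Skip the file but still count it, so progress reaches 100%
                    logger.warning(f"Unsupported file type: {file_path}")
                    info = None

                if info:
                    results.append(info)
            except Exception as e:
                logger.error(f"Error while processing file {file_path}: {str(e)}")

            processed += 1
            yield processed, total

        # Export to Excel
        # (in a generator, the return value is only visible as StopIteration.value)
        if results:
            df = pd.DataFrame(results)
            df.to_excel(output_path, index=False)
            logger.info(f"Exported {len(results)} record(s) to {output_path}")
            return True
        else:
            logger.warning("No data to export")
            return False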
 
class InvoiceExtractorGUI:
    def __init__(self, root):
        self.root = root
        self.root.title("E-Invoice Information Extraction System")
        self.root.geometry("800x600")
        
        self.extractor = InvoiceExtractor()
        self.selected_files = []
        self.is_processing = False
        
        self.create_widgets()
    
    def create_widgets(self):
        """Build the GUI."""
        # Main frame
        main_frame = ttk.Frame(self.root, padding="10")
        main_frame.pack(fill=tk.BOTH, expand=True)

        # File selection area
        file_frame = ttk.LabelFrame(main_frame, text="File selection", padding="10")
        file_frame.pack(fill=tk.X, pady=5)

        ttk.Button(file_frame, text="Select files", command=self.select_files).pack(side=tk.LEFT, padx=5)
        ttk.Button(file_frame, text="Select folder", command=self.select_folder).pack(side=tk.LEFT, padx=5)

        self.file_count_var = tk.StringVar(value="0 file(s) selected")
        ttk.Label(file_frame, textvariable=self.file_count_var).pack(side=tk.RIGHT, padx=5)

        # File list area
        list_frame = ttk.LabelFrame(main_frame, text="File list", padding="10")
        list_frame.pack(fill=tk.BOTH, expand=True, pady=5)

        # Scrollbar
        scrollbar = ttk.Scrollbar(list_frame)
        scrollbar.pack(side=tk.RIGHT, fill=tk.Y)

        # File list
        self.file_listbox = tk.Listbox(list_frame, yscrollcommand=scrollbar.set, selectmode=tk.EXTENDED)
        self.file_listbox.pack(fill=tk.BOTH, expand=True)
        scrollbar.config(command=self.file_listbox.yview)

        # Processing options area
        options_frame = ttk.LabelFrame(main_frame, text="Processing options", padding="10")
        options_frame.pack(fill=tk.X, pady=5)

        ttk.Label(options_frame, text="Output file:").pack(side=tk.LEFT, padx=5)

        default_output = os.path.join(os.getcwd(), f"invoice_info_{datetime.now().strftime('%Y%m%d_%H%M%S')}.xlsx")
        self.output_path_var = tk.StringVar(value=default_output)

        output_entry = ttk.Entry(options_frame, textvariable=self.output_path_var, width=50)
        output_entry.pack(side=tk.LEFT, padx=5)

        ttk.Button(options_frame, text="Browse", command=self.browse_output).pack(side=tk.LEFT, padx=5)

        # Progress area
        progress_frame = ttk.Frame(main_frame, padding="10")
        progress_frame.pack(fill=tk.X, pady=5)

        self.progress_var = tk.DoubleVar(value=0)
        self.progress_bar = ttk.Progressbar(progress_frame, variable=self.progress_var, maximum=100)
        self.progress_bar.pack(fill=tk.X)

        self.status_var = tk.StringVar(value="Ready")
        ttk.Label(progress_frame, textvariable=self.status_var).pack(anchor=tk.W)

        # Button area
        button_frame = ttk.Frame(main_frame, padding="10")
        button_frame.pack(fill=tk.X, pady=5)

        self.process_button = ttk.Button(button_frame, text="Start processing", command=self.start_processing)
        self.process_button.pack(side=tk.LEFT, padx=5)

        ttk.Button(button_frame, text="Clear list", command=self.clear_file_list).pack(side=tk.LEFT, padx=5)
        ttk.Button(button_frame, text="Exit", command=self.root.quit).pack(side=tk.RIGHT, padx=5)
    
    def select_files(self):
        """Select one or more files."""
        if self.is_processing:
            return

        files = filedialog.askopenfilenames(
            title="Select e-invoice files",
            filetypes=[("PDF files", "*.pdf"), ("OFD files", "*.ofd"), ("All files", "*.*")]
        )

        if files:
            self.selected_files = list(files)
            self.update_file_list()

    def select_folder(self):
        """Select a folder and pick up the PDF and OFD files inside it."""
        if self.is_processing:
            return

        folder = filedialog.askdirectory(title="Select the folder containing e-invoices")

        if folder:
            # List the folder once; PDF files come first, then OFD files
            names = os.listdir(folder)
            pdf_files = [os.path.join(folder, f) for f in names if f.lower().endswith('.pdf')]
            ofd_files = [os.path.join(folder, f) for f in names if f.lower().endswith('.ofd')]
            self.selected_files = pdf_files + ofd_files
            self.update_file_list()

    def update_file_list(self):
        """Refresh the file list display."""
        self.file_listbox.delete(0, tk.END)
        for file_path in self.selected_files:
            self.file_listbox.insert(tk.END, os.path.basename(file_path))
        self.file_count_var.set(f"{len(self.selected_files)} file(s) selected")

    def browse_output(self):
        """Choose where the output file is saved."""
        if self.is_processing:
            return

        output_file = filedialog.asksaveasfilename(
            title="Save output file",
            defaultextension=".xlsx",
            filetypes=[("Excel files", "*.xlsx"), ("All files", "*.*")]
        )

        if output_file:
            self.output_path_var.set(output_file)

    def clear_file_list(self):
        """Clear the file list."""
        if self.is_processing:
            return

        self.selected_files = []
        self.update_file_list()
    
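    def start_processing(self):
        """Validate the inputs and start processing in a background thread."""
        if self.is_processing or not self.selected_files:
            return

        output_path = self.output_path_var.get()
        if not output_path:
            messagebox.showerror("Error", "Please specify an output file")
            return

        # Confirm before overwriting an existing file
        if os.path.exists(output_path):
            if not messagebox.askyesno("Confirm", "The output file already exists. Overwrite it?"):
                return

        self.is_processing = True
        self.process_button.config(state=tk.DISABLED)
        self.status_var.set("Processing...")

        # Process the files on a separate thread so the GUI stays responsive.
        # The file list and output path are read here, on the main thread.
        threading.Thread(
            target=self.process_files_thread,
            args=(list(self.selected_files), output_path),
            daemon=True
        ).start()

    def process_files_thread(self, files, output_path):
        """Worker thread: run the batch job and schedule UI updates on the main loop."""
        total = len(files)
        try:
            for processed, total in self.extractor.batch_process_files(files, output_path):
                self.root.after(0, self.show_progress, processed, total)
            self.root.after(0, self.finish_processing, total, output_path, None)
        except Exception as e:
            logger.error(f"Error during processing: {str(e)}")
            self.root.after(0, self.finish_processing, total, output_path, str(e))

    def show_progress(self, processed, total):
        """Update the progress bar and status text (runs on the main thread)."""
        self.progress_var.set((processed / total) * 100)
        self.status_var.set(f"Processing {processed}/{total}")

    def finish_processing(self, total, output_path, error):
        """Report the outcome and re-enable the UI (runs on the main thread)."""
        if error is None:
            self.progress_var.set(100)
            self.status_var.set("Done")
            messagebox.showinfo("Success", f"Processed {total} file(s)\nResults saved to: {output_path}")
        else:
            self.status_var.set("Processing failed")
            messagebox.showerror("Error", f"Error during processing: {error}")
        self.is_processing = False
        self.process_button.config(state=tk.NORMAL)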
 
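def main():
    """Entry point."""
    try:
        # Check that the dependencies can be imported
        import fitz  # the PyMuPDF package is imported under the name "fitz"
        import pandas
        import pdf2image
        import pytesseract
        from PIL import Image

        # Create the GUI
        root = tk.Tk()
        app = InvoiceExtractorGUI(root)
        root.mainloop()
    except ImportError as e:
        print(f"Missing required library: {str(e)}")
        # openpyxl is the engine pandas uses to write .xlsx files
        print("Please install all dependencies: pip install PyMuPDF pandas openpyxl pdf2image pytesseract pillow")
    except Exception as e:
        print(f"Error while starting the program: {str(e)}")
        logger.error(f"Error while starting the program: {str(e)}")
 
if __name__ == "__main__":
    main()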

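The implementation consists of the following core modules:

Configuration management: defines the extraction rules for PDF and OFD files, namely the regular expression patterns and the XPath expressions for OFD.

PDF extraction: reads the PDF text with the PyMuPDF library and extracts the key fields with regular expressions; if some fields cannot be found in the text layer, the pages are rendered to images and recognized with OCR.

OFD extraction: the OFD file structure is complex, so this system relies on OCR as its main method, running pytesseract on page images. Note that pdf2image itself reads PDF input, so an OFD file has to be converted to PDF or images before this step can work.

Batch processing: processes multiple files in one run and reports progress.

Data export: collects the extracted fields into a DataFrame and exports it as an Excel file.

Graphical interface: built with tkinter, covering file selection, processing options, and progress display.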
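To make the configuration-driven extraction concrete, here is a minimal sketch of applying a pattern table to a piece of text. The sample text, the `PATTERNS` table, and the `extract_fields` helper are made up for illustration; real invoices lay out their labels differently, so the patterns usually need adjusting.

```python
import re

# Hypothetical pattern table, modeled on the 'pdf' section of the config above.
PATTERNS = {
    "invoice_code": r"发票代码：(\d+)",
    "invoice_number": r"发票号码：(\d+)",
    "total": r"价税合计：￥(\d+\.\d{2})",
}


def extract_fields(text, patterns=PATTERNS):
    """Return {field: first captured group, or '' when the pattern is absent}."""
    result = {}
    for field, pattern in patterns.items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else ""
    return result


# Made-up sample text standing in for page.get_text() or OCR output.
sample = "发票代码：044001900111\n发票号码：12345678\n价税合计：￥113.00"
fields = extract_fields(sample)
print(fields)
```

Keeping the patterns in a table rather than in the extraction loop means a new invoice layout only requires a new table, not new code.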

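Summary and possible improvements

The system offers a basic solution for extracting e-invoice information, with the following strengths:

Generality: it targets the two mainstream e-invoice formats, PDF and OFD.
Extensibility: the extraction rules live in a separate configuration dictionary, so new invoice formats can be added and rules changed easily.
User friendliness: the graphical interface is simple to operate and suits non-technical users.
Logging: processing errors and export results are logged, which helps with troubleshooting and tuning.

However, several areas can still be improved:

OFD parsing: handling OFD files through OCR is inefficient; a dedicated OFD parsing library is worth investigating.
Extraction rules: different invoice types may need customized rules; a rule configuration screen could be added.
Performance: for large numbers of files, multithreading or asynchronous processing could improve throughput.
Data validation: add checks on the extracted values to improve accuracy.
User experience: add more interactive feedback, such as file preview and a preview of the results.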
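As a starting point for the OFD parsing item, the sketch below reads text straight from the archive instead of going through OCR. It assumes that page content is stored in XML entries of the ZIP container and that visible text sits in elements whose local tag name is `TextCode`; the namespace URI and entry path in the demo data are likewise assumptions about typical OFD files. `read_ofd_text` is a hypothetical helper, not part of the program above.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def read_ofd_text(ofd_file):
    """Collect the text of every TextCode element in the archive's XML entries.

    ofd_file may be a path or a file-like object. Assumption: text is stored
    in elements whose local name is 'TextCode'; namespaces are ignored.
    """
    pieces = []
    with zipfile.ZipFile(ofd_file) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".xml"):
                continue
            try:
                root = ET.fromstring(archive.read(name))
            except ET.ParseError:
                continue  # skip entries that are not well-formed XML
            for element in root.iter():
                # Tags look like '{namespace}TextCode'; keep only the local part.
                local_name = element.tag.rsplit("}", 1)[-1]
                if local_name == "TextCode" and element.text:
                    pieces.append(element.text.strip())
    return "\n".join(pieces)


# Build a tiny in-memory archive that imitates the assumed layout (made-up data).
page_xml = (
    '<ofd:Page xmlns:ofd="http://www.ofdspec.org/2016">'
    "<ofd:TextObject><ofd:TextCode>发票号码：12345678</ofd:TextCode></ofd:TextObject>"
    "</ofd:Page>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("Doc_0/Pages/Page_0/Content.xml", page_xml)
buffer.seek(0)

ofd_text = read_ofd_text(buffer)
print(ofd_text)
```

The returned text could then be fed to the same regular expressions used for PDF files. In real files a label and its value may sit in separate text objects, in which case the patterns or the joining logic would need to account for that.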

With continued optimization and extension, the system can cover more scenarios and make e-invoice processing more efficient and more accurate.
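One inexpensive accuracy check, in the spirit of the data-validation item above, is to confirm that amount plus tax equals the total. The sketch below assumes the three values arrive as decimal strings such as those captured by the regex patterns; `amounts_consistent` is a hypothetical helper name.

```python
from decimal import Decimal, InvalidOperation


def amounts_consistent(amount, tax, total):
    """Return True when amount + tax == total, compared as exact decimals.

    Any value that is empty or not a valid number makes the check fail.
    """
    try:
        return Decimal(amount) + Decimal(tax) == Decimal(total)
    except (InvalidOperation, TypeError):
        return False


print(amounts_consistent("100.00", "13.00", "113.00"))  # True
print(amounts_consistent("100.00", "13.00", "113.10"))  # False
print(amounts_consistent("", "13.00", "113.00"))        # False
```

`Decimal` is used instead of `float` so that values such as 0.10 and 0.20 add up exactly. Rows that fail the check could be flagged for manual review rather than dropped.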
