
Extracting PDF File Content with Python (Tables, Text, Images)

Updated: 2025-03-25 09:27:57  Author: 老胖閑聊

This article shows how to extract PDF content (text, images, tables) with Python, with example code for each case, as a reference for study or work.

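PDF is a file format we run into often in day-to-day work. The pdfplumber library handles extraction from such files well: it is a pure-Python third-party library for Python 3.x, typically used to inspect the information in a PDF, and it extracts text and tables effectively. It does not support modifying or generating PDFs, and it does not process scanned PDFs. The sections below cover the extraction of tables, text, and images in turn.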
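Because pdfplumber reads only the text layer, a scanned page typically yields no text. The sketch below is one way to flag such pages before choosing an extraction route. `find_pages_without_text` and `collect_page_texts` are hypothetical helper names, not part of pdfplumber, and the sample list is made up for illustration.

```python
from typing import List, Optional


def find_pages_without_text(page_texts: List[Optional[str]]) -> List[int]:
    """Return 1-based numbers of pages whose extracted text is empty.

    Hypothetical helper: page_texts is assumed to hold one entry per page,
    for example the results of pdfplumber's page.extract_text().
    """
    empty_pages = []
    for number, text in enumerate(page_texts, start=1):
        # None or whitespace-only text suggests the page has no text layer.
        if text is None or not text.strip():
            empty_pages.append(number)
    return empty_pages


def collect_page_texts(pdf_path: str) -> List[Optional[str]]:
    """Collect the text of every page with pdfplumber."""
    import pdfplumber  # imported lazily so the helper above works without it

    with pdfplumber.open(pdf_path) as pdf:
        return [page.extract_text() for page in pdf.pages]


# Illustrative input: page 2 has no text, page 3 only whitespace.
sample = ["Invoice 2025", None, "  \n"]
print(find_pages_without_text(sample))  # prints [2, 3]
```

An empty result does not prove the PDF is fully machine-readable; it only means every page produced some text.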

1. Table extraction:

The code below extracts the tables in a PDF file and saves each one to an XLSX file:

import pdfplumber
import pandas as pd

# Extract every table in the PDF and save each one to its own XLSX file
i = 0
with pdfplumber.open("d:\\pdf_to_extract.pdf") as pdf:
    print(len(pdf.pages))  # number of pages in the PDF
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            i = i + 1
            df = pd.DataFrame(table)
            df.to_excel(f'd:\\output{i}.xlsx', index=False)

def readExcels(excelname):
    # Read every sheet of a workbook and combine them into one DataFrame.
    # sheet_name=None returns a dict of {sheet name: DataFrame};
    # header=1 skips the numeric column labels that to_excel wrote in row 0.
    sheets = pd.read_excel(excelname, sheet_name=None, engine='openpyxl', header=1)
    return pd.concat(list(sheets.values()))
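Tables returned by `extract_tables()` are lists of rows whose cells are strings or `None`, and cells often contain line breaks. The following sketch cleans such a table and serializes it to CSV using only the standard library. `clean_table` and `table_to_csv` are hypothetical helper names, and the sample table is made-up data in the same shape.

```python
import csv
import io
from typing import List, Optional

Table = List[List[Optional[str]]]


def clean_table(table: Table) -> List[List[str]]:
    """Replace None cells with "" and collapse whitespace inside cells."""
    cleaned = []
    for row in table:
        cleaned.append([
            "" if cell is None else " ".join(cell.split())
            for cell in row
        ])
    # Drop rows that are entirely empty after cleaning.
    return [row for row in cleaned if any(row)]


def table_to_csv(table: Table) -> str:
    """Serialize the cleaned table to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(clean_table(table))
    return buffer.getvalue()


# Illustrative table in the shape extract_tables() returns (made-up data).
sample_table = [
    ["Name", "Amount"],
    ["Widget\nA", "10"],
    [None, None],
    ["Widget B", None],
]
print(table_to_csv(sample_table))
```

Cleaning before building a DataFrame keeps empty cells as empty strings rather than missing values, which may or may not be what a given workflow wants.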

2. Text extraction:

The code below extracts the text in a PDF file and saves it to a txt file:

from pathlib import Path
import pdfplumber

def extract_text(pdf_path):
    # Concatenate the text of every page, one page after another
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page in pdf.pages:
            # A page without a text layer may yield nothing; treat it as empty
            text += (page.extract_text() or "") + "\n"
    return text

# Usage example
pdf_path_name = "d:\\pdf_to_extract.pdf"
# Write the .txt next to the PDF, using the same base name
output_path = Path(pdf_path_name).with_suffix('.txt')
extracted_text = extract_text(pdf_path_name)
with open(output_path, 'w', encoding='utf-8') as f:
    f.write(extracted_text)  # the with block closes the file automatically
print(f'Done! Output path: {output_path.parent}')
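When the saved text will be searched or reviewed later, keeping page boundaries visible can help. The sketch below is one possible variation that prefixes each page with a marker line. `format_pages` and `extract_text_with_markers` are hypothetical names, and the marker format is an arbitrary choice.

```python
from typing import Iterable, Optional


def format_pages(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, prefixing each page with a marker line."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- page {number} ---")
        parts.append(text or "")  # None means no text was found on the page
    return "\n".join(parts) + "\n"


def extract_text_with_markers(pdf_path: str) -> str:
    """Extract all pages with pdfplumber and add page markers."""
    import pdfplumber  # imported lazily so format_pages works without it

    with pdfplumber.open(pdf_path) as pdf:
        return format_pages(page.extract_text() for page in pdf.pages)


# Illustrative input: the second page has no extractable text.
print(format_pages(["first page", None, "third page"]))
```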

3. Image extraction:

The code below extracts the images in a PDF file and saves them to a directory it creates:

import pdfplumber
import os

# Extract the images embedded in a PDF and save them
def extract_images_from_pdf(pdf_file, output_folder):
    # Create the output folder if it does not exist
    os.makedirs(output_folder, exist_ok=True)

    with pdfplumber.open(pdf_file) as pdf:
        # Iterate over every page
        for page_number, page in enumerate(pdf.pages, start=1):
            print(f'Page number: {page.page_number}')
            print(f'Page width: {page.width}')
            print(f'Page height: {page.height}')

            # Iterate over all images on this page
            for idx, image in enumerate(page.images, start=1):
                # Get the image's binary data from its PDF stream
                image_data = image['stream'].get_data()

                # The stream is not converted to PNG: JPEG data starts with
                # FF D8; anything else is saved as raw bytes (.bin)
                ext = 'jpg' if image_data.startswith(b'\xff\xd8') else 'bin'
                image_filename = os.path.join(output_folder, f'image_{page_number}_{idx}.{ext}')

                # Save the image data to a file
                with open(image_filename, 'wb') as f:
                    f.write(image_data)
                print(f'Image saved to: {image_filename}')

# Call the function
pdf_file = 'd:\\pdf_to_extract.pdf'
output_folder = 'extracted_images'
extract_images_from_pdf(pdf_file, output_folder)
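The bytes in an image stream are not always a complete image file, so an alternative worth considering is to render the region of the page that the image occupies. The sketch below assumes each entry in `page.images` carries `x0`, `top`, `x1`, and `bottom` coordinates and that pdfplumber's page rendering backend is available. `clamp_bbox` and `render_images` are hypothetical names, and the 150 dpi default is an arbitrary choice.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def clamp_bbox(box: BBox, page_bbox: BBox) -> BBox:
    """Clamp an (x0, top, x1, bottom) box so it lies within the page box."""
    x0, top, x1, bottom = box
    px0, ptop, px1, pbottom = page_bbox
    return (max(px0, x0), max(ptop, top), min(px1, x1), min(pbottom, bottom))


def render_images(pdf_path: str, output_folder: str, resolution: int = 150) -> None:
    """Render each image's page region to a PNG file."""
    import os
    import pdfplumber  # imported lazily so clamp_bbox works without it

    os.makedirs(output_folder, exist_ok=True)
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for idx, image in enumerate(page.images, start=1):
                raw = (image["x0"], image["top"], image["x1"], image["bottom"])
                bbox = clamp_bbox(raw, page.bbox)
                # Skip boxes that collapse to nothing after clamping.
                if bbox[2] <= bbox[0] or bbox[3] <= bbox[1]:
                    continue
                target = os.path.join(
                    output_folder, f"page{page.page_number}_img{idx}.png"
                )
                page.crop(bbox).to_image(resolution=resolution).save(target)


# Illustrative check on a US Letter page (612 x 792 points).
print(clamp_bbox((-5, 10, 700, 800), (0, 0, 612, 792)))  # prints (0, 10, 612, 792)
```

Rendering captures whatever is drawn in that region, including anything overlaid on the image, so the result can differ from the embedded original.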

Extension: extracting text from a PDF file with Python (text layer only; pdfplumber does not perform OCR)

I. Libraries used

import pdfplumber

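Make sure the library above is installed; otherwise the import will fail.

# Libraries can be installed from the Tsinghua mirror (the address may change; check the official site for the current one)

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfplumber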

II. Extracting the text

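def pdf_to_text(pdf_path):
    """
    Extract the text content of a PDF file.
    Args:
        pdf_path (str): Path to the PDF file.
    Returns:
        str: The extracted text.
    """
    text = ''
    # Open the PDF file
    with pdfplumber.open(pdf_path) as pdf:
        # Iterate over every page
        for page in pdf.pages:
            # Extract the text and append it to text
            # (a page with no text layer contributes an empty string)
            text += page.extract_text() or ''

    # Remove line breaks from the text (adjust as needed)
    text = text.replace('\n', '')

    return text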

In savewords, replace 'yourfile' with the path to your PDF file.
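Deleting every line break in pdf_to_text suits text without spaces between words, but for space-delimited languages it glues the last word of one line to the first word of the next. A minimal alternative, assuming that collapsing all whitespace to single spaces is acceptable for the document at hand, is sketched below. `normalize_whitespace` is a hypothetical helper name.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


sample = "Total amount\ndue:  42\nEUR"
print(sample.replace("\n", ""))      # prints: Total amountdue:  42EUR
print(normalize_whitespace(sample))  # prints: Total amount due: 42 EUR
```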

III. Saving the text

def savewords():
    """
    Save the text extracted from a PDF file to a text file.
    """
    # Path of the input PDF file
    pdf_path = 'yourfile'
    # Call pdf_to_text to extract the text
    extracted_text = pdf_to_text(pdf_path)

    # Write the extracted text to a txt file
    with open('extracted_text.txt', 'w', encoding='utf-8') as file:
        file.write(extracted_text)
    print("Text extracted successfully!")

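Note that saving and extracting are two functions that belong together; copy both when writing your code. They are shown separately here only to make them easier to follow.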

IV. Running

if __name__ == "__main__":
    savewords()
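Instead of editing a hard-coded path, the script could take the PDF path from the command line. The sketch below shows one way to do that with `argparse`. The function names, the `-o/--output` option, and the default of writing the .txt next to the PDF are all assumptions made for illustration, not part of the original script.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the PDF path and an optional output path."""
    parser = argparse.ArgumentParser(
        description="Extract text from a PDF into a .txt file."
    )
    parser.add_argument("pdf_path", type=Path, help="path to the input PDF")
    parser.add_argument(
        "-o", "--output", type=Path, default=None,
        help="output .txt path (default: next to the PDF)",
    )
    args = parser.parse_args(argv)
    if args.output is None:
        # Same folder and base name as the PDF, with a .txt extension.
        args.output = args.pdf_path.with_suffix(".txt")
    return args


def main(argv: Optional[List[str]] = None) -> None:
    """Extract the text of every page and write it to the output file."""
    import pdfplumber  # imported lazily so parse_args works without it

    args = parse_args(argv)
    with pdfplumber.open(str(args.pdf_path)) as pdf:
        text = "".join(page.extract_text() or "" for page in pdf.pages)
    args.output.write_text(text, encoding="utf-8")
    print(f"Text written to {args.output}")
```

To use it as a script, call `main()` from the same `if __name__ == "__main__":` guard shown above, then run it with the PDF path as the first argument.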

Summary

import pdfplumber

def pdf_to_text(pdf_path):
    """
    Extract the text content of a PDF file.
    Args:
        pdf_path (str): Path to the PDF file.
    Returns:
        str: The extracted text.
    """
    text = ''
    # Open the PDF file
    with pdfplumber.open(pdf_path) as pdf:
        # Iterate over every page
        for page in pdf.pages:
            # Extract the text and append it to text
            # (a page with no text layer contributes an empty string)
            text += page.extract_text() or ''

    # Remove line breaks from the text (adjust as needed)
    text = text.replace('\n', '')

    return text

def savewords():
    """
    Save the text extracted from a PDF file to a text file.
    """
    # Path of the input PDF file
    pdf_path = 'yourfile'
    # Call pdf_to_text to extract the text
    extracted_text = pdf_to_text(pdf_path)

    # Write the extracted text to a txt file
    with open('extracted_text.txt', 'w', encoding='utf-8') as file:
        file.write(extracted_text)
    print("Text extracted successfully!")

if __name__ == "__main__":
    savewords()
