
Splitting and Efficiently Processing Large PDF Files with Python: A Detailed Guide

Updated: 2025-03-10 09:45:38  Author: 夢想畫家


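When working with large PDF files, it usually helps to break them into smaller, more manageable chunks. This process is called partitioning. It improves processing efficiency and makes the document easier to analyze or manipulate. This article shows how to use Python and the Unstructured.io library to divide a PDF file into smaller parts.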

We will use two Python libraries for this task:

PyPDF2: a library that can read, write, merge, and split PDF files.

Unstructured.io: a library that can segment PDF documents using document image analysis models.

Here is the Python code for the task:

import math
import os

from PyPDF2 import PdfReader, PdfWriter
from unstructured.partition.pdf import partition_pdf

# Resolve paths relative to the directory that contains this script
base_dir = os.path.abspath(os.path.dirname(__file__))
source_pdf = os.path.join(base_dir, "sample02.pdf")

# Create the output directory if it doesn't exist
output_dir = os.path.join(base_dir, "output")
os.makedirs(output_dir, exist_ok=True)

# Read the original PDF
input_pdf = PdfReader(source_pdf)
total_pages = len(input_pdf.pages)

# Pages per batch; 2 keeps the demo small, raise it (e.g. to 100) for large files
batch_size = 2
# Ceiling division, so an evenly divisible page count does not add an empty batch
num_batches = math.ceil(total_pages / batch_size)

# Output files are named after the source file, e.g. output/sample02-batch1.pdf
stem = os.path.splitext(os.path.basename(source_pdf))[0]

# Extract batches of `batch_size` pages from the PDF
for b in range(num_batches):
    writer = PdfWriter()

    # Get the start (inclusive) and end (exclusive) page indices for this batch
    start_page = b * batch_size
    end_page = min((b + 1) * batch_size, total_pages)

    # Add pages in this batch to the writer
    for i in range(start_page, end_page):
        writer.add_page(input_pdf.pages[i])

    # Save the batch to a separate PDF file
    batch_filename = os.path.join(output_dir, f"{stem}-batch{b + 1}.pdf")
    with open(batch_filename, "wb") as output_file:
        writer.write(output_file)

    # Analyze the batch with the `partition_pdf` function from Unstructured.io
    elements = partition_pdf(filename=batch_filename)
    print(elements)
    # Do something with `elements`...

# Extracting table data: use the "hi_res" strategy. This will process without issue.
# "copy-protected.pdf" is a sample file name; uncomment and point it at a real file.
# elements = partition_pdf("copy-protected.pdf", strategy="hi_res")

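Step 1: Read the PDF file

First, we import the necessary classes from the PyPDF2 library: PdfReader and PdfWriter. PdfReader reads the original PDF file, which in this example is sample02.pdf, located in the same directory as the script.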

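Step 2: Partition the PDF

We decide on the batch size, that is, the number of pages each chunk of the PDF will contain. The example uses a batch size of 2 pages to keep the demo small; for large documents a bigger value such as 100 may suit better, and it can be adjusted to your needs.

The number of batches is the total page count divided by the batch size, rounded up, so that any leftover pages land in a final, shorter batch when the page count is not a multiple of the batch size. Note that floor division plus one (`len(input_pdf.pages) // batch_size + 1`) yields an extra, empty batch whenever the page count divides evenly; ceiling division avoids that.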
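The batch arithmetic can be checked in isolation. The following is a minimal sketch rather than the article's own code: `batch_ranges` is a hypothetical helper name, and it uses ceiling division to produce the (start, end) page-index range of each batch.

```python
import math


def batch_ranges(total_pages, batch_size):
    """Return (start, end) page-index pairs, end exclusive, one per batch."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Ceiling division: leftover pages get one final, shorter batch,
    # and an evenly divisible page count gets no empty batch.
    num_batches = math.ceil(total_pages / batch_size)
    return [
        (b * batch_size, min((b + 1) * batch_size, total_pages))
        for b in range(num_batches)
    ]


if __name__ == "__main__":
    print(batch_ranges(5, 2))  # 5 pages, 2 per batch
    print(batch_ranges(4, 2))  # 4 pages, 2 per batch
```

For 5 pages this gives three ranges, the last one holding a single page. For 4 pages it gives exactly two, whereas `4 // 2 + 1` would ask for a third, empty batch.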

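Step 3: Write the PDF chunks

Next, we loop over the batches and create a new PdfWriter object for each one. For each batch we compute the start and end page numbers, then use the add_page method to add every page in that range to the PdfWriter.

Once all pages of a batch have been added, we write them to a new PDF file in the 'output' subdirectory. Each chunk's file name consists of the original file name and the batch number.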
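The naming scheme described here can be sketched on its own. This is an illustration under assumptions: `batch_output_path` is a hypothetical helper, and the `-batchN` suffix simply follows the pattern used in the article's script.

```python
from pathlib import Path


def batch_output_path(source_pdf, output_dir, batch_number):
    """Build a path such as output/sample02-batch1.pdf for one batch."""
    # Path.stem is the file name without its directory or ".pdf" suffix
    stem = Path(source_pdf).stem
    return Path(output_dir) / f"{stem}-batch{batch_number}.pdf"


if __name__ == "__main__":
    print(batch_output_path("sample02.pdf", "output", 1))
```

The output directory has to exist before a file is opened for writing inside it, for example via `os.makedirs(output_dir, exist_ok=True)`.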

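Step 4: Analyze the PDF chunks

After splitting the PDF into smaller chunks, you can use the partition_pdf function from the Unstructured.io library to analyze each batch. The function uses a document image analysis model to segment the PDF document and returns a list of the elements that appear on the pages of the parsed document.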
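One way to "do something" with the returned list is to tally it. The sketch below is a hedged illustration: `summarize_elements` is a hypothetical helper, and it assumes each element exposes `category` and `text` attributes, as the element objects from Unstructured.io are expected to. The demo uses stand-in objects so that it runs without the library installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_elements(elements):
    """Count elements by category and collect their non-empty text."""
    elements = list(elements)  # allow generators; we iterate twice
    # Assumption: elements carry `category` and `text` attributes.
    # Fall back to the class name if `category` is missing.
    counts = Counter(
        getattr(el, "category", type(el).__name__) for el in elements
    )
    texts = [
        el.text for el in elements
        if (getattr(el, "text", "") or "").strip()
    ]
    return counts, texts


if __name__ == "__main__":
    # Stand-in objects that mimic parsed elements (not real library output)
    demo = [
        SimpleNamespace(category="Title", text="Introduction"),
        SimpleNamespace(category="NarrativeText", text="First paragraph."),
    ]
    counts, texts = summarize_elements(demo)
    print(dict(counts))
    print(texts)
```

Inside the article's loop, this could be called as `summarize_elements(partition_pdf(filename=batch_filename))`.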

Summary

Splitting a large PDF file into smaller chunks makes it easier to process, more fault-tolerant, and lighter on memory.
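The fault-tolerance benefit comes from handling each batch independently. The following is a minimal sketch, assuming a hypothetical helper named `process_batches`: a failure in one batch is recorded and the remaining batches still run.

```python
def process_batches(batch_files, process):
    """Run `process` on each batch file; keep results and failures apart."""
    results, failures = {}, {}
    for batch_file in batch_files:
        try:
            results[batch_file] = process(batch_file)
        except Exception as exc:
            # One bad batch should not stop the others; record it for a retry
            failures[batch_file] = exc
    return results, failures


if __name__ == "__main__":
    # `int` stands in for a real per-batch parser: "x" fails, the rest succeed
    ok, failed = process_batches(["1", "x", "3"], int)
    print(ok)
    print(list(failed))
```

With the article's script, `process` could be `lambda f: partition_pdf(filename=f)`, so a batch that fails to parse can be retried alone instead of restarting the whole document.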

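Additional methods

Below are some other ways to split PDFs with Python, for readers who are interested.

Method 1: Batch-splitting a PDF file

Now let's write a script that batch-splits a PDF file. Suppose there is a large PDF file that needs to be cut into smaller files of 5 pages each.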

from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_pdf, output_prefix, pages_per_file=5):
    """Split input_pdf into files of at most pages_per_file pages each."""
    with open(input_pdf, 'rb') as file:
        # PdfReader/PdfWriter replace PdfFileReader/PdfFileWriter,
        # which PyPDF2 3.x no longer supports
        pdf_reader = PdfReader(file)
        num_pages = len(pdf_reader.pages)

        for i in range(0, num_pages, pages_per_file):
            pdf_writer = PdfWriter()
            output_pdf = f'{output_prefix}_{i // pages_per_file + 1}.pdf'

            # Copy this chunk's pages; the last chunk may be shorter
            for j in range(i, min(i + pages_per_file, num_pages)):
                pdf_writer.add_page(pdf_reader.pages[j])

            with open(output_pdf, 'wb') as new_file:
                pdf_writer.write(new_file)

            print(f'Created file: {output_pdf}')

# Example call
split_pdf('large_file.pdf', 'output_split')

Method 2: Batch-splitting PDFs

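import glob
import os

# Reuses split_pdf(input_pdf, output_prefix, pages_per_file) from Method 1.
# Assumption: the "split rule" is a fixed number of pages per output file.

def main():
    directory = input("Enter the directory containing the PDF files: ")
    pages_per_file = int(input("Enter the number of pages per output file: "))
    output_directory = input("Enter the output directory: ")
    os.makedirs(output_directory, exist_ok=True)

    for pdf_path in sorted(glob.glob(os.path.join(directory, "*.pdf"))):
        # Prefix each output file with the source file's name (without ".pdf")
        stem = os.path.splitext(os.path.basename(pdf_path))[0]
        split_pdf(pdf_path, os.path.join(output_directory, stem), pages_per_file)

    print("Splitting complete!")

if __name__ == "__main__":
    main()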
