Batch-extracting PDF content with C++

Updated: 2025-02-10 10:26:33 Author: 平安喜樂-開開心心

This article shows how to use C++ to batch-extract the text of PDF files and export it to a spreadsheet, and how to batch-rename PDF files.

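Batch-extracting PDF text and exporting it to a spreadsheet

Use cases

Document data organization: when handling large numbers of PDF documents such as academic papers and reports, key information (title, authors, abstract and so on) needs to be extracted and collected in a spreadsheet for later analysis and comparison.

Information archiving: a company or institution may hold many contracts, agreements and similar PDF documents, and needs to extract important clauses, dates and amounts into a spreadsheet so they can be managed and queried in one place.

Implementation approach and steps

1. Choose suitable libraries

Poppler: parses PDF files and extracts their text. Poppler is an open-source PDF rendering library with a C++ interface (poppler-cpp) that makes text extraction straightforward.

LibXL: creates and manipulates Excel spreadsheets. It is a cross-platform C++ library that can create, read and modify Excel files.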

2. Install the dependencies

On Linux, Poppler can be installed with the package manager. On Ubuntu, for example, use the following command:

sudo apt-get install libpoppler-cpp-dev

LibXL has to be downloaded from its official website and its headers and library files added to the project.

3. Write the code

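#include <iostream>
#include <memory>
#include <string>
#include <vector>
#include <poppler/cpp/poppler-document.h>
#include <poppler/cpp/poppler-page.h>
#include "libxl.h"

using namespace libxl;

// Extract the text of every page of a PDF file.
// Returns an empty string if the file cannot be opened or is locked.
std::string extractTextFromPDF(const std::string& filePath) {
    // unique_ptr frees the document and the pages on every return path
    std::unique_ptr<poppler::document> doc(poppler::document::load_from_file(filePath));
    if (!doc || doc->is_locked()) {
        return "";
    }

    std::string text;
    for (int i = 0; i < doc->pages(); ++i) {
        std::unique_ptr<poppler::page> page(doc->create_page(i));
        if (page) {
            // to_utf8() keeps non-Latin text, which to_latin1() would lose
            const poppler::byte_array bytes = page->text().to_utf8();
            text.append(bytes.begin(), bytes.end());
        }
    }
    return text;
}

// Extract the text of each PDF file and write it to an Excel sheet, one file per row
void batchExtractPDFsToExcel(const std::vector<std::string>& pdfFiles, const std::string& outputFilePath) {
    // xlCreateXMLBook() writes the .xlsx format; xlCreateBook() writes the older .xls format
    Book* book = xlCreateXMLBook();
    if (!book) {
        return;
    }
    Sheet* sheet = book->addSheet("PDF Text");
    if (sheet) {
        for (size_t i = 0; i < pdfFiles.size(); ++i) {
            const std::string text = extractTextFromPDF(pdfFiles[i]);
            const int row = static_cast<int>(i);
            sheet->writeStr(row, 0, pdfFiles[i].c_str()); // column A: file name
            sheet->writeStr(row, 1, text.c_str());        // column B: extracted text
        }
    }
    if (!book->save(outputFilePath.c_str())) {
        std::cerr << "Failed to save " << outputFilePath << ": " << book->errorMessage() << std::endl;
    }
    book->release();
}

int main() {
    std::vector<std::string> pdfFiles = {
        "file1.pdf",
        "file2.pdf",
        // add more PDF file paths here
    };
    std::string outputFilePath = "output.xlsx";
    batchExtractPDFsToExcel(pdfFiles, outputFilePath);
    return 0;
}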

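4. Compile and run

Compile the code with the following command (if LibXL is not installed in a system location, add the matching -I and -L paths for its headers and library):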

g++ -o extract_pdf extract_pdf.cpp -lpoppler-cpp -lxl

Run the resulting executable:

./extract_pdf

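Batch-renaming PDF files

Use cases

File organization: when many PDF files have been collected from different sources and their names are inconsistent, they need to be renamed according to their content or a specific rule so that they are easier to manage and find.

Data import: when PDF files are imported into a system or database that requires file names to follow a naming convention, the files need to be renamed in bulk.

Implementation approach and steps

1. Choose suitable libraries

Use <filesystem> from the C++ standard library (C++17 and later) for file and directory operations.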
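As a small illustration of what <filesystem> offers here, fs::path splits a path into parts such as extension(), stem() and parent_path(). The sketch below uses extension() to recognize PDF files regardless of letter case, so that "SCAN.PDF" is matched as well. The helper name hasPdfExtension is an assumption made for this example.

```cpp
#include <algorithm>
#include <cctype>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: true if the path ends in ".pdf", ignoring letter case.
bool hasPdfExtension(const fs::path& p) {
    std::string ext = p.extension().string(); // e.g. ".PDF" for "scan.PDF"
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext == ".pdf";
}
```

A plain comparison such as path.extension() == ".pdf" is case-sensitive, which matters on file systems where both "a.pdf" and "a.PDF" can occur.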

2. Write the code

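#include <algorithm>
#include <filesystem>
#include <iostream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Rename every PDF file in a directory to 1.pdf, 2.pdf, ...
void batchRenamePDFs(const std::string& directoryPath) {
    // Collect the paths first: renaming files while iterating over the same
    // directory may cause entries to be visited twice or skipped
    std::vector<fs::path> pdfs;
    for (const auto& entry : fs::directory_iterator(directoryPath)) {
        if (entry.is_regular_file() && entry.path().extension() == ".pdf") {
            pdfs.push_back(entry.path());
        }
    }
    std::sort(pdfs.begin(), pdfs.end()); // directory order is unspecified

    int counter = 1;
    for (const auto& oldPath : pdfs) {
        const fs::path newPath = oldPath.parent_path() / (std::to_string(counter) + ".pdf");
        ++counter;
        if (oldPath == newPath) {
            continue; // already has the target name
        }
        if (fs::exists(newPath)) {
            // fs::rename would silently replace an existing file, so skip instead
            std::cerr << "Skipped " << oldPath << ": " << newPath << " already exists" << std::endl;
            continue;
        }
        fs::rename(oldPath, newPath);
        std::cout << "Renamed " << oldPath << " to " << newPath << std::endl;
    }
}

int main() {
    std::string directoryPath = "./pdfs"; // replace with the actual PDF directory
    batchRenamePDFs(directoryPath);
    return 0;
}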
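The code above names files 1.pdf, 2.pdf and so on. Where a naming convention is required, as in the data-import use case, the target name could be built by a small helper instead. The following is only a sketch: the function name makeSequentialName, the prefix and the padding width are assumptions for illustration, and the index is assumed to be non-negative.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: builds names such as "report_0007.pdf".
// `width` is the minimum number of digits; shorter numbers are padded with zeros.
std::string makeSequentialName(const std::string& prefix, int index, int width) {
    std::ostringstream name;
    name << prefix << std::setw(width) << std::setfill('0') << index << ".pdf";
    return name.str();
}
```

Zero-padded numbers sort the same way alphabetically and numerically (0002 comes before 0010), which keeps renamed files in order in file managers and in sorted path lists. In batchRenamePDFs, the expression std::to_string(counter) + ".pdf" could be replaced by a call such as makeSequentialName("report_", counter, 4).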

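3. Compile and run

Compile the code with the following command (GCC versions before 9 also need -lstdc++fs to link <filesystem>):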

g++ -std=c++17 -o rename_pdf rename_pdf.cpp

Run the resulting executable:

./rename_pdf

The code examples above show the basic approach; extend and modify them to suit your own requirements.
