
Using PDFBox in Java to Extract PDF Text and Count Keyword Occurrences

Updated: 2025-05-16 11:00:57  Author: 码农研究僧

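This article introduces the basics of the Apache PDFBox library: loading a PDF file with PDDocument, extracting text with PDFTextStripper, and counting how often a word occurs. It also shows how to handle a PDF served from an online URL.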

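1. Basics

Apache PDFBox is an open-source Java library for working with PDF files. It supports:

Reading PDF content (text, images, metadata)
Creating and modifying PDF documents
Extracting text for search, analysis, and similar tasks

Maven dependency: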

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.29</version>
</dependency>

Download the PDF to a local file first, then count the keyword:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PDFWordCounter {

    public static void main(String[] args) {
        String pdfPath = "sample.pdf";  // replace with the path to your PDF file
        String keyword = "Java";        // the word to count

        // try-with-resources closes the document even if text extraction fails
        try (PDDocument document = PDDocument.load(new File(pdfPath))) {
            // Extract the text with PDFTextStripper
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);

            // Lowercase both sides so the match ignores case
            String lowerText = text.toLowerCase();
            String lowerKeyword = keyword.toLowerCase();

            // Count the occurrences
            int count = countOccurrences(lowerText, lowerKeyword);

            System.out.println("Occurrences of \"" + keyword + "\": " + count);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

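    // Count non-overlapping occurrences of word in text using indexOf
    private static int countOccurrences(String text, String word) {
        if (word.isEmpty()) {
            return 0; // an empty word would never advance the index and loop forever
        }
        int count = 0;
        int index = 0;
        while ((index = text.indexOf(word, index)) != -1) {
            count++;
            index += word.length();
        }
        return count;
    }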
}

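The key points of the demo above, in detail:

PDDocument.load(File)
Loads a PDF file into memory.
PDFBox represents the whole PDF as a PDDocument object. You must call close() when you are done with it to release its resources.
PDFTextStripper
The core PDFBox class for text extraction. It tries to extract text in reading order and suits text-based PDFs. It does not work on image-only scans, which need OCR.
Case-insensitive counting
Real-world keyword searches usually need to ignore case, so the text and the keyword are both converted to lowercase first.
Counting with indexOf
This is the most basic and intuitive method. It is efficient but imprecise, because it matches substrings: "Java" is also counted inside "JavaScript".
For a more precise count (whole words only), use a regular expression: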

Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(text);
int count = 0;
while (matcher.find()) {
    count++;
}
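
To make the difference between the two counting methods concrete, here is a minimal sketch that runs both on the same sentence. The class name MatchModes, the method names, and the sample text are made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchModes {

    // Substring counting as in the indexOf demo, made case-insensitive by lowercasing.
    static int countSubstring(String text, String word) {
        if (word.isEmpty()) {
            return 0;
        }
        String lowerText = text.toLowerCase();
        String lowerWord = word.toLowerCase();
        int count = 0;
        int index = 0;
        while ((index = lowerText.indexOf(lowerWord, index)) != -1) {
            count++;
            index += lowerWord.length();
        }
        return count;
    }

    // Whole-word counting with \b word boundaries.
    static int countWholeWord(String text, String word) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "Java is not JavaScript. I like java.";
        System.out.println(countSubstring(text, "Java")); // 3: also counts the "Java" in "JavaScript"
        System.out.println(countWholeWord(text, "Java")); // 2: only the standalone words
    }
}
```

Which count is the "right" one depends on the use case: the substring count suits fuzzy searching, while the whole-word count suits term statistics.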

2. Online URLs

2.1 English

Keep the following points in mind for this demo:

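Consideration	Details
Whether the PDF is publicly accessible	PDFs protected by a password or a login cannot be fetched
File size	Avoid downloading and analyzing very large files; they can cause memory problems
Chinese PDFs	If a Chinese PDF consists of scanned images, PDFBox cannot extract its text directly (OCR is required)
Encoding problems	If Chinese text comes out garbled, the PDF may not embed its fonts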
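
One way to act on the file-size point is to cap how many bytes are read from the stream before handing anything to PDFBox. The following is only a sketch using the standard library: the class LimitedReader, the method readAtMost, and the limits in main are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class LimitedReader {

    // Reads the whole stream into memory, but fails as soon as more than maxBytes were read.
    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
            if (total > maxBytes) {
                throw new IOException("Input exceeds the limit of " + maxBytes + " bytes");
            }
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeDownload = new byte[1000]; // stands in for the bytes of a downloaded PDF

        // Within the limit: all bytes are returned.
        System.out.println(readAtMost(new ByteArrayInputStream(fakeDownload), 2000).length);

        // Over the limit: the read is rejected.
        try {
            readAtMost(new ByteArrayInputStream(fakeDownload), 500);
        } catch (IOException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

In a real program the input would be the stream from URL.openStream(), and the returned array could be passed to PDDocument.load(byte[]), which PDFBox 2.x provides.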

Approach:

Open an input stream to the online PDF with URL.openStream()
Read the PDF with PDFBox's PDDocument.load(InputStream)
Extract the text with PDFTextStripper
Count keyword occurrences with string methods or a regular expression

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.InputStream;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnlinePDFKeywordCounter {

    public static void main(String[] args) {
        String pdfUrl = "https://www.example.com/sample.pdf"; // link to your online PDF
        String keyword = "Java";  // the keyword to count

        try (InputStream inputStream = new URL(pdfUrl).openStream();
             PDDocument document = PDDocument.load(inputStream)) {

            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);

            // Match on word boundaries with a regex (case-insensitive)
            Pattern pattern = Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b", Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(text);

            int count = 0;
            while (matcher.find()) {
                count++;
            }

            System.out.println("Occurrences of \"" + keyword + "\" in the online PDF: " + count);

        } catch (Exception e) {
            System.err.println("Error while processing the PDF: " + e.getMessage());
            e.printStackTrace();
        }
    }
}

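2.2 Mixed Chinese and English

Method	Use case	Supports Chinese?
`indexOf`	Works for both Chinese and English	Yes
`Pattern + \\b`	English whole-word matching only	No

The regular expression \\b...\\b (a "word boundary") does not work for Chinese: Chinese text does not separate words with spaces, so a keyword usually sits directly next to other characters and no boundary is found.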
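
For text that mixes both languages, one option is to pick the method per keyword. The sketch below assumes that English words are separated from the surrounding Chinese text by spaces or punctuation. The class MixedKeywordCounter, the helper countKeyword, and the sample sentence are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MixedKeywordCounter {

    // Hypothetical helper: whole-word regex for ASCII keywords, substring counting otherwise.
    static int countKeyword(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0;
        }
        int count = 0;
        if (keyword.matches("[A-Za-z0-9]+")) {
            // ASCII keyword: match whole words only, ignoring case.
            Matcher matcher = Pattern
                    .compile("\\b" + Pattern.quote(keyword) + "\\b", Pattern.CASE_INSENSITIVE)
                    .matcher(text);
            while (matcher.find()) {
                count++;
            }
            return count;
        }
        // Anything else (for example Chinese): count plain substrings.
        int index = 0;
        while ((index = text.indexOf(keyword, index)) != -1) {
            count++;
            index += keyword.length();
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "公司管理層表示管理層將繼續使用 Java 而不是 JavaScript";
        System.out.println(countKeyword(text, "管理層")); // 2
        System.out.println(countKeyword(text, "Java"));   // 1: "JavaScript" is not counted
    }
}
```

The ASCII check is deliberately strict. Keywords containing spaces, hyphens, or a mix of scripts fall through to plain substring counting.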

Counting keyword occurrences in a PDF at an online URL:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.InputStream;
import java.net.URL;

public class OnlinePDFKeywordCounter {

    public static void main(String[] args) {
        String pdfUrl = "https://www.xxxx.pdf";
        String keyword = "管理層";  // the Chinese keyword to count ("management")

        try (InputStream inputStream = new URL(pdfUrl).openStream();
             PDDocument document = PDDocument.load(inputStream)) {

            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);

            // Count with indexOf directly; Chinese has no letter case, so no lowercasing is needed
            int count = countOccurrences(text, keyword);
            System.out.println("Occurrences of \"" + keyword + "\": " + count);

        } catch (Exception e) {
            System.err.println("Error while processing the PDF: " + e.getMessage());
            e.printStackTrace();
        }
    }

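    // Simple count of non-overlapping substring occurrences (works for Chinese)
    private static int countOccurrences(String text, String keyword) {
        if (keyword.isEmpty()) {
            return 0; // an empty keyword would never advance the index and loop forever
        }
        int count = 0;
        int index = 0;
        while ((index = text.indexOf(keyword, index)) != -1) {
            count++;
            index += keyword.length();
        }
        return count;
    }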
}

The result is shown in the screenshot below: