A Simple Example of Extracting Text from a PDF with Python

Updated: 2022-07-25 11:56:57  Author: somenzz

PDF files come up constantly in day-to-day work. Most of the time we only view or edit them, but sometimes we need to extract the text they contain. This article walks through a simple way to do that with Python.

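Step 1: Install the libraries

1. tika: detects document types and extracts content from a wide range of file formats

2. wand: a simple ctypes-based ImageMagick binding

3. pytesseract: an OCR tool (a Python wrapper around Tesseract)

Create a virtual environment and install them: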

python -m venv venv
source venv/bin/activate
pip install tika wand pytesseract
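
These packages are thin wrappers, so pip alone is not enough: tika needs a Java runtime, pytesseract needs the tesseract binary, and ImageMagick normally relies on Ghostscript to render PDF pages. Below is a minimal sketch for checking that these programs are on PATH. The helper name `missing_tools` is hypothetical, and the executable names are assumptions that vary by platform (Ghostscript is not called `gs` on Windows, for instance).

```python
import shutil


def missing_tools(tools=("java", "tesseract", "gs")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

An empty list only means the executables were found. It does not confirm that the German language data for Tesseract is installed.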

Step 2: Write the code

Suppose the PDF contains both text and images. The following code reads the text layer directly:

import io
import pytesseract
import sys
 
from PIL import Image
from tika import parser
from wand.image import Image as wi
 
text_raw = parser.from_file("example.pdf")
print(text_raw['content'].strip())
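
One caveat: the 'content' value in the returned dict can be None when the PDF has no text layer (a scanned document, for example), and calling .strip() on it then raises AttributeError. A small guard could look like this; the helper name `content_or_empty` is hypothetical.

```python
def content_or_empty(parsed):
    """Return the stripped 'content' of a tika result dict, or '' if it is missing."""
    content = parsed.get("content") if parsed else None
    return content.strip() if content else ""
```

With this in place, `print(content_or_empty(text_raw))` prints an empty line instead of crashing on an image-only PDF.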

That is not enough, though. We also need to recognize the text inside images:

def extract_text_image(from_file, lang='deu', image_type='jpeg', resolution=300):
    print("-- Parsing image", from_file, "--")
    print("---------------------------------")
    pdf_file = wi(filename=from_file, resolution=resolution)
    image = pdf_file.convert(image_type)
    image_blobs = []
    for img in image.sequence:
        img_page = wi(image=img)
        image_blobs.append(img_page.make_blob(image_type))
    extract = []
    for img_blob in image_blobs:
        image = Image.open(io.BytesIO(img_blob))
        text = pytesseract.image_to_string(image, lang=lang)
        extract.append(text)
    for item in extract:
        for line in item.split("\n"):
            print(line)
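
The function above prints every line as it comes back from Tesseract, including blank lines and, typically, a trailing form-feed page separator. If cleaner output is wanted, the raw string can be filtered first. This is only a sketch; `clean_ocr_lines` is a hypothetical helper, not part of pytesseract.

```python
def clean_ocr_lines(text):
    """Split OCR output into lines, dropping blank lines and surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]
```

It could be applied to the value returned by `pytesseract.image_to_string` before printing.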

Putting the two together, the complete code is:

import io
import sys
 
from PIL import Image
import pytesseract
from wand.image import Image as wi
from tika import parser
 
def extract_text_image(from_file, lang='deu', image_type='jpeg', resolution=300):
    print("-- Parsing image", from_file, "--")
    print("---------------------------------")
    pdf_file = wi(filename=from_file, resolution=resolution)
    converted = pdf_file.convert(image_type)
    for img in converted.sequence:
        img_page = wi(image=img)
        # Hand the rendered page to Pillow, then run Tesseract OCR on it
        page_image = Image.open(io.BytesIO(img_page.make_blob(image_type)))
        text = pytesseract.image_to_string(page_image, lang=lang)
        for part in text.split("\n"):
            print(part)
 
def parse_text(from_file):
    print("-- Parsing text", from_file, "--")
    text_raw = parser.from_file(from_file)
    print("---------------------------------")
    # 'content' can be None when the PDF has no text layer
    print((text_raw['content'] or '').strip())
    print("---------------------------------")
 
if __name__ == '__main__':
    parse_text(sys.argv[1])
    extract_text_image(sys.argv[1], sys.argv[2])
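
As written, the script raises IndexError when the language argument is omitted, because it reads sys.argv[2] unconditionally. One possible alternative is argparse with the language optional. The function name `build_arg_parser` and the argument names are assumptions; the default 'deu' mirrors the default of extract_text_image.

```python
import argparse


def build_arg_parser():
    """Build a command-line parser: a required PDF path and an optional OCR language."""
    arg_parser = argparse.ArgumentParser(
        description="Extract the text layer and OCR text from a PDF."
    )
    arg_parser.add_argument("pdf", help="path to the PDF file")
    arg_parser.add_argument(
        "lang",
        nargs="?",
        default="deu",
        help="Tesseract language code (default: deu)",
    )
    return arg_parser
```

In the main block, `args = build_arg_parser().parse_args()` would then supply `args.pdf` and `args.lang` in place of the sys.argv lookups.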

Step 3: Run it

Suppose example.pdf contains a title and body as selectable text, plus an image with its own title and text.

Run it from the command line like this:

python run.py example.pdf deu | xargs -0 echo > extract.txt

The resulting extract.txt looks like this:

-- Parsing text example.pdf --
---------------------------------
Title pure text

Content pure text

Slide 1
Slide 2
---------------------------------
-- Parsing image example.pdf --
---------------------------------
Title pure text

Content pure text

Title in image

Text in image
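
Note that the OCR pass repeats the lines already found in the text layer ("Title pure text", "Content pure text"), because it reads the whole rendered page. If only the text that lives in images is wanted, the two results can be compared. The sketch below uses a hypothetical helper, `ocr_only_lines`, and assumes exact line matches, which is fragile since OCR can misread characters.

```python
def ocr_only_lines(text_layer, ocr_text):
    """Return OCR lines that do not already appear in the text layer."""
    seen = {line.strip() for line in text_layer.splitlines() if line.strip()}
    return [
        line.strip()
        for line in ocr_text.splitlines()
        if line.strip() and line.strip() not in seen
    ]


if __name__ == "__main__":
    text_layer = "Title pure text\n\nContent pure text\n\nSlide 1\nSlide 2"
    ocr_text = "Title pure text\n\nContent pure text\n\nTitle in image\n\nText in image"
    print(ocr_only_lines(text_layer, ocr_text))
```

For the sample output above, this leaves only "Title in image" and "Text in image".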