Extracting all text from PDF, Word, Excel, PPT, CSV and TXT files with Python

Updated: 2023-08-21 11:07:21  Author: DreamingBetter

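This article shows how to read PDF, Word, Excel, PPT, CSV and TXT files in Python and extract all of their text. Each method comes with a code example, so readers who need this can use it as a reference.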

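Preface

This article shares and summarizes ways to read common file types such as PDF, Word, Excel, PPT, CSV and TXT with Python and extract all of their text.

The libraries and methods shown below are certainly not the only ones that can read these files. The code here has one main goal: a convenient way to extract all the text from a file.

For more on how to use these libraries, see their official documentation.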

Reading PDF text: PyPDF2

import PyPDF2

def read_pdf_to_text(file_path):
    # Open in binary mode; PdfReader parses the raw PDF bytes.
    with open(file_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        contents_list = []
        for page in pdf_reader.pages:
            # Fall back to '' for pages that yield no text (e.g. scanned images).
            content = page.extract_text() or ''
            contents_list.append(content)
    return '\n'.join(contents_list)

read_pdf_to_text('xxx.pdf')

Reading Word text: docx2txt

A .doc file must first be converted to .docx by hand.

import docx2txt

def read_docx_to_text(file_path):
    # process() returns the document's text as a single string.
    text = docx2txt.process(file_path)
    return text

read_docx_to_text('xxx.docx')
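
To illustrate why a .doc file has to be converted first: a .docx file is a ZIP archive of XML parts, whereas .doc is an older binary format. The sketch below uses only the standard library to pull paragraph text from `word/document.xml`. The function name `read_docx_body_text` and the tiny in-memory archive are made up for this illustration. It ignores headers, footers, tables of contents and similar parts, so it is not a replacement for docx2txt.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def read_docx_body_text(source):
    """Hypothetical helper: return body paragraph text from a .docx path or file object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read('word/document.xml'))
    paragraphs = []
    for para in root.iter(W_NS + 'p'):
        # A paragraph's visible text is split across one or more <w:t> runs.
        paragraphs.append(''.join(node.text or '' for node in para.iter(W_NS + 't')))
    return '\n'.join(paragraphs)

# Build a minimal in-memory archive (illustrative only) to exercise the function.
_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body>'
    '<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, 'w') as archive:
    archive.writestr('word/document.xml', _xml)
_buffer.seek(0)

demo_text = read_docx_body_text(_buffer)
```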

Reading Excel text: pandas

Of course, pandas can read more than just Excel files; it also handles CSV, JSON and other formats.

import pandas as pd

def read_excel_to_text(file_path):
    # Reading .xlsx requires an Excel engine such as openpyxl to be installed.
    excel_file = pd.ExcelFile(file_path)
    sheet_names = excel_file.sheet_names
    text_list = []
    for sheet_name in sheet_names:
        # Load each sheet into a DataFrame and render it as plain text.
        df = excel_file.parse(sheet_name)
        text = df.to_string(index=False)
        text_list.append(text)
    return '\n'.join(text_list)

read_excel_to_text('xxx.xlsx')
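
As mentioned above, pandas can also load CSV and JSON. Here is a minimal sketch of the same "DataFrame to text" idea for those two formats. The helper name `read_table_to_text` and its `kind` parameter are assumptions made for this example, and the in-memory sample data only stands in for real files.

```python
import io
import pandas as pd

def read_table_to_text(source, kind):
    """Hypothetical helper: load CSV or JSON with pandas and return it as plain text."""
    if kind == 'csv':
        df = pd.read_csv(source)
    elif kind == 'json':
        df = pd.read_json(source)
    else:
        raise ValueError(f'unsupported kind: {kind}')
    # Same rendering as the Excel example: drop the index column.
    return df.to_string(index=False)

# In-memory sample data; a file path works in place of the StringIO objects.
csv_text = read_table_to_text(io.StringIO('name,score\nAlice,90\nBob,85\n'), 'csv')
json_text = read_table_to_text(io.StringIO('[{"name": "Alice", "score": 90}]'), 'json')
```

Unlike a plain `read()`, this goes through a DataFrame, so the output is an aligned table rather than the raw file contents.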

Reading PPT text: python-pptx (imported as pptx)

from pptx import Presentation

def read_pptx_to_text(file_path):
    prs = Presentation(file_path)
    text_list = []
    for slide in prs.slides:
        for shape in slide.shapes:
            # Only shapes with a text frame (titles, text boxes, ...) carry text.
            if shape.has_text_frame:
                text_frame = shape.text_frame
                text = text_frame.text
                if text:
                    text_list.append(text)
    return '\n'.join(text_list)

read_pptx_to_text('xxx.pptx')

Reading CSV, TXT and other plain text: just open() and read()

def read_txt_to_text(file_path):
    # Assumes the file is UTF-8; change the encoding if your files differ.
    with open(file_path, 'r', encoding='utf-8') as f:
        text = f.read()
    return text

read_txt_to_text('xxx.csv')
read_txt_to_text('xxx.txt')
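
Plain text files do not always share one encoding, so a direct `read()` can fail with `UnicodeDecodeError`. The sketch below tries a list of encodings in order. The helper name `read_txt_with_fallback` and the default list `('utf-8', 'gbk')` are assumptions for this example; pick encodings that match your own files.

```python
import os
import tempfile

def read_txt_with_fallback(file_path, encodings=('utf-8', 'gbk')):
    """Hypothetical helper: return the text using the first encoding that decodes cleanly."""
    for encoding in encodings:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one.
            continue
    raise ValueError(f'could not decode {file_path} with any of {encodings}')

# Demo: write a GBK-encoded file, then read it back without naming the encoding.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.txt')
    with open(demo_path, 'w', encoding='gbk') as f:
        f.write('中文文本')
    demo_text = read_txt_with_fallback(demo_path)
```

Order matters here: an encoding that accepts almost any byte sequence would mask the ones listed after it, so stricter encodings such as UTF-8 should come first.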

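Reading any file format

With all of the functions above in place, we can write a single function that accepts a file in any of these formats.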

# Map each supported suffix to the reader function defined earlier.
support = {
    'pdf': read_pdf_to_text,
    'docx': read_docx_to_text,
    'xlsx': read_excel_to_text,
    'pptx': read_pptx_to_text,
    'csv': read_txt_to_text,
    'txt': read_txt_to_text,
}

def read_any_file_to_text(file_path):
    # Take the text after the last dot, lower-cased so 'XXX.PDF' also matches.
    file_suffix = file_path.split('.')[-1].lower()
    func = support.get(file_suffix)
    if func is None:
        return 'This file format is not supported yet'
    # Call the function object directly rather than eval() on its name.
    text = func(file_path)
    return text

read_any_file_to_text('xxx.pdf')
read_any_file_to_text('xxx.docx')
read_any_file_to_text('xxx.xlsx')
read_any_file_to_text('xxx.pptx')
read_any_file_to_text('xxx.csv')
read_any_file_to_text('xxx.txt')
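
One possible variation on the suffix-to-function lookup is to let each reader register itself through a decorator, so the table and the functions cannot drift apart. This is only a sketch: `READERS`, `register`, `read_plain_text` and `read_any` are names invented for the example, and only a plain-text reader is registered so that the block runs with the standard library alone.

```python
import os
import tempfile
from pathlib import Path

# Registry filled in by the decorator below: suffix -> reader function.
READERS = {}

def register(*suffixes):
    """Hypothetical decorator: record the decorated function for each given suffix."""
    def decorator(func):
        for suffix in suffixes:
            READERS[suffix] = func
        return func
    return decorator

@register('csv', 'txt')
def read_plain_text(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        return f.read()

def read_any(file_path):
    # Path.suffix gives '.TXT'; strip the dot and lower-case it before the lookup.
    suffix = Path(file_path).suffix.lstrip('.').lower()
    reader = READERS.get(suffix)
    if reader is None:
        raise ValueError(f'unsupported file format: {suffix!r}')
    return reader(file_path)

# Demo: an upper-case suffix still finds the registered reader.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'NOTE.TXT')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('hello')
    demo_text = read_any(demo_path)
```

Raising `ValueError` instead of returning a message string is a design choice in this sketch: the caller can then tell an unsupported format apart from a file whose text happens to be that message.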