Four Ways to Read Large Excel Files in Python, with Optimization Tips

Updated: 2025-09-17 15:16:00 Author: 靈光通碼

This article walks through four ways to read large Excel files in Python, together with optimization tips. The sample code is explained in detail, so follow along if you are interested.

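Core methods

Line-by-line reading - the most common approach; memory use is bounded by the longest line, not by the file size

Chunked reading - suited to very large files; memory use is controlled by the chunk size

Memory mapping - high performance; the file is mapped into virtual memory and pages are loaded on demand

Buffered reading - balances performance against memory use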

Handling special cases

CSV files - use the chunksize parameter of pandas

JSON Lines - parse one JSON object per line

Text analysis - a memory-efficient word-count example

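Key optimization techniques

Use generators - avoid loading all data into memory at once

Choose a sensible chunk size - balance memory use against I/O efficiency

Monitor progress - report processing progress as it happens

Handle errors - deal with encoding errors, missing files, and similar exceptions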
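The techniques listed above can be combined in one small generator. This is a minimal sketch, not part of the original script: the name `lines_with_progress` is made up for illustration, and it assumes an ASCII-compatible encoding such as UTF-8, so that splitting on newline bytes is safe. It reports progress as a fraction of bytes read and replaces undecodable bytes instead of raising.

```python
import os
from typing import Iterator, Tuple


def lines_with_progress(path: str, encoding: str = "utf-8") -> Iterator[Tuple[str, float]]:
    """Yield (line, fraction_done) pairs without loading the whole file."""
    total = os.path.getsize(path)
    done = 0
    # Binary mode so that len(raw) counts bytes, matching os.path.getsize.
    with open(path, "rb") as fh:
        for raw in fh:
            done += len(raw)
            # errors="replace" keeps going when a byte sequence cannot be decoded.
            line = raw.decode(encoding, errors="replace").rstrip("\r\n")
            yield line, (done / total if total else 1.0)
```

A caller can print the fraction every N lines, which gives a percentage rather than a bare line count.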

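Usage recommendations

Under 100 MB: read the file straight into memory
100 MB to 1 GB: read line by line or in small chunks
Over 1 GB: use memory mapping or process the file in large batches
Structured data: use the chunksize parameter of pandas

These methods can process files of several GB, or even tens of GB, without running out of memory. Pick whichever one best fits your needs.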
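The size thresholds above could be turned into a small dispatch helper. The following is only a sketch: the function names and the returned labels are illustrative and not part of the original script, and the thresholds are simply the ones listed above.

```python
import os

MB = 1024 * 1024
GB = 1024 * MB


def choose_strategy(size_bytes: int) -> str:
    """Map a file size to the reading strategy suggested in the text."""
    if size_bytes < 100 * MB:
        return "read_all"          # small enough to load in one go
    if size_bytes <= 1 * GB:
        return "line_by_line"      # or small fixed-size chunks
    return "mmap_or_large_chunks"  # memory mapping or large batches


def choose_strategy_for_file(path: str) -> str:
    """Convenience wrapper that looks up the size on disk."""
    return choose_strategy(os.path.getsize(path))
```

The thresholds are rules of thumb; the right cut-off depends on how much RAM is available and on how much the data expands once parsed.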

Sample code:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Large-file reading examples: several ways to avoid running out of memory
"""

import os
import mmap
import json
from typing import Generator, Iterator
import pandas as pd
 
 
class LargeFileReader:
    """Helper class for reading large files"""

    def __init__(self, file_path: str, encoding: str = 'utf-8'):
        self.file_path = file_path
        self.encoding = encoding

    def get_file_size(self) -> int:
        """Return the file size in bytes"""
        return os.path.getsize(self.file_path)

    def get_file_size_mb(self) -> float:
        """Return the file size in MB"""
        return self.get_file_size() / (1024 * 1024)
 
 
def method1_line_by_line(file_path: str, encoding: str = 'utf-8') -> Generator[str, None, None]:
    """
    Method 1: read line by line, the most common approach
    Memory use: one line at a time, so bounded by the longest line
    Use cases: text files, log files
    """
    try:
        with open(file_path, 'r', encoding=encoding) as file:
            for line_num, line in enumerate(file, 1):
                # Hand each line to the caller
                yield line.strip()  # strips surrounding whitespace, including the trailing newline

                # Optional: report progress
                if line_num % 10000 == 0:
                    print(f"Processed {line_num} lines")

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except UnicodeDecodeError as e:
        print(f"Encoding error: {e}")
 
 
def method2_chunk_reading(file_path: str, chunk_size: int = 1024*1024, encoding: str = 'utf-8') -> Generator[str, None, None]:
    """
    Method 2: read in fixed-size chunks, suited to very large text files
    Memory use: O(chunk_size); in text mode chunk_size counts characters, not bytes
    Use cases: very large text files (for binary files, open with 'rb' and no encoding)
    """
    try:
        with open(file_path, 'r', encoding=encoding) as file:
            while True:
                chunk = file.read(chunk_size)
                if not chunk:
                    break
                yield chunk

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except UnicodeDecodeError as e:
        print(f"Encoding error: {e}")
 
 
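def method3_mmap_reading(file_path: str, encoding: str = 'utf-8') -> Iterator[str]:
    """
    Method 3: memory-mapped file, for high-performance reading
    Memory use: the file is mapped into virtual memory; physical pages are loaded on demand
    Use cases: large files that need random access, high-performance requirements
    """
    try:
        # mmap works on raw bytes, so open the file in binary mode
        with open(file_path, 'rb') as file:
            with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
                # Read line by line; readline() returns b"" at end of file
                for line in iter(mmapped_file.readline, b""):
                    yield line.decode(encoding).strip()

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        # Includes ValueError, which mmap raises for an empty file
        print(f"Memory-mapping error: {e}")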
 
 
def method4_buffered_reading(file_path: str, buffer_size: int = 8192, encoding: str = 'utf-8') -> Generator[str, None, None]:
    """
    Method 4: buffered reading, balancing performance against memory use
    Memory use: O(buffer_size)
    Use cases: when you need a custom buffer size
    """
    try:
        with open(file_path, 'r', encoding=encoding, buffering=buffer_size) as file:
            for line in file:
                yield line.strip()

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except UnicodeDecodeError as e:
        print(f"Encoding error: {e}")
 
 
def process_large_csv(file_path: str, chunk_size: int = 10000) -> None:
    """
    Example of processing a large CSV file
    Reads in chunks using the chunksize parameter of pandas
    """
    print(f"Processing CSV file: {file_path}")

    try:
        # Read the CSV file in chunks
        chunk_iter = pd.read_csv(file_path, chunksize=chunk_size)

        total_rows = 0
        for chunk_num, chunk in enumerate(chunk_iter, 1):
            # Process the current chunk
            print(f"Processing chunk {chunk_num} with {len(chunk)} rows")

            # Example processing: basic information about each column
            print(f"Columns: {list(chunk.columns)}")
            print(f"Data types: {chunk.dtypes.to_dict()}")

            # Add your own data-processing logic here,
            # for example cleaning, calculation, or transformation

            total_rows += len(chunk)

            # Optional: limit the number of chunks processed (for testing)
            if chunk_num >= 5:  # only process the first 5 chunks
                break

        print(f"Processed {total_rows} rows in total")

    except FileNotFoundError:
        print(f"CSV file not found: {file_path}")
    except pd.errors.EmptyDataError:
        print("CSV file is empty")
    except Exception as e:
        print(f"Error while processing CSV file: {e}")
 
 
def process_large_json_lines(file_path: str) -> None:
    """
    Process a large JSON Lines file (.jsonl)
    Each line is an independent JSON object
    """
    print(f"Processing JSON Lines file: {file_path}")

    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            for line_num, line in enumerate(file, 1):
                line = line.strip()
                if not line:
                    continue

                try:
                    # Parse the JSON object
                    json_obj = json.loads(line)

                    # Process the JSON object
                    # Add your own logic here
                    print(f"Line {line_num}: {type(json_obj)} - {len(str(json_obj))} characters")

                    # Example: extract specific fields
                    if isinstance(json_obj, dict):
                        keys = list(json_obj.keys())[:5]  # show only the first 5 keys
                        print(f"  Keys: {keys}")

                except json.JSONDecodeError as e:
                    print(f"JSON parse error on line {line_num}: {e}")
                    continue

                # Report progress
                if line_num % 1000 == 0:
                    print(f"Processed {line_num} lines")

                # Optional: limit the number of lines processed (for testing)
                if line_num >= 10000:
                    break

    except FileNotFoundError:
        print(f"JSON Lines file not found: {file_path}")
 
 
def process_with_progress_callback(file_path: str, callback_interval: int = 10000) -> None:
    """
    Example of file processing with periodic progress reports
    """
    reader = LargeFileReader(file_path)
    file_size_mb = reader.get_file_size_mb()

    print(f"File size: {file_size_mb:.2f} MB")
    print("Processing file...")

    processed_lines = 0

    for line in method1_line_by_line(file_path):
        # Process each line
        # Add your own logic here
        line_length = len(line)

        processed_lines += 1

        # Progress report
        if processed_lines % callback_interval == 0:
            print(f"Processed {processed_lines:,} lines")

        # Optional: limit the number of lines processed (for testing)
        if processed_lines >= 50000:
            print("Processing limit reached, stopping")
            break

    print(f"Done, processed {processed_lines:,} lines in total")
 
 
def memory_efficient_word_count(file_path: str) -> dict:
    """
    Memory-efficient word-count example
    Suited to very large text files
    """
    word_count = {}

    print("Counting word frequencies...")

    for line_num, line in enumerate(method1_line_by_line(file_path), 1):
        # Simple whitespace word splitting (refine as needed)
        words = line.lower().split()

        for word in words:
            # Clean the word (drop punctuation and other non-alphanumerics)
            clean_word = ''.join(c for c in word if c.isalnum())
            if clean_word:
                word_count[clean_word] = word_count.get(clean_word, 0) + 1

        # Report progress
        if line_num % 10000 == 0:
            print(f"Processed {line_num} lines, vocabulary so far: {len(word_count)}")

    print(f"Counting finished, total vocabulary: {len(word_count)}")

    # Return the 10 most frequent words
    sorted_words = sorted(word_count.items(), key=lambda x: x[1], reverse=True)
    return dict(sorted_words[:10])
 
 
def main():
    """Main function: demonstrates the large-file reading methods"""

    # Note: replace these with your actual file paths
    test_file = "large_file.txt"  # replace with the path to a real large file
    csv_file = "large_data.csv"   # replace with the path to a real CSV file
    json_file = "large_data.jsonl"  # replace with the path to a real JSON Lines file

    print("=== Large-file reading examples ===\n")

    # Check that the file exists
    if not os.path.exists(test_file):
        print(f"Test file {test_file} does not exist")
        print("Create a test file or change the file path")
        return

    # Create the file reader
    reader = LargeFileReader(test_file)

    print(f"File path: {test_file}")
    print(f"File size: {reader.get_file_size_mb():.2f} MB\n")

    # Example 1: line-by-line reading (recommended for most text files)
    print("=== Method 1: line-by-line reading ===")
    line_count = 0
    for line in method1_line_by_line(test_file):
        line_count += 1
        if line_count <= 5:  # show only the first 5 lines
            print(f"Line {line_count}: {line[:50]}...")
        if line_count >= 10000:  # limit the number of lines processed
            break
    print(f"Processed {line_count} lines\n")

    # Example 2: chunked reading
    print("=== Method 2: chunked reading ===")
    chunk_count = 0
    for chunk in method2_chunk_reading(test_file, chunk_size=1024):
        chunk_count += 1
        if chunk_count <= 3:  # show only the first 3 chunks
            print(f"Chunk {chunk_count}: {len(chunk)} characters")
        if chunk_count >= 10:  # limit the number of chunks processed
            break
    print(f"Processed {chunk_count} chunks\n")

    # Example 3: memory mapping
    print("=== Method 3: memory mapping ===")
    mmap_count = 0
    try:
        for line in method3_mmap_reading(test_file):
            mmap_count += 1
            if mmap_count >= 10000:  # limit the number of lines processed
                break
        print(f"Processed {mmap_count} lines using memory mapping\n")
    except Exception as e:
        print(f"Memory mapping failed: {e}\n")

    # Example 4: processing with progress reports
    # (this uses method 1 under the hood, not method4_buffered_reading)
    print("=== Processing with progress reports ===")
    process_with_progress_callback(test_file, callback_interval=5000)
    print()

    # Example 5: CSV file processing
    if os.path.exists(csv_file):
        print("=== CSV file processing ===")
        process_large_csv(csv_file, chunk_size=1000)
        print()

    # Example 6: JSON Lines file processing
    if os.path.exists(json_file):
        print("=== JSON Lines file processing ===")
        process_large_json_lines(json_file)
        print()

    # Example 7: word count
    print("=== Memory-efficient word count ===")
    try:
        top_words = memory_efficient_word_count(test_file)
        print("Top 10 most frequent words:")
        for word, count in top_words.items():
            print(f"  {word}: {count}")
    except Exception as e:
        print(f"Word count failed: {e}")


if __name__ == "__main__":
    main()

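Performance optimization suggestions:

1. Choose the right method:

Line-by-line reading: suits most text files
Chunked reading: suits binary files or cases that need custom chunk handling
Memory mapping: suits random access or high-performance requirements
pandas chunking: suits structured data (CSV)

2. Memory optimization:

Release variables promptly once they are no longer needed
Use generators instead of lists
Avoid loading the whole file into memory at once

3. Performance optimization:

Choose a sensible buffer size
Use the appropriate encoding
Consider multiprocessing or multithreading

4. Error handling:

Handle missing files
Handle encoding errors
Handle I/O errors such as running out of disk space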
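For the multiprocessing/multithreading suggestion, one possible approach is to split the file into byte ranges that each begin at the start of a line and hand every range to a worker. The sketch below is an illustration only: all three function names are invented here, and it uses a thread pool from the standard library to keep the example short.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def line_aligned_ranges(path: str, parts: int) -> List[Tuple[int, int]]:
    """Split a file into up to `parts` byte ranges that each start on a line boundary."""
    size = os.path.getsize(path)
    if size == 0:
        return []
    cuts = [0]
    with open(path, "rb") as fh:
        for i in range(1, parts):
            fh.seek(size * i // parts)
            fh.readline()              # skip ahead to the next line start
            pos = fh.tell()
            if cuts[-1] < pos < size:  # ignore duplicate or end-of-file cuts
                cuts.append(pos)
    cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def count_lines(path: str, start: int, end: int) -> int:
    """Count the lines whose first byte lies in [start, end)."""
    count = 0
    with open(path, "rb") as fh:
        fh.seek(start)
        while fh.tell() < end:
            if not fh.readline():
                break
            count += 1
    return count


def parallel_line_count(path: str, workers: int = 4) -> int:
    """Count lines by processing each byte range in its own worker."""
    ranges = line_aligned_ranges(path, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: count_lines(path, *r), ranges))
```

For CPU-bound per-line work, `ProcessPoolExecutor` is the usual substitute; in that case the lambda would have to become a module-level function so that it can be pickled.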

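Reading large Excel files

1. Chunked reading with pandas (.xlsx, .xls)

Suits medium-sized Excel files

Can process multiple worksheets

Infers data types automatically

Note that pandas.read_excel, unlike read_csv, has no chunksize parameter; chunks have to be emulated with skiprows and nrows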
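The article does not show code for this method, so the following is a hedged sketch of one way to emulate chunking with `pd.read_excel`, using its `skiprows` and `nrows` parameters. The helper name `read_excel_in_chunks` is hypothetical, and the sketch assumes the first row of the sheet is a header and that the openpyxl engine is installed for .xlsx files.

```python
from typing import Iterator

import pandas as pd


def read_excel_in_chunks(path: str, chunk_size: int = 10000, sheet_name=0) -> Iterator[pd.DataFrame]:
    """Yield DataFrames of at most `chunk_size` data rows each."""
    start = 0
    while True:
        chunk = pd.read_excel(
            path,
            sheet_name=sheet_name,
            # Keep header row 0 and skip the data rows that were already read.
            skiprows=range(1, start + 1) if start else None,
            nrows=chunk_size,
        )
        if chunk.empty:
            break
        yield chunk
        if len(chunk) < chunk_size:  # a short chunk means the sheet is exhausted
            break
        start += chunk_size
```

Each call reopens and reparses the workbook, so this bounds the size of each DataFrame rather than the total parsing work, which is one reason the method is best kept to medium-sized files.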

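2. Row-by-row reading with openpyxl (.xlsx) - recommended

The most memory-efficient of the four methods

Uses read_only=True mode

Rows are streamed one at a time, so memory use stays roughly constant as the row count grows

Suits very large Excel files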
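A minimal sketch of the read-only approach described above is shown here. The generator name `iter_xlsx_rows` is an assumption made for illustration; `load_workbook(read_only=True)` and `iter_rows(values_only=True)` are the openpyxl calls it relies on.

```python
from typing import Iterator, Optional, Tuple

from openpyxl import load_workbook


def iter_xlsx_rows(path: str, sheet_name: Optional[str] = None) -> Iterator[Tuple]:
    """Yield each row of a worksheet as a tuple of cell values."""
    # data_only=True returns cached formula results instead of formula text.
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name] if sheet_name else wb.active
        for row in ws.iter_rows(values_only=True):
            yield row
    finally:
        wb.close()  # read-only workbooks keep the file handle open until closed
```

Because this is a generator, a caller can aggregate or filter rows in a plain `for` loop without ever holding the whole sheet in memory.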

3. xlrd (.xls)

Dedicated to the legacy Excel format (xlrd 2.0 and later read only .xls)

Supports reading in chunks

Suits legacy Excel files

4. pyxlsb (.xlsb)

Handles the Excel binary format

Fast to read, and the files are small

Requires installing the pyxlsb library separately

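New features

Lightweight file inspection - get the file's structure without loading all of its data

Memory usage comparison - monitor the memory consumption of each method as it runs

Batch processing - process data in batches with progress monitoring

Multiple formats - supports .xlsx, .xls, and .xlsb

Error handling - thorough exception handling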
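The memory comparison mentioned above is not shown in the article. As a rough illustration, the sketch below measures peak Python allocations with the standard-library `tracemalloc` module. This is a substitute chosen to keep the example dependency-free; psutil would report process-level memory instead. The helper `peak_memory_of` and the two toy workloads are invented for the example.

```python
import tracemalloc
from typing import Any, Callable, Tuple


def peak_memory_of(func: Callable[[], Any]) -> Tuple[Any, int]:
    """Run func() and return (result, peak bytes allocated by Python while it ran)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def load_all(n: int) -> int:
    return sum([i * i for i in range(n)])   # builds the full list first


def stream(n: int) -> int:
    return sum(i * i for i in range(n))     # generator: one item at a time
```

Wrapping each reading method in `peak_memory_of(lambda: ...)` gives a like-for-like comparison, with the caveat that tracemalloc only sees allocations made through Python's allocator.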

Installing dependencies

pip install pandas openpyxl xlrd pyxlsb psutil

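Usage recommendations

Under 50 MB: read directly with pandas
50 MB to 500 MB: read in chunks with pandas
Over 500 MB: read row by row with openpyxl (recommended)
Very large files: consider converting to CSV or Parquet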
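For the conversion suggested in the last item, a one-off streaming pass from .xlsx to CSV might look like the sketch below. The function name `xlsx_to_csv` is hypothetical, only the active sheet is converted, and Parquet output is not covered here.

```python
import csv

from openpyxl import load_workbook


def xlsx_to_csv(xlsx_path: str, csv_path: str) -> int:
    """Stream the active sheet of an .xlsx file into a CSV file; return rows written."""
    wb = load_workbook(xlsx_path, read_only=True, data_only=True)
    rows_written = 0
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            for row in wb.active.iter_rows(values_only=True):
                # Empty cells come back as None; write them as empty fields.
                writer.writerow(["" if v is None else v for v in row])
                rows_written += 1
    finally:
        wb.close()
    return rows_written
```

Once the data is in CSV form, the `chunksize` parameter of `pd.read_csv` used earlier in the sample code applies directly.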

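This approach can process Excel files of several GB or more without running out of memory. Row-by-row reading with openpyxl in particular is the best choice for very large Excel files.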
