Saving Scrapy Data to Excel and MySQL

Updated: 2023-02-28 15:42:28  Author: 就是搞笑

This article shows how to save data scraped with Scrapy to Excel and to MySQL, with complete example code for each approach. It is intended as a practical reference for study or work.

1. Excel

Two approaches are covered: openpyxl and pandas.

1.1 openpyxl

from openpyxl import Workbook


class ExcelPipeline:
    def __init__(self):
        # Create the Excel workbook
        self.wb = Workbook()
        # Select the first (active) worksheet
        self.ws = self.wb.active
        # Write the header row
        self.ws.append(['title', 'link', 'country',
                        'author', 'translator', 'publisher',
                        'time', 'price', 'star', 'score',
                        'people', 'comment'
                        ])

    def process_item(self, item, spider):
        self.ws.append([
            item.get('title', ''),
            item.get('link', ''),
            item.get('country', ''),
            item.get('author', ''),
            item.get('translator', ''),
            item.get('publisher', ''),
            item.get('time', ''),
            item.get('price', ''),
            item.get('star', ''),
            item.get('score', ''),
            item.get('people', ''),
            item.get('comment', '')
        ])
        return item

    def close_spider(self, spider):
        self.wb.save('result.xlsx')

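1.1.1 Code explanation

ExcelPipeline is a plain Python class; a Scrapy item pipeline does not need to inherit from a base class. It defines three methods: __init__(), process_item() and close_spider().

In the __init__() method:

We create an Excel workbook and select its first worksheet, then write the header row.
You can also put this code in the open_spider method instead.

In the process_item() method, we write each item to the worksheet as one row.

The process_item method:

It does not overwrite data that has already been written; it appends a new row after the existing data.
Scrapy calls process_item once per item, and every call appends one more row at the end of the sheet.

In the close_spider() method, we save the Excel file.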
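The call order described above can be sketched with a small driver that invokes the hooks in the order Scrapy does: open_spider once, process_item once per item, close_spider at the end. This is a minimal sketch, assuming openpyxl is installed. The two-column layout, the path argument and the run_demo helper are illustrative and not part of the original pipeline.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook


class ExcelPipeline:
    def __init__(self, path='result.xlsx'):
        # Hypothetical parameter so the demo can write to a temporary file
        self.path = path

    def open_spider(self, spider):
        # Same setup as __init__ in the article, moved to open_spider
        self.wb = Workbook()
        self.ws = self.wb.active
        self.ws.append(['title', 'link'])

    def process_item(self, item, spider):
        # Each call appends one row below the existing ones
        self.ws.append([item.get('title', ''), item.get('link', '')])
        return item

    def close_spider(self, spider):
        self.wb.save(self.path)


def run_demo():
    """Drive the pipeline by hand and read the saved rows back."""
    path = os.path.join(tempfile.mkdtemp(), 'result.xlsx')
    pipeline = ExcelPipeline(path)
    pipeline.open_spider(spider=None)
    items = [
        {'title': 'Book A', 'link': 'https://example.com/a'},
        {'title': 'Book B'},  # no link: the default '' is written instead
    ]
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    sheet = load_workbook(path).active
    return [list(row) for row in sheet.iter_rows(values_only=True)]


rows = run_demo()
print(len(rows))  # header row plus one row per item
```

In a real project Scrapy makes these calls itself; the driver only makes the sequence visible.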

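1.1.2 Note

You may have noticed that process_item() uses item.get(key, default):

Some items may lack certain keys, and accessing a missing key with item[key] raises a KeyError.

If your data has already been cleaned so that every field is present, you can use item[key] directly.

item.get(key, default) reads a field from the item and returns the empty string '' when the key does not exist.

In Scrapy, an item is a dict-like object made up of key-value pairs, each representing one field. When processing an item we usually need to read the value of some field, and the get method is a convenient way to do so.

get takes two arguments: key is the key to look up, and default is the value returned when the key is missing. For example, item.get('title', '') returns '' when the item has no title.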
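The behaviour of get can be checked with plain dicts. The sketch below uses a hypothetical to_row helper and a shortened field list. Note that the default is only used when the key is absent; a key that is present with the value None still yields None.

```python
FIELDS = ['title', 'author', 'price']  # shortened field list, for illustration


def to_row(item, fields=FIELDS):
    """Build one spreadsheet row, using '' for any field the item lacks."""
    return [item.get(field, '') for field in fields]


complete = {'title': 'Book A', 'author': 'Someone', 'price': '10.00'}
partial = {'title': 'Book B', 'price': None}  # no author, price is None

print(to_row(complete))  # ['Book A', 'Someone', '10.00']
print(to_row(partial))   # ['Book B', '', None]
```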

1.2 pandas

import pandas as pd


class ExcelPipeline:
    def __init__(self):
        # Create an empty DataFrame with the expected columns
        self.df = pd.DataFrame(columns=['title', 'link', 'country',
                                        'author', 'translator', 'publisher',
                                        'time', 'price', 'star', 'score',
                                        'people', 'comment'
                                        ])

    def process_item(self, item, spider):
        # Fill in defaults for missing fields, then add the item to the DataFrame
        item['title'] = item.get('title', '')
        item['link'] = item.get('link', '')
        item['country'] = item.get('country', '')
        item['author'] = item.get('author', '')
        item['translator'] = item.get('translator', '')
        item['publisher'] = item.get('publisher', '')
        item['time'] = item.get('time', '')
        item['price'] = item.get('price', '')
        item['star'] = item.get('star', '')
        item['score'] = item.get('score', '')
        item['people'] = item.get('people', '')
        item['comment'] = item.get('comment', '')
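        series = pd.Series(item)
        # DataFrame.append was removed in pandas 2.0, so concatenate a one-row frame instead
        self.df = pd.concat([self.df, series.to_frame().T], ignore_index=True)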
        return item

    def close_spider(self, spider):
        # Save the DataFrame to an Excel file
        self.df.to_excel('result.xlsx', index=False)

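1.2.1 Code explanation

This defines an ExcelPipeline class with three methods: __init__, process_item and close_spider.

__init__ initializes the instance and creates the empty DataFrame.
process_item handles each scraped item and adds it to the DataFrame as a new row.
close_spider runs when the spider closes and saves the DataFrame to an Excel file.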
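An alternative that the code above does not use is to collect the rows in a plain list and build the DataFrame once when the spider closes. Growing a DataFrame row by row creates a new DataFrame on every item, which gets slower as the crawl grows. The following is a sketch under that assumption; the class name BufferedExcelPipeline and the to_frame helper are hypothetical.

```python
import pandas as pd

FIELDS = ['title', 'link', 'country', 'author', 'translator', 'publisher',
          'time', 'price', 'star', 'score', 'people', 'comment']


class BufferedExcelPipeline:
    def open_spider(self, spider):
        # Rows are kept as plain dicts until the spider closes
        self.rows = []

    def process_item(self, item, spider):
        # Use '' for any field that is missing from the item
        self.rows.append({field: item.get(field, '') for field in FIELDS})
        return item

    def to_frame(self):
        # Build the DataFrame in one step, with a fixed column order
        return pd.DataFrame(self.rows, columns=FIELDS)

    def close_spider(self, spider):
        # Writing .xlsx files requires an Excel engine such as openpyxl
        self.to_frame().to_excel('result.xlsx', index=False)
```

The output file is the same as before; only the point at which the DataFrame is constructed changes.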

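1.2.2 Common errors

The code contains many lines of the form item['title'] = item.get('title', '').

You can leave them out. pandas itself accepts None and missing keys (they become missing values in the DataFrame), but filling in defaults guarantees that every row has the same columns and that absent fields are written as empty strings. Note that item.get(key, '') only covers keys that are absent; a key whose value is None keeps None.

Converting the dict to a Series

self.df is a DataFrame, while item is a dict-like object. The item is therefore converted to a Series first and then added to the DataFrame. DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat:

series = pd.Series(item)
self.df = pd.concat([self.df, series.to_frame().T], ignore_index=True)

The error "only Series and DataFrame objs are valid" is raised when an object that is neither a Series nor a DataFrame, such as a raw item, is passed to pandas for concatenation.

The conversion above avoids this problem.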
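The error and the fix can be reproduced in a few lines. This is a minimal sketch with made-up sample data: passing a raw dict to pd.concat raises the TypeError, while converting it to a Series and then to a one-row DataFrame works.

```python
import pandas as pd

df = pd.DataFrame({'title': ['Book A'], 'link': ['https://example.com/a']})
row = {'title': 'Book B', 'link': 'https://example.com/b'}

# Passing the dict directly: pandas refuses to concatenate it
try:
    pd.concat([df, row], ignore_index=True)
    message = ''
except TypeError as exc:
    message = str(exc)
print(message)

# Converting to a Series, then to a one-row DataFrame, works
df = pd.concat([df, pd.Series(row).to_frame().T], ignore_index=True)
print(len(df))  # 2
```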

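1.3 openpyxl vs. pandas

pandas and openpyxl are both capable Python libraries, and each has its strengths in different scenarios.

If you need to work with many Excel files or perform Excel-specific operations such as cell formatting and charts, openpyxl is likely the better fit: it focuses on reading, writing and manipulating Excel files and gives you finer control over them.
If the data is already in Python and needs statistical analysis or processing, such as aggregation, pivot tables, grouping, cleaning or visualization, pandas is likely the better fit because it offers a rich set of data-processing tools and functions.

In short, both are good tools; which one to use depends on your requirements.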

2. MySQL

You can use a Python MySQL driver such as mysql-connector-python or pymysql. This section focuses on pymysql.

import pymysql


class MySQLPipeline:
    def __init__(self):
        # Connect to the MySQL database
        self.conn = pymysql.connect(
            host='localhost',
            port=3306,
            user='root',
            password='your_password',
            database='your_database',
            charset='utf8mb4',
            cursorclass=pymysql.cursors.DictCursor
        )
        # Create a cursor object
        self.cursor = self.conn.cursor()
        # Create the table
        self.create_table()

    def create_table(self):
        # SQL statement: create the table if it does not exist yet
        sql = '''CREATE TABLE IF NOT EXISTS `book` (
            `id` int(11) NOT NULL AUTO_INCREMENT,
            `title` varchar(255) NOT NULL,
            `link` varchar(255) NOT NULL,
            `country` varchar(255) NOT NULL,
            `author` varchar(255) NOT NULL,
            `translator` varchar(255) NOT NULL,
            `publisher` varchar(255) NOT NULL,
            `time` varchar(255) NOT NULL,
            `price` varchar(255) NOT NULL,
            `star` varchar(255) NOT NULL,
            `score` varchar(255) NOT NULL,
            `people` varchar(255) NOT NULL,
            `comment` varchar(255) NOT NULL,
            PRIMARY KEY (`id`)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci'''
        # Execute the SQL statement
        self.cursor.execute(sql)
        # Commit the transaction
        self.conn.commit()

    def process_item(self, item, spider):
        # SQL statement: insert one row (values are passed as parameters)
        sql = '''INSERT INTO `book` (
                `title`, `link`, `country`,
                `author`, `translator`, `publisher`,
                `time`, `price`, `star`, `score`,
                `people`, `comment`
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'''
        # Execute the SQL statement
        self.cursor.execute(sql, (
            item['title'], item['link'], item['country'],
            item['author'], item['translator'], item['publisher'],
            item['time'], item['price'], item['star'], item['score'],
            item['people'], item['comment']
        ))
        # Commit the transaction
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the cursor
        self.cursor.close()
        # Close the database connection
        self.conn.close()

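2.1 Code explanation

We create a custom Scrapy pipeline named MySQLPipeline.

The __init__ method opens the connection using the MySQL configuration (host, port, user, password, database and charset).

It also calls create_table; if you can guarantee that the table already exists, this call is unnecessary.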

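If you would rather not hard-code the connection details in the pipeline, you can define the MySQL-related variables in settings.py and read them from there.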
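One way to read such settings is Scrapy's from_crawler class method, which receives the crawler and can access crawler.settings. The sketch below makes several assumptions: the setting names (MYSQL_HOST, MYSQL_PORT and so on) are hypothetical and must match whatever you define in settings.py, and a SimpleNamespace with a plain dict stands in for the real crawler so the example runs without Scrapy or a database.

```python
from types import SimpleNamespace


class MySQLPipeline:
    def __init__(self, conn_kwargs):
        # Keyword arguments that will be passed to pymysql.connect
        self.conn_kwargs = conn_kwargs

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        # Setting names are hypothetical; use the names from your settings.py
        return cls({
            'host': settings.get('MYSQL_HOST', 'localhost'),
            'port': int(settings.get('MYSQL_PORT', 3306)),
            'user': settings.get('MYSQL_USER', 'root'),
            'password': settings.get('MYSQL_PASSWORD', ''),
            'database': settings.get('MYSQL_DATABASE', ''),
            'charset': 'utf8mb4',
        })

    def open_spider(self, spider):
        # Imported here so the demo below runs without the driver installed
        import pymysql
        self.conn = pymysql.connect(**self.conn_kwargs)
        self.cursor = self.conn.cursor()


# Stand-in for Scrapy's crawler object, for demonstration only
fake_crawler = SimpleNamespace(
    settings={'MYSQL_HOST': 'db.example.com', 'MYSQL_PORT': '3307'}
)
pipeline = MySQLPipeline.from_crawler(fake_crawler)
print(pipeline.conn_kwargs['host'], pipeline.conn_kwargs['port'])  # db.example.com 3307
```

With this layout the connection is opened in open_spider rather than in __init__, so building the pipeline object never touches the database.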

The create_table method creates the book table.

The process_item method inserts the scraped data into the table.

The close_spider method closes the cursor and the connection.
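The pipeline above commits after every insert and has no error handling, so one bad row raises an exception in process_item. Here is a sketch of one possible way to roll back a failed insert and carry on. It uses sqlite3 from the standard library in place of pymysql so that it runs without a MySQL server; sqlite3 uses ? placeholders where pymysql uses %s. The two-column table, the failed counter and the count_rows helper are illustrative only.

```python
import sqlite3


class SQLitePipeline:
    """Same shape as MySQLPipeline, with sqlite3 standing in for pymysql."""

    def open_spider(self, spider):
        self.conn = sqlite3.connect(':memory:')
        self.cursor = self.conn.cursor()
        self.failed = 0
        self.cursor.execute(
            'CREATE TABLE IF NOT EXISTS book ('
            'id INTEGER PRIMARY KEY AUTOINCREMENT, '
            'title TEXT NOT NULL, link TEXT NOT NULL)'
        )
        self.conn.commit()

    def process_item(self, item, spider):
        sql = 'INSERT INTO book (title, link) VALUES (?, ?)'
        try:
            # Values are passed separately, never formatted into the SQL string
            self.cursor.execute(sql, (item.get('title'), item.get('link')))
            self.conn.commit()
        except sqlite3.Error:
            # Undo the failed statement and keep the crawl going
            self.conn.rollback()
            self.failed += 1
        return item

    def count_rows(self):
        return self.cursor.execute('SELECT COUNT(*) FROM book').fetchone()[0]

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()


pipeline = SQLitePipeline()
pipeline.open_spider(None)
for item in [
    {'title': 'Book A', 'link': 'https://example.com/a'},
    {'link': 'https://example.com/missing-title'},  # violates NOT NULL
    {'title': 'Book B', 'link': 'https://example.com/b'},
]:
    pipeline.process_item(item, None)
stored, failed = pipeline.count_rows(), pipeline.failed
pipeline.close_spider(None)
print(stored, failed)  # 2 1
```

With pymysql the same pattern would catch pymysql.MySQLError instead of sqlite3.Error.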

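2.2 Introduction to pymysql

2.2.1 Cursor objects

To work with a database in Python, you first create a connection object and then create a cursor object from that connection.

The cursor is the main object for performing database operations: it sends queries to the database and retrieves the results.

In pymysql, the commonly used cursor classes are Cursor, DictCursor and SSCursor.

Cursor: the regular cursor (default); rows are returned as tuples.
DictCursor: the dictionary cursor; rows are returned as dicts.
SSCursor: the unbuffered (server-side) cursor, suited to large result sets.

For large result sets, SSCursor uses far less client memory than the regular cursor, but the total number of rows is not known until every row has been read.

Unlike the regular cursor, the unbuffered cursor does not load the whole result set into memory; it fetches rows from the server as they are requested.

Choosing the cursor type that matches your needs makes it easier to process the returned results.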
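Reading a result in batches, which is where an unbuffered cursor pays off, can be sketched with the DB-API fetchmany method. The iter_rows helper below is hypothetical and works with any executed DB-API cursor; sqlite3 is used here only so the example runs without a MySQL server. With pymysql, the cursor would be created with conn.cursor(pymysql.cursors.SSCursor) instead.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows from an executed DB-API cursor, batch_size rows at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


# sqlite3 stands in for MySQL here; the table and data are made up
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT)')
conn.executemany(
    'INSERT INTO book (title) VALUES (?)',
    [(f'Book {i}',) for i in range(5)],
)

cur = conn.cursor()
cur.execute('SELECT id, title FROM book ORDER BY id')
titles = [title for _id, title in iter_rows(cur, batch_size=2)]
print(titles)  # ['Book 0', 'Book 1', 'Book 2', 'Book 3', 'Book 4']
conn.close()
```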

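2.2.2 Cursor types in detail

The connection is created with this argument:

cursorclass=pymysql.cursors.DictCursor

It sets the type of the rows the cursor returns. By default rows are tuples; with DictCursor they are dicts, which are often more convenient to work with. In most cases the regular cursor is enough.

The three cursors differ mainly in how query results are returned: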

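cur = conn.cursor()
cur.execute('SELECT * FROM my_table')
result = cur.fetchone()  # fetch one row; its type depends on the cursor class

# Only one of the following applies, depending on the cursor in use.
# Regular cursor: the row is a tuple, e.g. (1, 'Alice')
print(result[0])  # value of the first column
# Dict cursor: the row is a dict, e.g. {'id': 1, 'name': 'Alice'}
print(result['id'])  # value of the column named id
# Unbuffered cursor (SSCursor): the row is a tuple, as with the regular cursor
print(result[0])  # value of the first column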

When a query returns multiple rows (for example via fetchall()), the result is a sequence of tuples or of dicts:

# Regular cursor
((1, 'John', 'Doe'), (2, 'Jane', 'Doe'), (3, 'Bob', 'Smith'))
# Dict cursor
[{'id': 1, 'first_name': 'John', 'last_name': 'Doe'}, {'id': 2, 'first_name': 'Jane', 'last_name': 'Doe'}, {'id': 3, 'first_name': 'Bob', 'last_name': 'Smith'}]