Fetching a Thousand Web Pages per Second in Python with aiohttp

Updated: 2025-08-28 10:43:56  Author: 小白學大數據

Introduction

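In the era of big data, an efficient web crawler is a key tool for data collection. Traditional synchronous crawlers (for example, those built on the **requests** library) are limited by blocking I/O and struggle to issue many requests concurrently. Python's **aiohttp** library, combined with **asyncio**, makes it practical to build an asynchronous, highly concurrent crawler that can reach a thousand or more requests per second when the network and the target server can keep up.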

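This article explains how to build a high-performance crawler with **aiohttp**. It covers:

How aiohttp works and its advantages
Building an asynchronous crawler framework
Optimizing concurrent requests (connection pooling, timeout control)
Proxy IP and User-Agent rotation (dealing with anti-crawling measures)
Performance testing and tuning (reaching 1000+ QPS)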

Finally, we provide a complete code example and a benchmark script for measuring whether the crawler actually reaches a thousand page fetches per second.

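1. How aiohttp works and its advantages

1.1 Synchronous vs. asynchronous crawlers

Synchronous crawler (e.g. requests): each request must wait for the server's response before the next one can start, so blocking I/O keeps throughput low.
Asynchronous crawler (aiohttp + asyncio): an event loop provides non-blocking I/O, so many requests can be in flight at once, which greatly increases concurrency.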
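
To make the difference concrete, the sketch below simulates a slow server with sleep calls instead of real HTTP requests. The 0.1-second delay and all function names are assumptions chosen for illustration. The sequential version waits for each response in turn, while the asyncio version overlaps the waits.

```python
import asyncio
import time

DELAY = 0.1  # assumed server response time, in seconds


def fake_request_sync(i):
    # Blocks the whole program while "waiting for the server".
    time.sleep(DELAY)
    return i


async def fake_request_async(i):
    # Hands control back to the event loop while waiting.
    await asyncio.sleep(DELAY)
    return i


def run_sync(n):
    start = time.perf_counter()
    results = [fake_request_sync(i) for i in range(n)]
    return results, time.perf_counter() - start


async def run_async(n):
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request_async(i) for i in range(n)))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    _, sync_elapsed = run_sync(5)
    _, async_elapsed = asyncio.run(run_async(5))
    print(f"sequential: {sync_elapsed:.2f}s, concurrent: {async_elapsed:.2f}s")
```

The sequential time grows with the number of requests, whereas the concurrent time stays close to a single delay because all the waits overlap.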

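1.2 Core components of aiohttp

**ClientSession**: manages the HTTP connection pool and reuses TCP connections, reducing handshake overhead.
**async/await** syntax: the asynchronous programming style available since Python 3.5, which keeps the code concise. The examples in this article also call asyncio.run(), which requires Python 3.7 or later.
**asyncio.gather()**: runs multiple coroutines concurrently.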
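
As a small illustration of the last item, the sketch below runs three coroutines with asyncio.gather(). The job() coroutine and its delays are made up for the example. Results come back in the order the coroutines were passed in, not in the order they finished.

```python
import asyncio


async def job(name, delay):
    # Stand-in for a network request that takes `delay` seconds.
    await asyncio.sleep(delay)
    return name


async def main():
    # The slowest job is listed first, yet it is still first in the result.
    return await asyncio.gather(
        job("slow", 0.03),
        job("medium", 0.02),
        job("fast", 0.01),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

This ordering guarantee is what lets a crawler pair each result with its URL by position.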

2. Building an asynchronous crawler framework

2.1 Installing dependencies

pip install aiohttp beautifulsoup4 fake-useragent

2.2 A basic crawler example

import asyncio

import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    # Issue a GET request and return the response body as text.
    async with session.get(url) as response:
        return await response.text()

async def parse(session, url):
    html = await fetch(session, url)
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is None when the page has no <title> tag.
    title = soup.title.string if soup.title else None
    print(f"URL: {url} | Title: {title}")

async def main(urls):
    # Share one session so its connection pool is reused across all requests.
    async with aiohttp.ClientSession() as session:
        tasks = [parse(session, url) for url in urls]
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = [
        "https://example.com",
        "https://python.org",
        "https://aiohttp.readthedocs.io",
    ]
    asyncio.run(main(urls))

Code walkthrough:

**fetch()** issues the HTTP request and returns the HTML.
**parse()** parses the HTML and extracts the title.
**main()** uses **asyncio.gather()** to run the tasks concurrently.

3. Optimizing concurrent requests (reaching 1000+ QPS)

3.1 Using a connection pool (TCP keep-alive)

By default, **aiohttp** reuses TCP connections automatically, but the pool can also be configured explicitly:

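async def main():
    # limit=100 caps the pool at 100 simultaneous connections; force_close=False
    # keeps connections alive for reuse. Both values are aiohttp's defaults.
    conn = aiohttp.TCPConnector(limit=100, force_close=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        ...  # issue requests here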

3.2 Limiting concurrency (Semaphore)

To avoid being blocked by the target site for sending too many requests at once:

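# Allow at most 100 requests in flight at once.
# On Python 3.9 and earlier, create the semaphore inside the running coroutine instead.
semaphore = asyncio.Semaphore(100)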

async def fetch(session, url):
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()
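
To check that a semaphore really caps the number of simultaneous requests, the sketch below replaces the HTTP call with a short sleep and records the peak number of workers inside the guarded section. The worker() and run() names and the stats dictionary are hypothetical helpers, not part of aiohttp.

```python
import asyncio


async def worker(semaphore, stats):
    async with semaphore:
        # Track how many workers hold the semaphore right now.
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the HTTP request
        stats["active"] -= 1


async def run(total, limit):
    # Created inside the coroutine, so it belongs to the running event loop.
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    await asyncio.gather(*(worker(semaphore, stats) for _ in range(total)))
    return stats["peak"]


if __name__ == "__main__":
    print(asyncio.run(run(total=50, limit=5)))
```

Even though 50 tasks are started together, the peak never exceeds the limit passed to the semaphore.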

3.3 Setting timeouts

To keep a few stalled requests from holding up the whole crawler:

timeout = aiohttp.ClientTimeout(total=10)  # 10-second limit for the whole request

async def fetch(session, url):
    async with session.get(url, timeout=timeout) as response:
        return await response.text()
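
The effect of a per-request deadline can be shown without any network access. The sketch below uses asyncio.wait_for() as a stand-in for aiohttp's timeout, an assumption made only to keep the example self-contained. The fake_fetch() coroutine and its delays are hypothetical. Failures are collected with return_exceptions=True so that one slow request does not abort the rest of the batch.

```python
import asyncio


async def fake_fetch(delay):
    # Stand-in for a request whose server answers after `delay` seconds.
    await asyncio.sleep(delay)
    return f"ok after {delay}s"


async def fetch_with_deadline(delay, deadline):
    # Cancel the request if it has not finished within `deadline` seconds.
    return await asyncio.wait_for(fake_fetch(delay), timeout=deadline)


async def main():
    delays = [0.01, 1.0, 0.02]  # the second "server" is too slow
    results = await asyncio.gather(
        *(fetch_with_deadline(d, deadline=0.1) for d in delays),
        return_exceptions=True,  # report failures instead of raising
    )
    return ["timeout" if isinstance(r, asyncio.TimeoutError) else r for r in results]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The slow request is reported as a timeout while the two fast ones still return their results.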

4. Proxy IP and User-Agent rotation (dealing with anti-crawling measures)

4.1 Random User-Agent

from fake_useragent import UserAgent

ua = UserAgent()

async def fetch(session, url):
    # Build the headers per request; ua.random returns a new value on each access.
    headers = {"User-Agent": ua.random}
    async with session.get(url, headers=headers) as response:
        return await response.text()
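
If you would rather not depend on fake_useragent, a similar rotation can be sketched with the standard library. The User-Agent strings below are placeholders invented for this example, and build_headers() is a hypothetical helper; a real crawler would use current browser strings.

```python
import random

# Hypothetical pool; in practice use current, real browser strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ExampleBrowser/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
]


def build_headers(rng=random):
    # Called once per request so each request can carry a different value.
    return {"User-Agent": rng.choice(USER_AGENTS)}


if __name__ == "__main__":
    for _ in range(3):
        print(build_headers())
```

Inside fetch(), the result would be passed as headers=build_headers() to session.get().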

4.2 Proxy IP pool

import aiohttp
import asyncio
from fake_useragent import UserAgent

# Proxy settings (replace with the host and credentials from your own provider)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Proxy URL and the credentials used to authenticate with it
proxy_auth = aiohttp.BasicAuth(proxyUser, proxyPass)
proxy_url = f"http://{proxyHost}:{proxyPort}"

ua = UserAgent()
semaphore = asyncio.Semaphore(100)  # limit concurrency

async def fetch(session, url):
    headers = {"User-Agent": ua.random}
    timeout = aiohttp.ClientTimeout(total=10)
    async with semaphore:
        async with session.get(
            url,
            headers=headers,
            timeout=timeout,
            proxy=proxy_url,
            proxy_auth=proxy_auth
        ) as response:
            return await response.text()

async def main(urls):
    conn = aiohttp.TCPConnector(limit=100, force_close=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = ["https://example.com"] * 1000
    asyncio.run(main(urls))
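
The listing above sends every request through a single proxy gateway. If you have several proxy endpoints, one simple way to spread requests across them is round-robin rotation. The sketch below shows only the selection logic; the ProxyPool class is a hypothetical helper and the proxy URLs are placeholders, not real servers.

```python
import itertools

# Placeholder endpoints (hypothetical); substitute your own proxies.
PROXIES = [
    "http://proxy-a.example:8000",
    "http://proxy-b.example:8000",
    "http://proxy-c.example:8000",
]


class ProxyPool:
    """Hands out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("proxy list must not be empty")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)


if __name__ == "__main__":
    pool = ProxyPool(PROXIES)
    for _ in range(4):
        print(pool.next_proxy())
```

In fetch(), the chosen endpoint would be passed as proxy=pool.next_proxy() to session.get().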

5. Performance testing (reaching 1000+ QPS)

5.1 Benchmark code

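import asyncio
import time

# Assumes main(urls) from the complete listing in section 5.2 is defined in this module.
async def benchmark():
    urls = ["https://example.com"] * 1000  # 1000 test requests
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    await main(urls)
    elapsed = time.perf_counter() - start
    qps = len(urls) / elapsed
    print(f"QPS: {qps:.2f}")

asyncio.run(benchmark())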
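
Before running the benchmark it helps to estimate what is achievable. With a fixed number of requests in flight, steady-state throughput is roughly the concurrency divided by the average response time (Little's law). The helper below is a back-of-the-envelope sketch; estimated_qps() is a hypothetical function and the latency figures are illustrative assumptions, not measurements.

```python
def estimated_qps(concurrency, avg_latency_seconds):
    # Little's law: throughput = requests in flight / time per request.
    if avg_latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return concurrency / avg_latency_seconds


if __name__ == "__main__":
    # 100 concurrent requests at an assumed 100 ms each: about 1000 QPS.
    print(estimated_qps(100, 0.1))
    # The same concurrency at an assumed 500 ms each: about 200 QPS.
    print(estimated_qps(100, 0.5))
```

With a concurrency limit of 100, the 1000 QPS target therefore needs an average response time of about 100 ms. A slower target server calls for a higher limit or a lower expectation.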

5.2 The complete optimized code

import aiohttp
import asyncio
from fake_useragent import UserAgent

ua = UserAgent()
semaphore = asyncio.Semaphore(100)  # limit concurrency

async def fetch(session, url):
    headers = {"User-Agent": ua.random}
    timeout = aiohttp.ClientTimeout(total=10)
    async with semaphore:
        async with session.get(url, headers=headers, timeout=timeout) as response:
            return await response.text()

async def main(urls):
    conn = aiohttp.TCPConnector(limit=100, force_close=False)
    async with aiohttp.ClientSession(connector=conn) as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    urls = ["https://example.com"] * 1000
    asyncio.run(main(urls))