Four Ways for a Python Crawler to Get JavaScript-Rendered Page Content

Updated: 2025-06-03 09:29:00  Author: 小白學大數據

When scraping dynamic pages, a crawler has to emulate a client browser environment so that JavaScript runs normally and the rendered page data can be retrieved. This article covers four ways for a Python crawler to get page content after JavaScript rendering.

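1. Introduction

In modern web development, many sites load their data through JavaScript rendering (with frameworks such as React, Vue, and Angular). A traditional HTTP request (for example with Python's **requests** library) retrieves only the initial HTML and cannot capture content produced after the JS runs. Scraping such pages therefore means emulating browser behavior and waiting for the JavaScript to finish before extracting data.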
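To make the limitation concrete, the sketch below parses a hypothetical initial HTML document the way a plain HTTP client would see it. The markup, the `div.dynamic-content` container, and the helper names are assumptions made for illustration; only the standard library is used, and no JavaScript is executed.

```python
from html.parser import HTMLParser

# Hypothetical initial HTML as a plain HTTP client would receive it:
# the container is empty, and only a script would fill it in a browser.
INITIAL_HTML = """
<html><body>
  <div class="dynamic-content"></div>
  <script>
    document.querySelector("div.dynamic-content").textContent = "Loaded by JS";
  </script>
</body></html>
"""

# What the same container would look like after a browser ran the script.
RENDERED_FRAGMENT = '<div class="dynamic-content">Loaded by JS</div>'


class DivTextExtractor(HTMLParser):
    """Collects text found inside <div class="dynamic-content">."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "dynamic-content" in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_dynamic_text(html):
    """Return the text of the target div, without executing any script."""
    parser = DivTextExtractor()
    parser.feed(html)
    return "".join(parser.chunks).strip()


print(repr(extract_dynamic_text(INITIAL_HTML)))       # ''
print(repr(extract_dynamic_text(RENDERED_FRAGMENT)))  # 'Loaded by JS'
```

The script text is present in the initial HTML, but because nothing executes it, the container stays empty; that gap is what the browser-based methods in this article close.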

This article introduces several mainstream approaches:

Selenium (automated browser control)
Playwright (a newer browser automation tool)
Pyppeteer (a Python port of Puppeteer)
Requests-HTML (a lightweight HTML parsing library)

Each comes with a code example to help developers scrape dynamically rendered pages.

Method 1: Getting dynamic content with Selenium

Selenium is an automated testing tool that drives a real browser (such as Chrome or Firefox) to load the full page.

Example code

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

# Configure Chrome to run in headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")  # run without a visible window
chrome_options.add_argument("--disable-gpu")

# Path to ChromeDriver (placeholder; replace with the real path)
service = Service(executable_path="/path/to/chromedriver")
driver = webdriver.Chrome(service=service, options=chrome_options)

try:
    # Open the target page
    url = "https://example.com"
    driver.get(url)

    # Give the JS time to run (an explicit wait can replace this)
    time.sleep(3)  # simple fixed wait; WebDriverWait is recommended in practice

    # Get the rendered HTML
    rendered_html = driver.page_source
    print(rendered_html)  # includes content loaded by JS

    # Extract a specific element (raises NoSuchElementException if absent)
    element = driver.find_element(By.CSS_SELECTOR, "div.dynamic-content")
    print(element.text)
finally:
    # Close the browser even if a step above fails
    driver.quit()
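The fixed `time.sleep(3)` above either wastes time or is too short. An explicit wait instead polls a condition until it holds or a timeout expires. The sketch below shows that polling idea in plain Python; `wait_until` and `content_ready` are hypothetical names used for illustration, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Stand-in for a page whose content only appears on the third check.
attempts = {"count": 0}


def content_ready():
    attempts["count"] += 1
    return "Loaded by JS" if attempts["count"] >= 3 else None


text = wait_until(content_ready, timeout=2.0, poll_interval=0.01)
print(text, attempts["count"])  # Loaded by JS 3
```

In Selenium itself the same pattern is available as `WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.dynamic-content")))`, with `WebDriverWait` imported from `selenium.webdriver.support.ui` and `EC` being `selenium.webdriver.support.expected_conditions`.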

Pros and cons

Pros: supports all mainstream browsers and handles complex interaction (clicking, scrolling) well.
Cons: relatively slow and resource-hungry.

Method 2: Using Playwright (recommended)

Playwright is a newer browser automation tool from Microsoft; it is generally faster and more stable than Selenium.

Example code

from playwright.sync_api import sync_playwright

# Proxy configuration (sample values; replace with your own proxy's details)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

with sync_playwright() as p:
    # Launch Chromium with the proxy configured
    browser = p.chromium.launch(
        headless=True,  # headless mode
        proxy={
            "server": f"http://{proxyHost}:{proxyPort}",
            "username": proxyUser,
            "password": proxyPass,
        }
    )

    # Open a new page
    page = browser.new_page()

    try:
        # Navigate and wait for the page to load
        page.goto("https://example.com", timeout=10000)  # timeout in milliseconds (10 s)
        page.wait_for_selector("div.dynamic-content")  # wait for the target element to appear

        # Get the rendered HTML
        rendered_html = page.content()
        print(rendered_html)

        # Extract data
        element = page.query_selector("div.dynamic-content")
        if element:
            print(element.inner_text())
        else:
            print("Target element not found")

    except Exception as e:
        print(f"Error: {e}")

    finally:
        # Make sure the browser is closed
        browser.close()
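The example above hard-codes the proxy host and credentials in the script. One alternative, sketched here, is to build the same `server` / `username` / `password` dictionary from environment variables. The variable names `PROXY_HOST`, `PROXY_PORT`, `PROXY_USER`, and `PROXY_PASS`, and the helper `build_proxy_config`, are assumptions for illustration, not anything Playwright defines.

```python
import os


def build_proxy_config(env=None):
    """Return a proxy dict shaped like the one passed to launch(), or None.

    Reads the hypothetical variables PROXY_HOST, PROXY_PORT, PROXY_USER
    and PROXY_PASS from `env` (defaults to the process environment).
    """
    env = os.environ if env is None else env
    host = env.get("PROXY_HOST")
    port = env.get("PROXY_PORT")
    if not host or not port:
        return None  # no proxy configured
    config = {"server": f"http://{host}:{port}"}
    user = env.get("PROXY_USER")
    if user:
        config["username"] = user
        config["password"] = env.get("PROXY_PASS", "")
    return config


# Illustrative values only; a real run would read os.environ instead.
example_env = {
    "PROXY_HOST": "proxy.example.com",
    "PROXY_PORT": "8080",
    "PROXY_USER": "user",
    "PROXY_PASS": "secret",
}
print(build_proxy_config(example_env))
```

The result can then be passed as `p.chromium.launch(headless=True, proxy=build_proxy_config())`; when no proxy variables are set the helper returns `None`, which is also the default for that parameter.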

Pros and cons

Pros: fast, supports multiple browsers (Chromium, Firefox, WebKit), and has a more modern API.
Cons: newer, so community resources are somewhat thinner than Selenium's.

Method 3: Using Pyppeteer (a Python port of Puppeteer)

Pyppeteer is a Python library built on the Chrome DevTools Protocol, suited to efficient scraping of dynamic content.

Example code

import asyncio
from pyppeteer import launch

async def fetch_rendered_html():
    # Launch the browser
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()

        # Open the page
        await page.goto("https://example.com")
        await page.waitForSelector("div.dynamic-content")  # wait for the element to load

        # Get the HTML
        rendered_html = await page.content()
        print(rendered_html)

        # Extract data
        element = await page.querySelector("div.dynamic-content")
        text = await page.evaluate("(element) => element.textContent", element)
        print(text)
    finally:
        # Close the browser even if a step above fails
        await browser.close()

# Run the coroutine (asyncio.run replaces the older get_event_loop() pattern)
asyncio.run(fetch_rendered_html())

Pros and cons

Pros: lightweight, controls Chrome directly, and suits high-throughput scraping.
Cons: supports Chromium only, and asynchronous programming can add complexity.
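The throughput benefit of an asynchronous API comes from running several fetches concurrently, and the added complexity comes from having to bound that concurrency. The sketch below shows one common pattern with `asyncio.Semaphore` and `asyncio.gather`. `fetch_page` is a hypothetical stand-in that sleeps instead of driving a browser, so the example runs without Pyppeteer installed.

```python
import asyncio


async def fetch_page(url):
    """Hypothetical stand-in for a browser fetch; sleeps instead of rendering."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def fetch_all(urls, max_concurrency=2):
    """Fetch every URL with at most `max_concurrency` fetches in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        async with semaphore:
            return await fetch_page(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


urls = [f"https://example.com/page/{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls))
print(len(pages))  # 5
```

In a real crawler, `fetch_page` would open a page in a shared browser instance and return its content; the limit of 2 is arbitrary and would be tuned to the machine and the target site.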

Method 4: Using Requests-HTML (a lightweight option)

Requests-HTML combines **requests** and **pyppeteer** and suits simple dynamic pages.

Example code

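from requests_html import HTMLSession

session = HTMLSession()
url = "https://example.com"

# Render the JS (the first call to render() downloads Chromium)
response = session.get(url)
response.html.render(timeout=20)  # wait for the JS to run

# Get the rendered HTML
rendered_html = response.html.html
print(rendered_html)

# Extract data; find(..., first=True) returns None when nothing matches
element = response.html.find("div.dynamic-content", first=True)
if element:
    print(element.text)
else:
    print("Target element not found")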

Pros and cons

Pros: simple API, a good fit for small crawlers.
Cons: limited features; not suited to complex pages.

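Summary and recommendations

Method	Best for	Speed	Complexity
Selenium	Compatibility across many browsers	Slow	Medium
Playwright	High-performance, modern browser automation	Fast	Low
Pyppeteer	Direct control of Chrome	Fast	Medium-high
Requests-HTML	Lightweight dynamic rendering	Medium	Low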

Recommendations:

Prefer Playwright (fast, friendly API).
Choose Selenium when you need compatibility with an older project.
Try Requests-HTML for small crawlers.

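Conclusion

This article covered four ways to scrape JavaScript-rendered content with Python, each with a complete code example. The key to scraping dynamic pages is emulating browser behavior, and developers can pick the approach that fits their needs. As front-end technology evolves, crawlers may need smarter strategies against anti-scraping measures, such as simulating user behavior or working out encrypted APIs.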
