How to integrate Selenium into Scrapy to crawl web pages
1. Background
- When crawling web pages we usually reach for one of three libraries: requests, scrapy, or selenium. requests suits small crawlers, scrapy is for building larger crawling projects, and selenium is mainly for dealing with complex pages (pages with heavy JS rendering, where the underlying requests are hard to construct or keep changing).
- For a large crawling project, scrapy is the obvious framework to build on, but parsing complex JS-rendered pages in it is painful. Fetching such pages with selenium, by rendering them in a real browser, is much more convenient: we do not need to care what requests happen behind the scenes or how the page is rendered, only about the final result ("what you see is what you can crawl"), but selenium on its own is far too slow.
- So, if we can integrate selenium into scrapy and let selenium handle just the complex pages, the resulting crawler becomes very powerful and can cope with virtually any site.
2. Environment
- python 3.6.1
- OS: win7
- IDE: pycharm
- Chrome browser installed
- chromedriver configured (added to the PATH environment variable; a quick smoke test follows this list)
- selenium 3.7.0
- scrapy 1.4.0
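If you are not sure whether chromedriver is set up correctly, a small standalone check can save debugging time later. This is only a sketch, separate from the project code; the URL is just a placeholder.
# smoke_test.py -- minimal check that selenium can drive Chrome via chromedriver on the PATH
from selenium import webdriver

browser = webdriver.Chrome()             # fails here if chromedriver cannot be found on the PATH
browser.get("https://www.example.com")   # any reachable page works; example.com is only a placeholder
print(browser.title)                     # prints the page title if the browser rendered the page
browser.quit()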
3. How it works
3.1. How a request flows through scrapy
First, look at the latest scrapy architecture diagram:

Part of the flow:
Step 1: the engine generates Requests and sends them to the scheduler, where they enter a queue and wait to be scheduled.
Step 2: the scheduler dequeues these requests and hands them back to the engine.
Step 3: the engine passes the requests through the downloader middlewares (there can be several: adding headers, proxies, custom ones, and so on).
Step 4: after the middlewares are done, the requests are sent to the Downloader to be fetched. Looking at this flow, the natural hook is the downloader middleware: handle the request there directly with selenium.
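To make the idea concrete, here is a minimal sketch (the class name is illustrative, not the article's final code) of a downloader middleware whose process_request fetches the content itself and returns a Response, so the request never reaches the Downloader:
# Bare-bones illustration of the short-circuit: if process_request returns a Response,
# scrapy skips the Downloader and passes this response on to the spider.
from scrapy.http import HtmlResponse

class RenderWithBrowserMiddleware(object):   # illustrative name only
    def process_request(self, request, spider):
        # Returning None lets scrapy continue normally (other middlewares, then the Downloader)
        if not request.meta.get('usedSelenium', False):
            return None
        # Stand-in for a browser-rendered page_source
        html = "<html><body>rendered elsewhere</body></html>"
        return HtmlResponse(url=request.url, body=html, encoding='utf-8', request=request)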
3.2. Reading the source of the request/response middleware handling
The relevant code lives in scrapy/core/downloader/middleware.py (the full path on this machine appears in the first comment line below):

Annotated source:
# File: E:\Miniconda\Lib\site-packages\scrapy\core\downloader\middleware.py
"""
Downloader Middleware manager

See documentation in docs/topics/downloader-middleware.rst
"""
import six

from twisted.internet import defer

from scrapy.http import Request, Response
from scrapy.middleware import MiddlewareManager
from scrapy.utils.defer import mustbe_deferred
from scrapy.utils.conf import build_component_list


class DownloaderMiddlewareManager(MiddlewareManager):

    component_name = 'downloader middleware'

    @classmethod
    def _get_mwlist_from_settings(cls, settings):
        # Reads the custom middlewares defined in settings.py or the spider's custom_settings, e.g.:
        '''
        'DOWNLOADER_MIDDLEWARES': {
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # SeleniumMiddleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
        '''
        return build_component_list(
            settings.getwithbase('DOWNLOADER_MIDDLEWARES'))

    # Registers every custom middleware's handler functions in the corresponding methods lists
    def _add_middleware(self, mw):
        if hasattr(mw, 'process_request'):
            self.methods['process_request'].append(mw.process_request)
        if hasattr(mw, 'process_response'):
            self.methods['process_response'].insert(0, mw.process_response)
        if hasattr(mw, 'process_exception'):
            self.methods['process_exception'].insert(0, mw.process_exception)

    # The whole download flow
    def download(self, download_func, request, spider):
        @defer.inlineCallbacks
        def process_request(request):
            # The request goes through every middleware's process_request in turn
            # (the methods collected in the list above)
            for method in self.methods['process_request']:
                response = yield method(request=request, spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                        'Middleware %s.process_request must return None, Response or Request, got %s' % \
                        (six.get_method_self(method).__class__.__name__, response.__class__.__name__)
                # This is the key point:
                # if some middleware's process_request produces a Response object,
                # that response is returned immediately, the loop exits,
                # and the remaining process_request methods are skipped.
                # Our earlier header/proxy middlewares only add a user-agent or a proxy
                # and do not return anything.
                # Also note: what is returned here must be a Response object;
                # the HtmlResponse we build later is a subclass of Response.
                if response:
                    defer.returnValue(response)
            # If none of the process_request methods above returned a Response,
            # the (possibly modified) Request is finally handed to download_func,
            # which downloads it and returns a Response.
            # That Response then passes through every middleware's process_response, below.
            defer.returnValue((yield download_func(request=request, spider=spider)))

        @defer.inlineCallbacks
        def process_response(response):
            assert response is not None, 'Received None in process_response'
            if isinstance(response, Request):
                defer.returnValue(response)

            for method in self.methods['process_response']:
                response = yield method(request=request, response=response,
                                        spider=spider)
                assert isinstance(response, (Response, Request)), \
                    'Middleware %s.process_response must return Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if isinstance(response, Request):
                    defer.returnValue(response)
            defer.returnValue(response)

        @defer.inlineCallbacks
        def process_exception(_failure):
            exception = _failure.value
            for method in self.methods['process_exception']:
                response = yield method(request=request, exception=exception,
                                        spider=spider)
                assert response is None or isinstance(response, (Response, Request)), \
                    'Middleware %s.process_exception must return None, Response or Request, got %s' % \
                    (six.get_method_self(method).__class__.__name__, type(response))
                if response:
                    defer.returnValue(response)
            defer.returnValue(_failure)

        deferred = mustbe_deferred(process_request, request)
        deferred.addErrback(process_exception)
        deferred.addCallback(process_response)
        return deferred
4. Code
Configure the selenium parameters in settings.py:
# File: settings.py
# ----------- selenium configuration -------------
SELENIUM_TIMEOUT = 25    # timeout of the selenium browser, in seconds
LOAD_IMAGE = True        # whether to download images
WINDOW_HEIGHT = 900      # browser window size
WINDOW_WIDTH = 900
In the spider, when generating requests, mark which ones should be downloaded through selenium:
# File: mySpider.py
class mySpider(CrawlSpider):

    name = "mySpiderAmazon"
    allowed_domains = ['amazon.com']

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'COOKIES_ENABLED': False,  # enabled by default
        'DOWNLOADER_MIDDLEWARES': {
            # proxy middleware
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # SeleniumMiddleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            # disable scrapy's default user-agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
    }

    # ..................... divider .......................
    # Inside one of the spider's parse methods: when generating the request,
    # put the "download via selenium" flag into meta
    yield Request(
        url="https://www.amazon.com/",
        meta={'usedSelenium': True, 'dont_redirect': True},
        callback=self.parseIndexPage,
        errback=self.error
    )
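The callback itself is not shown in the article. Since the middleware hands back an ordinary HtmlResponse, the callback can parse it like any other scrapy response. A hedged sketch of what parseIndexPage and the errback might look like (the selectors and yielded fields are illustrative, not taken from the original project):
# Sketch of the callback and errback named in the Request above; selectors are illustrative only
def parseIndexPage(self, response):
    print(f"parseIndexPage url = {response.url}, status = {response.status}, meta = {response.meta}")
    # response.body is the selenium-rendered page_source, so normal scrapy selectors work on it
    for title in response.xpath("//div[@id='resultsCol']//h2/text()").extract():
        yield {'title': title}

def error(self, failure):
    # just log the failure
    print(f"request failed: {failure}")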
In the downloader middleware middlewares.py, fetch the page with selenium (this is the core part):
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time


class SeleniumMiddleware():
    # Pipelines and middlewares often need settings values;
    # they are reachable through the scrapy.crawler.Crawler.settings attribute
    @classmethod
    def from_crawler(cls, crawler):
        # Pull the selenium parameters out of settings.py and instantiate the class
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'),
                   isLoadImage=crawler.settings.get('LOAD_IMAGE'),
                   windowHeight=crawler.settings.get('WINDOW_HEIGHT'),
                   windowWidth=crawler.settings.get('WINDOW_WIDTH')
                   )

    def __init__(self, timeout=30, isLoadImage=True, windowHeight=None, windowWidth=None):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.isLoadImage = isLoadImage
        # Keep one browser as an attribute of this class, so a new chrome window is not
        # opened for every request; every Request handled by this middleware reuses it
        self.browser = webdriver.Chrome()
        if windowHeight and windowWidth:
            self.browser.set_window_size(windowWidth, windowHeight)
        self.browser.set_page_load_timeout(self.timeout)        # page load timeout
        self.wait = WebDriverWait(self.browser, self.timeout)   # element wait timeout

    def process_request(self, request, spider):
        '''
        Fetch the page with chrome
        :param request: the Request object
        :param spider: the Spider object
        :return: an HtmlResponse
        '''
        # self.logger.debug('chrome is getting page')
        print(f"chrome is getting page")
        # The flag in meta decides whether this request is fetched through selenium
        usedSelenium = request.meta.get('usedSelenium', False)
        if usedSelenium:
            try:
                self.browser.get(request.url)
                # wait for the search box to appear
                input = self.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
                )
                time.sleep(2)
                input.clear()
                input.send_keys("iphone 7s")
                # press Enter to search
                input.send_keys(Keys.RETURN)
                # wait for the search results to appear
                searchRes = self.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
                )
            except Exception as e:
                # self.logger.debug(f'chrome getting page error, Exception = {e}')
                print(f"chrome getting page error, Exception = {e}")
                return HtmlResponse(url=request.url, status=500, request=request)
            else:
                time.sleep(3)
                return HtmlResponse(url=request.url,
                                    body=self.browser.page_source,
                                    request=request,
                                    # ideally this should match the page's real encoding
                                    encoding='utf-8',
                                    status=200)
5. Run results

6. Remaining problems
6.1. The spider has closed, but chrome has not exited.
2018-04-04 09:26:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 2092766,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 4, 1, 26, 16, 763602),
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 4, 4, 1, 25, 48, 301602)}
2018-04-04 09:26:18 [scrapy.core.engine] INFO: Spider closed (finished)
Above, we placed the browser object inside the middleware, which only gives us process_request and process_response; we have not shown how the middleware could hook into scrapy's shutdown to close the browser.
Solution: use signals; when the spider_closed signal is received, call browser.quit().
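If you prefer to keep the browser inside the middleware, the same signal mechanism is available there as well: from_crawler receives the crawler, and crawler.signals can connect a handler to spider_closed. A sketch of that alternative (class and method names are illustrative, and this is not the approach the improved code below takes):
# Alternative: let the middleware itself listen for spider_closed and quit chrome
from scrapy import signals
from selenium import webdriver

class SeleniumMiddlewareWithSignals(object):
    @classmethod
    def from_crawler(cls, crawler):
        mw = cls()
        # crawler.signals.connect registers a callback for the given scrapy signal
        crawler.signals.connect(mw.spider_closed, signal=signals.spider_closed)
        return mw

    def __init__(self):
        self.browser = webdriver.Chrome()

    def spider_closed(self, spider):
        self.browser.quit()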
6.2. When one project starts several spiders at once, they all share the selenium browser in the middleware, which hurts concurrency.
With the scrapy + selenium approach only some pages, often only a small fraction, actually need chrome. Since keeping chrome in the middleware brings all these limitations, why not move it into the spider instead? The benefit: each spider owns its own chrome, so when several spiders run there are several chrome instances rather than all spiders sharing one, which is better for concurrency.
Solution: move chrome initialization into the spider, so every spider owns its own chrome.
7. Improved code
Configure the selenium parameters in settings.py (unchanged from before):
# File: settings.py
# ----------- selenium configuration -------------
SELENIUM_TIMEOUT = 25    # timeout of the selenium browser, in seconds
LOAD_IMAGE = True        # whether to download images
WINDOW_HEIGHT = 900      # browser window size
WINDOW_WIDTH = 900
In the spider, when generating requests, mark which ones should be downloaded through selenium:
# File: mySpider.py
# selenium imports
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
# scrapy signal-related imports
from scrapy.utils.project import get_project_settings
# the import below is about to be deprecated, so we do not use it
# from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
# the newer approach: import dispatcher from pydispatch directly
from pydispatch import dispatcher


class mySpider(CrawlSpider):

    name = "mySpiderAmazon"
    allowed_domains = ['amazon.com']

    custom_settings = {
        'LOG_LEVEL': 'INFO',
        'DOWNLOAD_DELAY': 0,
        'COOKIES_ENABLED': False,  # enabled by default
        'DOWNLOADER_MIDDLEWARES': {
            # proxy middleware
            'mySpider.middlewares.ProxiesMiddleware': 400,
            # SeleniumMiddleware
            'mySpider.middlewares.SeleniumMiddleware': 543,
            # disable scrapy's default user-agent middleware
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
        },
    }

    # chrome initialization moves into the spider and becomes part of the spider itself
    def __init__(self, timeout=30, isLoadImage=True, windowHeight=None, windowWidth=None):
        # read the parameters from settings.py
        self.mySetting = get_project_settings()
        self.timeout = self.mySetting['SELENIUM_TIMEOUT']
        self.isLoadImage = self.mySetting['LOAD_IMAGE']
        self.windowHeight = self.mySetting['WINDOW_HEIGHT']
        self.windowWidth = self.mySetting['WINDOW_WIDTH']
        # initialize the chrome browser
        self.browser = webdriver.Chrome()
        if self.windowHeight and self.windowWidth:
            self.browser.set_window_size(self.windowWidth, self.windowHeight)
        self.browser.set_page_load_timeout(self.timeout)        # page load timeout
        self.wait = WebDriverWait(self.browser, self.timeout)   # element wait timeout
        super(mySpider, self).__init__()
        # connect the signal: when spider_closed is received, call mySpiderCloseHandle to quit chrome
        dispatcher.connect(receiver=self.mySpiderCloseHandle,
                           signal=signals.spider_closed
                           )

    # signal handler: quit the chrome browser
    def mySpiderCloseHandle(self, spider):
        print(f"mySpiderCloseHandle: enter ")
        self.browser.quit()

    # ..................... divider .......................
    # Inside one of the spider's parse methods: when generating the request,
    # put the "download via selenium" flag into meta
    yield Request(
        url="https://www.amazon.com/",
        meta={'usedSelenium': True, 'dont_redirect': True},
        callback=self.parseIndexPage,
        errback=self.error
    )
In the downloader middleware middlewares.py, fetch the page with selenium:
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from scrapy.http import HtmlResponse
from logging import getLogger
import time


class SeleniumMiddleware():
    # The middleware is handed the spider object, through which we can reach
    # the chrome-related attributes created in the spider's __init__
    def process_request(self, request, spider):
        '''
        Fetch the page with chrome
        :param request: the Request object
        :param spider: the Spider object
        :return: an HtmlResponse
        '''
        print(f"chrome is getting page")
        # The flag in meta decides whether this request is fetched through selenium
        usedSelenium = request.meta.get('usedSelenium', False)
        if usedSelenium:
            try:
                spider.browser.get(request.url)
                # wait for the search box to appear
                input = spider.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@class='nav-search-field ']/input"))
                )
                time.sleep(2)
                input.clear()
                input.send_keys("iphone 7s")
                # press Enter to search
                input.send_keys(Keys.RETURN)
                # wait for the search results to appear
                searchRes = spider.wait.until(
                    EC.presence_of_element_located((By.XPATH, "//div[@id='resultsCol']"))
                )
            except Exception as e:
                print(f"chrome getting page error, Exception = {e}")
                return HtmlResponse(url=request.url, status=500, request=request)
            else:
                time.sleep(3)
                # page fetched successfully: build a success Response
                # (HtmlResponse is a subclass of Response)
                return HtmlResponse(url=request.url,
                                    body=spider.browser.page_source,
                                    request=request,
                                    # ideally this should match the page's real encoding
                                    encoding='utf-8',
                                    status=200)
Run output (when the spider finishes, mySpiderCloseHandle runs and quits the chrome browser):
['categorySelectorAmazon1.pipelines.MongoPipeline']
2018-04-04 11:56:21 [scrapy.core.engine] INFO: Spider opened
2018-04-04 11:56:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
chrome is getting page
parseProductDetail url = https://www.amazon.com/, status = 200, meta = {'usedSelenium': True, 'dont_redirect': True, 'download_timeout': 25.0, 'proxy': 'http://H37XPSB6V57VU96D:CAB31DAEB9313CE5@proxy.abuyun.com:9020', 'depth': 0}
chrome is getting page
2018-04-04 11:56:54 [scrapy.core.engine] INFO: Closing spider (finished)
mySpiderCloseHandle: enter
2018-04-04 11:56:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/response_bytes': 1938619,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 4, 3, 56, 54, 301602),
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 4, 4, 3, 56, 21, 642602)}
2018-04-04 11:56:59 [scrapy.core.engine] INFO: Spider closed (finished)
This concludes this article on how to integrate selenium into scrapy to crawl web pages. Thanks for reading, and for supporting 腳本之家.