A problem with Scrapy Response subclasses in practice
Main text
Today, while using Scrapy to crawl wallpapers (url: http://pic.netbian.com/4kmein...), I ran into a problem. I am writing it down here for later reference and discussion.
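Because the site is rendered dynamically, I chose to hook Scrapy up to Selenium. (Scrapy fetches pages much like the requests library does: it issues the HTTP request directly and does not execute JavaScript, so on its own it cannot fetch content that is rendered by JavaScript.)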
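To make the point concrete, here is a minimal standard-library sketch of what a non-rendering client sees. The markup in RAW_HTML is invented for illustration and is not the real page; the helper names are hypothetical too. The list items only exist as text inside the script, so a parser that reads the static markup finds no img elements.

```python
from html.parser import HTMLParser

# Invented page: the server sends an empty list and a script that would
# fill it in later, inside a browser. Not the real site's markup.
RAW_HTML = """
<html><body>
  <div id="main"><ul id="pics"></ul></div>
  <script>
    document.getElementById("pics").innerHTML =
        '<li><a href="/a.html"><img src="/a.jpg"></a></li>';
  </script>
</body></html>
"""


class ImgCollector(HTMLParser):
    """Collect the src of every <img> start tag found in the static markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def static_image_sources(html):
    """Return the image URLs visible without running any JavaScript."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    # Script contents are treated as plain text, so nothing is collected.
    print(static_image_sources(RAW_HTML))
```

A browser driven by Selenium runs the script first, which is why its page_source contains the elements that the raw download lacks.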
The downloader middleware therefore has to take the Request and return a Response itself, and the Response is where things went wrong. The official documentation gives the signature class scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None]), so I imported it with from scrapy.http import Response.

Running scrapy crawl girl produced the following error:
    results=response.xpath('//*[@id="main"]/div[3]/ul/li/a/img')
    raise NotSupported("Response content isn't text")
scrapy.exceptions.NotSupported: Response content isn't text
Checking the relevant code:
# middlewares.py
from scrapy import signals
from scrapy.http import Response
from scrapy.exceptions import IgnoreRequest
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class Pic4KgirlDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        try:
            # Open the page in a real browser so that JavaScript runs
            self.browser = webdriver.Chrome()
            self.wait = WebDriverWait(self.browser, 10)
            self.browser.get(request.url)
            # Wait until the pagination link exists, i.e. the page has rendered
            self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#main > div.page > a:nth-child(10)')))
            return Response(url=request.url, status=200, request=request, body=self.browser.page_source.encode('utf-8'))
        #except:
        #    raise IgnoreRequest()
        finally:
            self.browser.close()
I suspected that the problem was in this line:
return Response(url=request.url,status=200,request=request,body=self.browser.page_source.encode('utf-8'))
Looking at the definition of the Response class:
@property
def text(self):
    """For subclasses of TextResponse, this will return the body
    as text (unicode object in Python 2 and str in Python 3)
    """
    raise AttributeError("Response content isn't text")

def css(self, *a, **kw):
    """Shortcut method implemented only by responses whose content
    is text (subclasses of TextResponse).
    """
    raise NotSupported("Response content isn't text")

def xpath(self, *a, **kw):
    """Shortcut method implemented only by responses whose content
    is text (subclasses of TextResponse).
    """
    raise NotSupported("Response content isn't text")
This shows that the base Response class does not implement text, css() or xpath(). They only work on subclasses that override them, so a plain Response cannot be used when the spider needs selectors.
Response subclasses
TextResponse objects: class scrapy.http.TextResponse(url[, encoding[, ...]])
HtmlResponse objects: class scrapy.http.HtmlResponse(url[, ...])
XmlResponse objects: class scrapy.http.XmlResponse(url[, ...])
Take TextResponse as an example. After importing it with from scrapy.http import TextResponse, its definition reads:
class TextResponse(Response):

    _DEFAULT_ENCODING = 'ascii'

    def __init__(self, *args, **kwargs):
        self._encoding = kwargs.pop('encoding', None)
        self._cached_benc = None
        self._cached_ubody = None
        self._cached_selector = None
        super(TextResponse, self).__init__(*args, **kwargs)
Here the xpath method has been overridden:
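    @property
    def selector(self):
        from scrapy.selector import Selector
        if self._cached_selector is None:
            self._cached_selector = Selector(self)
        return self._cached_selector

    def xpath(self, query, **kwargs):
        return self.selector.xpath(query, **kwargs)

    def css(self, query):
        return self.selector.css(query)
So a response that the spider will query with xpath() or css() has to be built from one of the subclasses rather than from Response itself. The subclasses already override these methods, and for an HTML page HtmlResponse is the natural choice.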
Scrapy crawler beginner tutorial 11: Request and Response
Scrapy documentation: https://doc.scrapy.org/en/lat...
Chinese translation of the documentation: http://www.dbjr.com.cn/article/248161.htm

