Scrapy-Redis: An Example of Fetching Data with POST Requests
Preface
When crawling a small site, Scrapy by itself is usually enough.
Against a larger site, however, a single Scrapy instance quickly becomes inadequate.
Wouldn't it be great if several Scrapy instances could crawl together? Many hands make light work.
Unfortunately, Scrapy does not officially support multiple instances crawling the same site at once, although the docs do suggest a workaround:
**split the site into several parts and hand each part to a different Scrapy instance**
That sounds like a solution, but partitioning a site is tedious work.
This is where our protagonist, Scrapy-Redis, takes the stage!
If you are reading this article, you presumably already know what Scrapy and Scrapy-Redis are, so the basics are not repeated here. By default, Scrapy-Redis fetches data with GET requests; for cases that require POST requests, you simply override the make_request_from_data method. Oddly enough, I could not find a concise, clear answer for this online, perhaps because it is too simple?
Here I use httpbin.org as the example site. First, add the required settings to settings.py, adjusting them to your actual situation:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # schedule requests through a Redis-backed queue
SCHEDULER_PERSIST = True  # do not clear the Redis queue, so crawls can be paused/resumed
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # deduplicate all spiders through Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = "redis://127.0.0.1:6379"
The spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider


class HpbSpider(RedisSpider):
    name = 'hpb'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this
        method to provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.
        """
        # ``data`` is the raw bytes popped from the Redis list; instead of
        # treating it as a URL, send it as a POST form field.
        return scrapy.FormRequest("https://www.httpbin.org/post",
                                  formdata={"data": data},
                                  callback=self.parse)

    def parse(self, response):
        print(response.body)
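As the docstring notes, data is simply the raw bytes popped from Redis. If the queue carried structured messages instead of a bare value, make_request_from_data could decode them first. The following is a sketch under that assumption (the JSON layout with url and data keys is invented for illustration); it would replace the method on HpbSpider:

import json

# a sketch: assuming each Redis entry is JSON like {"url": "...", "data": "..."}
def make_request_from_data(self, data):
    payload = json.loads(data.decode('utf-8'))
    return scrapy.FormRequest(payload["url"],
                              formdata={"data": payload["data"]},
                              callback=self.parse)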
Here the response is simply printed for brevity; in real use, you could hook up an item pipeline to write to a database and so on, along the lines of the sketch below.
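As a rough illustration, here is a minimal pipeline sketch, assuming parse() were changed to yield dicts instead of printing (the JsonLinesPipeline name and the results.jl path are made up for this example):

# pipelines.py -- a minimal sketch, not from the original article
import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        # append mode, so a paused and resumed crawl keeps earlier results
        self.file = open('results.jl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

Enable it in settings.py with ITEM_PIPELINES = {'myproject.pipelines.JsonLinesPipeline': 300}, where the myproject module path is also hypothetical.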
Then start the spider with scrapy crawl hpb. Since we have not yet written anything to test_post_data, the program enters a waiting state after startup. Next, simulate writing data into the queue:
import redis

rd = redis.Redis('127.0.0.1', port=6379, db=0)
for _ in range(1000):
    rd.lpush('test_post_data', _)
At this point, you can see that the spider has started fetching data:
2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "0"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "1"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "3"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "2"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "4"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "5"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "6"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "7"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
As for duplicate data: if the POSTed data repeats, that request simply will not be sent again. If you have the special case where POSTing the same data can return different results, adding dont_filter=True does not help, because the RFPDupeFilter class does not take this parameter into account; you need to rewrite it, as sketched below.
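A minimal sketch of such an override, assuming the goal is for requests explicitly marked dont_filter=True to bypass the Redis-based deduplication (the class name and module path are illustrative):

# dupefilter.py -- a minimal sketch, not the stock implementation
from scrapy_redis.dupefilter import RFPDupeFilter

class DontFilterAwareDupeFilter(RFPDupeFilter):
    def request_seen(self, request):
        # honour dont_filter: never treat such a request as already seen
        if request.dont_filter:
            return False
        # otherwise fall back to the normal Redis fingerprint check
        return super().request_seen(request)

Point DUPEFILTER_CLASS in settings.py at this class instead of the stock one, e.g. DUPEFILTER_CLASS = 'myproject.dupefilter.DontFilterAwareDupeFilter'.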
Summary
That is all for this article. I hope its content offers some reference value for your study or work; thank you all for supporting 腳本之家.