Scrapy-Redis: An Example of Fetching Data with POST Requests
Preface
When crawling a small site, Scrapy by itself is usually enough.
Against a larger site, however, a single Scrapy instance quickly becomes inadequate.
Wouldn't it be great if several Scrapy instances could crawl together? Many hands make light work.
Unfortunately, Scrapy does not officially support multiple instances crawling the same site at once, although the docs do suggest a workaround:
**split the site into several parts and hand each part to a different Scrapy instance**
That sounds like a solution, but partitioning a site is tedious work.
This is where our protagonist, Scrapy-Redis, takes the stage!
If you are reading this article, you presumably already know what Scrapy and Scrapy-Redis are, so the basics are not repeated here. By default, Scrapy-Redis fetches data with GET requests; for cases that require POST requests, you simply override the make_request_from_data method. Oddly enough, I could not find a concise, clear answer for this online, perhaps because it is too simple?
Here I use httpbin.org as the example site. First, add the required settings to settings.py, adjusting them to your actual situation:
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # schedule requests through a Redis-backed queue
SCHEDULER_PERSIST = True  # do not clear the Redis queue, so crawls can be paused/resumed
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # deduplicate all spiders through Redis
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.SpiderPriorityQueue'
REDIS_URL = "redis://127.0.0.1:6379"
The spider code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_redis.spiders import RedisSpider


class HpbSpider(RedisSpider):
    name = 'hpb'
    redis_key = 'test_post_data'

    def make_request_from_data(self, data):
        """Returns a Request instance from data coming from Redis.

        By default, ``data`` is an encoded URL. You can override this
        method to provide your own message decoding.

        Parameters
        ----------
        data : bytes
            Message from redis.
        """
        # ``data`` is the raw bytes popped from the Redis list; instead of
        # treating it as a URL, send it as a POST form field.
        return scrapy.FormRequest("https://www.httpbin.org/post",
                                  formdata={"data": data},
                                  callback=self.parse)

    def parse(self, response):
        print(response.body)
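As the docstring notes, data is simply the raw bytes popped from Redis. If the queue carried structured messages instead of a bare value, make_request_from_data could decode them first. The following is a sketch under that assumption (the JSON layout with url and data keys is invented for illustration); it would replace the method on HpbSpider:

import json

# a sketch: assuming each Redis entry is JSON like {"url": "...", "data": "..."}
def make_request_from_data(self, data):
    payload = json.loads(data.decode('utf-8'))
    return scrapy.FormRequest(payload["url"],
                              formdata={"data": payload["data"]},
                              callback=self.parse)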
Here the response is simply printed for brevity; in real use, you could hook up an item pipeline to write to a database and so on, along the lines of the sketch below.
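As a rough illustration, here is a minimal pipeline sketch, assuming parse() were changed to yield dicts instead of printing (the JsonLinesPipeline name and the results.jl path are made up for this example):

# pipelines.py -- a minimal sketch, not from the original article
import json

class JsonLinesPipeline:
    def open_spider(self, spider):
        # append mode, so a paused and resumed crawl keeps earlier results
        self.file = open('results.jl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # one JSON object per line
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.file.close()

Enable it in settings.py with ITEM_PIPELINES = {'myproject.pipelines.JsonLinesPipeline': 300}, where the myproject module path is also hypothetical.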
Then start the spider with scrapy crawl hpb. Since we have not yet written anything to test_post_data, the program enters a waiting state after startup. Next, simulate writing data into the queue:
import redis

rd = redis.Redis('127.0.0.1', port=6379, db=0)
for _ in range(1000):
    rd.lpush('test_post_data', _)
At this point, you can see that the spider has started fetching data:
2019-05-06 16:30:21 [hpb] DEBUG: Read 8 requests from 'test_post_data'
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
2019-05-06 16:30:21 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.httpbin.org/post> (referer: None)
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "0"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "1"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "3"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "2"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "4"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "5"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "6"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
b'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "data": "7"\n }, \n "headers": {\n "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", \n "Accept-Encoding": "gzip,deflate", \n "Accept-Language": "en", \n "Content-Length": "6", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "
2019-05-06 16:31:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 280 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:32:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-05-06 16:33:09 [scrapy.extensions.logstats] INFO: Crawled 1001 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
As for duplicate data: if the POSTed data repeats, that request simply will not be sent again. If you have the special case where POSTing the same data can return different results, adding dont_filter=True does not help, because the RFPDupeFilter class does not take this parameter into account; you need to rewrite it, as sketched below.
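A minimal sketch of such an override, assuming the goal is for requests explicitly marked dont_filter=True to bypass the Redis-based deduplication (the class name and module path are illustrative):

# dupefilter.py -- a minimal sketch, not the stock implementation
from scrapy_redis.dupefilter import RFPDupeFilter

class DontFilterAwareDupeFilter(RFPDupeFilter):
    def request_seen(self, request):
        # honour dont_filter: never treat such a request as already seen
        if request.dont_filter:
            return False
        # otherwise fall back to the normal Redis fingerprint check
        return super().request_seen(request)

Point DUPEFILTER_CLASS in settings.py at this class instead of the stock one, e.g. DUPEFILTER_CLASS = 'myproject.dupefilter.DontFilterAwareDupeFilter'.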
Summary
That is all for this article. I hope its content offers some reference value for your study or work; thank you all for supporting 腳本之家.