An example of faking random request headers for a crawler in Pyspider
Pyspider uses the tornado library for its HTTP requests. A request can carry various parameters, such as the connection timeout, the data-transfer timeout and the request headers, but in the stock pyspider framework these parameters can only be supplied to the crawler through the crawl_config Python dictionary (shown below); the framework code turns this dictionary into task data and performs the HTTP request with it. The drawback of this mechanism is that it cannot give each individual request a different, randomized set of headers.
crawl_config = {
    "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
    "timeout": 120,
    "connect_timeout": 60,
    "retries": 5,
    "fetch_type": 'js',
    "auto_recrawl": True,
}
Here is one way to give the crawler random request headers:
1. Write the following script, put it in pyspider's libs folder, and name it header_switch.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Created on 2017-10-18 11:52:26
import random
import time
class HeadersSelector(object):
    """
    The header sets below deliberately omit the Host and Cookie fields;
    those are filled in per request.
    """
    headers_1 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "DNT": "1",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "https://www.baidu.com/s?wd=%BC%96%E7%A0%81&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=0&oq=If-None-Match&inputT=7282&rsv_t",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # a browser header set found online
    headers_2 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
        "Accept": "image/gif,image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnPAvZN",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    }  # a Windows 7 browser
    headers_3 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0",
        "Accept": "image/x-xbitmap,image/jpeg,application/x-shockwave-flash,application/vnd.ms-excel,application/vnd.ms-powerpoint,application/msword,*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=http%B4%20Pragma&rsf=1&rsp=4&f=1&oq=Pragma&tn=baiduhome_pg&ie=utf-8&usm=3&rsv_idx=2&rsv_pq=e9bd5e5000010",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.7,en;q=0.6",
    }  # Firefox on Linux
    headers_4 = {
        "Proxy-Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "*/*",
        "DNT": "1",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-ZTFnP",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
    }  # Firefox on Windows 10
    headers_5 = {
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64;) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36 Edge/15.15063",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Referer": "https://www.baidu.com/link?url=c-FMHf06-ZPhoRM4tWduhraKXhnSm_RzjXZ-",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.7,en;q=0.6",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
    }  # Edge on Windows 10
    headers_6 = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "DNT": "1",
        "Referer": "https://www.baidu.com/s?wd=If-None-Match&rsv_spt=1&rsv_iqid=0x9fcbc99a0000b5d7&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rq",
        "Accept-Charset": "gb2312,gbk;q=0.7,utf-8;q=0.7,*;q=0.7",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0",
    }  # a Windows 10 browser
    def __init__(self):
        pass

    def select_header(self):
        n = random.randint(1, 6)
        switch = {
            1: self.headers_1,
            2: self.headers_2,
            3: self.headers_3,
            4: self.headers_4,
            5: self.headers_5,
            6: self.headers_6,
        }
        headers = switch[n]
        return headers
Only six request headers are written here; if the crawl volume is very large, you can add many more, even hundreds, and then widen the random range to select among them.
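If the pool keeps growing, remembering to update the randint range for every new headers_N attribute is easy to forget. As an alternative, select_header can be written with random.choice over an explicit list, so adding headers_7, headers_8, ... only means extending the list. This is a minimal sketch (not part of the original script), assuming the headers_1 to headers_6 class attributes defined above:

    def select_header(self):
        # Pool of the header dicts defined above; extend this list as more are added.
        pool = [self.headers_1, self.headers_2, self.headers_3,
                self.headers_4, self.headers_5, self.headers_6]
        return random.choice(pool)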
2. In the pyspider script, write code such as the following:
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2017-08-18 11:52:26
from pyspider.libs.base_handler import *
from pyspider.libs.header_switch import HeadersSelector
import sys
defaultencoding = 'utf-8'
if sys.getdefaultencoding() != defaultencoding:
reload(sys)
sys.setdefaultencoding(defaultencoding)
class Handler(BaseHandler):
    crawl_config = {
        "user_agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
        "timeout": 120,
        "connect_timeout": 60,
        "retries": 5,
        "fetch_type": 'js',
        "auto_recrawl": True,
    }
    @every(minutes=24 * 60)
    def on_start(self):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a fresh header for this request
        # header["X-Requested-With"] = "XMLHttpRequest"
        orig_href = "https://www.baidu.com/"  # starting URL (placeholder example)
        self.crawl(orig_href,
                   callback=self.index_page,
                   headers=header)  # the headers must be passed inside crawl(); cookies can be read from response.cookies
    @config(age=24 * 60 * 60)
    def index_page(self, response):
        header_slt = HeadersSelector()
        header = header_slt.select_header()  # get a fresh header for this request
        # header["X-Requested-With"] = "XMLHttpRequest"
        if response.cookies:
            header["Cookie"] = response.cookies
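The snippet above stops once the header has been assembled; for completeness, the follow-up requests issued from index_page would pass the same header to self.crawl. A short sketch, where detail_page is a hypothetical callback not shown in the article:

        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href,
                       callback=self.detail_page,  # hypothetical follow-up callback
                       headers=header)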
The crucial part is that in every callback (on_start, index_page, and so on) a header selector is instantiated on each call, so every request gets a different header. Pay attention to the following code:
header_slt = HeadersSelector()
header = header_slt.select_header()  # get a fresh header
# header["X-Requested-With"] = "XMLHttpRequest"
header["Host"] = "www.baidu.com"
if response.cookies:
    header["Cookie"] = response.cookies
When an AJAX request is sent through XHR it carries this header, which servers often use to decide whether a request is an Ajax request; for such content you need to add {'X-Requested-With': 'XMLHttpRequest'} to the headers before it can be fetched.
Once the URL is fixed, the Host field of the request header is fixed as well, so add it as needed. The urlparse module provides functions for parsing the host out of a URL; simply read the netloc attribute of the parsed result.
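For example, a small sketch of deriving Host from the URL; on Python 2 the module is called urlparse, while on Python 3 the same function lives in urllib.parse, and header is assumed to be the dict returned by select_header:

try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3

header["Host"] = urlparse("https://www.baidu.com/s?wd=test").netloc  # -> "www.baidu.com"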
If the response carries cookies, they need to be added to the request header.
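Note that response.cookies is a dict, whereas the Cookie request header expects a single string; one way to serialize it (a sketch, not from the original article) is:

if response.cookies:
    header["Cookie"] = "; ".join(
        "%s=%s" % (name, value) for name, value in response.cookies.items()
    )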
If you have further disguising needs, add them in the same way.
With that, random request headers are in place. Done.
The above is the whole of this example of faking random request headers for a crawler in Pyspider; hopefully it gives readers a useful reference.