How to scrape through a proxy with Python's Scrapy crawler framework
1. Create a new file "middlewares.py" in your Scrapy project
# Importing the base64 library because we'll need it ONLY if
# the proxy we are going to use requires authentication
import base64

# Start your middleware class
class ProxyMiddleware(object):
    # Overwrite process_request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # Set up basic authentication for the proxy; strip() removes the
        # trailing newline that base64.encodestring appends, which would
        # otherwise corrupt the Proxy-Authorization header
        encoded_user_pass = base64.encodestring(proxy_user_pass).strip()
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
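The snippet above is written against the old Python 2 / scrapy.contrib API; base64.encodestring was removed in Python 3. A minimal sketch of the same middleware for a Python 3 project (the proxy address and credentials are still placeholders) might look like this:

import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        # b64encode takes bytes and returns bytes with no trailing newline
        creds = base64.b64encode(b"USERNAME:PASSWORD").decode("ascii")
        request.headers['Proxy-Authorization'] = 'Basic ' + creds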
2. Add the following to the project settings file (./project_name/settings.py)
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'project_name.middlewares.ProxyMiddleware': 100,
}
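A note on the priorities: process_request is called in ascending order of these numbers, so our ProxyMiddleware (100) runs before the built-in HttpProxyMiddleware (110), and the proxy address and Proxy-Authorization header are already attached to the request by the time the built-in middleware sees it.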
Just two steps, and requests now go through the proxy. Let's test it ^_^
from scrapy.contrib.spiders import CrawlSpider

class TestSpider(CrawlSpider):
    name = "test"
    domain_name = "whatismyip.com"
    # The following url is subject to change; you can get the latest one here:
    # http://www.whatismyip.com/faq/automation.asp
    start_urls = ["http://xujian.info"]

    def parse(self, response):
        # Save the response so we can inspect which IP the target site saw
        with open('test.html', 'wb') as f:
            f.write(response.body)
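Run the spider with "scrapy crawl test" and open the saved test.html: if the page reports the proxy's IP rather than your own, the requests are indeed going through the proxy.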
3. Use a random user-agent
By default, Scrapy crawls with a single user-agent, which makes it easy for sites to block you. The code below picks a user-agent at random from a predefined list for each page it fetches.
Add the following to settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'Crawler.comm.rotate_useragent.RotateUserAgentMiddleware': 400,
}
Note: "Crawler" is your project's name and corresponds to a package directory of the same name. Mapping the built-in UserAgentMiddleware to None disables it, so it cannot overwrite the header we choose. The middleware code follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # Randomly pick a user-agent from the list below
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list contains Chrome, IE, Firefox, Mozilla, Opera and Netscape strings.
    # For more user-agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
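To confirm the rotation actually works, you can point a throwaway spider at a page that echoes the User-Agent header back to you; httpbin.org/user-agent is one such service (my choice here, not from the original article). Running it several times should print different strings from the list above. A minimal sketch using the same old BaseSpider API as the article:

from scrapy.spider import BaseSpider

class UACheckSpider(BaseSpider):
    name = "uacheck"
    # This endpoint returns the User-Agent it received, as JSON
    start_urls = ["http://httpbin.org/user-agent"]

    def parse(self, response):
        # Each run should show a different entry from user_agent_list
        print(response.body)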