Scraping the Sunshine Hotline inquiry platform (阳光热线问政平台) with Python and Scrapy: a walkthrough
Goal: scrape the title, content, number, and URL of every post in the platform's problem-report section.
The CrawlSpider version works as follows:
Create the crawler project dongguan
scrapy startproject dongguan
Define the fields in items.py

    # -*- coding: utf-8 -*-
    import scrapy


    class DongguanItem(scrapy.Item):
        # Link to the post
        url = scrapy.Field()
        # Post title
        title = scrapy.Field()
        # Post number
        number = scrapy.Field()
        # Post content
        content = scrapy.Field()

In the spiders directory, create and write the spider file sun.py

    # -*- coding: utf-8 -*-
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    from dongguan.items import DongguanItem


    class SunSpider(CrawlSpider):
        name = 'dg'
        allowed_domains = ['wz.sun0769.com']
        start_urls = ['http://wz.sun0769.com/html/top/report.shtml']

        # rules is a collection of Rule objects; every rule is applied to each
        # response. If the server fights crawlers, e.g. by returning fake URLs,
        # the Rule argument process_links can call a custom function that
        # turns them back into real URLs.
        rules = (
            # Every URL has a unique fingerprint and each crawl keeps a
            # de-duplication queue, so a URL is requested only once.
            # A Rule without a callback follows links by default: matched
            # links are requested and their responses are searched for more
            # matches, but no callback processes the response data.
            # With follow=False only the links matched on the current page
            # are requested; with follow=True every matched response is
            # searched again until no new links remain. DEPTH_LIMIT in
            # settings.py can bound how deep the crawl goes.
            Rule(LinkExtractor(allow=r'page=')),
            Rule(LinkExtractor(allow=r'http://wz\.sun0769\.com/html/question/\d+/\d+\.shtml'),
                 callback='parse_item'),
        )

        def parse_item(self, response):
            print(response.url)
            item = DongguanItem()
            item['url'] = response.url
            # The <strong> text holds the title, with the post number at the end
            title = response.xpath('//div[@class="pagecenter p3"]//strong/text()').extract()[0]
            item['title'] = title
            item['number'] = title.split(' ')[-1].split(':')[-1]
            # Posts with images put their text in div.contentext; posts without
            # images have no such div, so fall back to div "c1 text14_2".
            if len(response.xpath('//div[@class="contentext"]')) == 0:
                item['content'] = ''.join(response.xpath('//div[@class="c1 text14_2"]/text()').extract())
            else:
                item['content'] = ''.join(response.xpath('//div[@class="contentext"]/text()').extract())
            yield item

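The `allow` arguments are regular expressions that `LinkExtractor` searches for in each extracted URL. The sketch below mimics that matching with the standard `re` module only, as a quick way to check the patterns before running a crawl. The sample URLs and the `classify` helper are made up for illustration and are not part of the project or of Scrapy.

```python
import re

# The two patterns used in the spider's rules. Raw strings and escaped dots
# make "." match a literal dot instead of any character.
PAGE_PATTERN = re.compile(r'page=')
POST_PATTERN = re.compile(r'http://wz\.sun0769\.com/html/question/\d+/\d+\.shtml')

# Hypothetical URLs, invented for this example only.
SAMPLE_URLS = [
    'http://wz.sun0769.com/index.php/question/report?page=30',
    'http://wz.sun0769.com/html/question/201907/123456.shtml',
    'http://wz.sun0769.com/html/top/report.shtml',
]


def classify(url):
    """Return which rule would pick up the URL, or None if neither matches."""
    if PAGE_PATTERN.search(url):
        return 'listing page (followed, no callback)'
    if POST_PATTERN.search(url):
        return 'post page (handled by parse_item)'
    return None


for url in SAMPLE_URLS:
    print(url, '->', classify(url))
```

Checking patterns this way is cheap, and it makes mistakes such as an unescaped dot or a missing raw-string prefix visible before any request is sent.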
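Write the pipeline file pipelines.py

    # -*- coding: utf-8 -*-
    import json


    class DongguanPipeline(object):
        def __init__(self):
            # Open in text mode as UTF-8 so non-ASCII text is written as-is
            self.file = open('dongguan.json', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            # One JSON object per line; ensure_ascii=False keeps Chinese readable
            content = json.dumps(dict(item), ensure_ascii=False) + '\n'
            self.file.write(content)
            return item

        def close_spider(self, spider):
            # Called by Scrapy when the spider finishes
            self.file.close()
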
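The pipeline writes one JSON object per line. A minimal sketch of that output format, independent of Scrapy, shows what `ensure_ascii=False` changes and that the file can be read back line by line. The helper names, the sample items, and the temporary file path are assumptions made for this example.

```python
import json
import os
import tempfile


def write_json_lines(items, path):
    """Write each item as one JSON object per line, encoded as UTF-8."""
    with open(path, 'w', encoding='utf-8') as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')


def read_json_lines(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical items shaped like the spider's output.
sample_items = [
    {'url': 'http://example.com/1', 'title': '标题一', 'number': '1001', 'content': '内容一'},
    {'url': 'http://example.com/2', 'title': '标题二', 'number': '1002', 'content': '内容二'},
]

path = os.path.join(tempfile.mkdtemp(), 'sample.json')
write_json_lines(sample_items, path)
restored = read_json_lines(path)

# With ensure_ascii=False the Chinese text is stored as-is; the default
# setting would store it as \uXXXX escapes instead.
readable = json.dumps({'t': '标题'}, ensure_ascii=False)
escaped = json.dumps({'t': '标题'})
print(readable)
print(escaped)
print(restored == sample_items)
```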
Write the settings.py file

    # -*- coding: utf-8 -*-
    BOT_NAME = 'dongguan'

    SPIDER_MODULES = ['dongguan.spiders']
    NEWSPIDER_MODULE = 'dongguan.spiders'

    # The log file is saved in the current directory by default; messages at
    # or above LOG_LEVEL (here INFO) are written to it
    LOG_FILE = 'dongguan.log'
    LOG_LEVEL = 'INFO'

    # Crawl depth limit
    # DEPTH_LIMIT = 1

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'dongguan (+http://www.yourdomain.com)'

    # Obey robots.txt rules
    # ROBOTSTXT_OBEY = True

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'dongguan.pipelines.DongguanPipeline': 300,
    }

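Scrapy's logging is built on Python's standard `logging` module, so `LOG_LEVEL = 'INFO'` acts as a threshold: messages below INFO are dropped and INFO or higher are kept. The sketch below shows that threshold with the standard library alone. The logger name and the in-memory stream are stand-ins chosen for this example, not something Scrapy creates.

```python
import io
import logging

# An in-memory stream stands in for the log file in this example.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))

logger = logging.getLogger('level_demo')  # hypothetical logger name
logger.setLevel(logging.INFO)             # same threshold as LOG_LEVEL = 'INFO'
logger.addHandler(handler)
logger.propagate = False                  # keep the demo output in our stream only

logger.debug('below the threshold, dropped')
logger.info('at the threshold, kept')
logger.warning('above the threshold, kept')

output = stream.getvalue()
print(output)
```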
Test-run the spider by running this command in a terminal (from anywhere inside the project directory)
scrapy crawl dg
The Spider version works as follows:
Create the crawler project newdongguan
scrapy startproject newdongguan
Define the fields in items.py

    # -*- coding: utf-8 -*-
    import scrapy


    class NewdongguanItem(scrapy.Item):
        # Link to the post
        url = scrapy.Field()
        # Post title
        title = scrapy.Field()
        # Post number
        number = scrapy.Field()
        # Post content
        content = scrapy.Field()

In the spiders directory, create and write the spider file newsun.py

    # -*- coding: utf-8 -*-
    import scrapy

    from newdongguan.items import NewdongguanItem


    class NewsunSpider(scrapy.Spider):
        name = 'ndg'
        # Restricts the crawl to these domains. It is optional, but without it
        # the crawl is not limited to any domain and the spider may run out
        # of control.
        allowed_domains = ['wz.sun0769.com']
        offset = 0
        url = 'http://wz.sun0769.com/index.php/question/report?page=' + str(offset)
        start_urls = [url]

        def parse(self, response):
            link_list = response.xpath("//a[@class='news14']/@href").extract()
            for each in link_list:
                # Request every post on the page; deal_link extracts the
                # fields and hands them to the pipeline
                yield scrapy.Request(each, callback=self.deal_link)

            self.offset += 30
            if self.offset <= 124260:
                url = 'http://wz.sun0769.com/index.php/question/report?page=' + str(self.offset)
                # Request the next listing page; parse handles its response
                yield scrapy.Request(url, callback=self.parse)

        # Extract the data from each post page and return it to the pipeline
        def deal_link(self, response):
            item = NewdongguanItem()
            item['url'] = response.url
            title = response.xpath("//div[@class='pagecenter p3']//strong[@class='tgray14']/text()").extract()[0]
            item['title'] = title
            item['number'] = title.split(' ')[-1].split(':')[-1]
            if len(response.xpath("//div[@class='contentext']")) == 0:
                item['content'] = ''.join(response.xpath("//div[@class='c1 text14_2']/text()").extract())
            else:
                item['content'] = ''.join(response.xpath("//div[@class='contentext']/text()").extract())
            yield item

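Two pieces of this spider are plain string handling and can be checked without any network access: the listing-page URLs produced by stepping `offset` by 30, and the post number cut from the end of the title. The sketch below reproduces both. The function names and the sample title are hypothetical, and the real page title may use a different separator (for example a full-width colon), in which case the split character would need to change.

```python
BASE_URL = 'http://wz.sun0769.com/index.php/question/report?page='
PAGE_STEP = 30        # offset increment used by the spider
MAX_OFFSET = 124260   # last offset the spider requests


def listing_urls(max_offset=MAX_OFFSET, step=PAGE_STEP):
    """Yield listing-page URLs in the order the spider requests them."""
    offset = 0
    while offset <= max_offset:
        yield BASE_URL + str(offset)
        offset += step


def extract_number(title):
    """Take the last space-separated token, then the text after its colon."""
    return title.split(' ')[-1].split(':')[-1]


first_pages = list(listing_urls(max_offset=90))
total_pages = sum(1 for _ in listing_urls())
number = extract_number('Sample question title  No:191166')  # made-up title

print(first_pages)
print(total_pages)
print(number)
```

Keeping the pagination in a generator like this also avoids mutating `self.offset` across callbacks, though the spider above works as written because each listing page is requested only after the previous one has been parsed.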
Write the pipeline file pipelines.py

    # -*- coding: utf-8 -*-
    import codecs
    import json


    class NewdongguanPipeline(object):
        def __init__(self):
            # codecs.open sets the file encoding once, so the content does not
            # have to be encoded on every write
            self.file = codecs.open('newdongguan.json', 'w', encoding='utf-8')
            # Previous approach
            # self.file = open('newdongguan.json', 'w')

        def process_item(self, item, spider):
            print(item['title'])
            content = json.dumps(dict(item), ensure_ascii=False) + '\n'
            # Previous approach
            # self.file.write(content.encode('utf-8'))
            self.file.write(content)
            return item

        def close_spider(self, spider):
            # Called by Scrapy when the spider finishes
            self.file.close()

Write the settings.py file

    # -*- coding: utf-8 -*-
    BOT_NAME = 'newdongguan'

    SPIDER_MODULES = ['newdongguan.spiders']
    NEWSPIDER_MODULE = 'newdongguan.spiders'

    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'newdongguan (+http://www.yourdomain.com)'
    # The value is the header content only, without a "User-Agent:" prefix
    USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'

    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
        'newdongguan.pipelines.NewdongguanPipeline': 300,
    }

Test-run the spider by running this command in a terminal
scrapy crawl ndg
Note: in Markdown, code-block indentation can be handled with the Tab key, while plain text can be broken with the Enter key, as in "The Spider version works as follows:" and "1. Create the crawler project newdongguan".