Crawling the Sunshine Hotline Inquiry Platform (陽光熱線問政平臺) with Python and Scrapy: A Walkthrough

Updated: 2019-08-14 09:57:42 Author: silence-cc

This article shows how to crawl the Sunshine Hotline Inquiry Platform with Python and Scrapy. The example code is explained in detail and may serve as a reference for study or work.

Goal: crawl the issue reports on the Sunshine Hotline Inquiry Platform and extract each post's title, content, number, and URL.

The CrawlSpider version proceeds as follows:

Create the crawler project dongguan

scrapy startproject dongguan

Configure items.py

# -*- coding: utf-8 -*-
import scrapy


class DongguanItem(scrapy.Item):
  # URL of the post
  url = scrapy.Field()
  # Title of the post
  title = scrapy.Field()
  # Number of the post
  number = scrapy.Field()
  # Content of the post
  content = scrapy.Field()

In the spiders directory, create and write the spider file sun.py

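# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from dongguan.items import DongguanItem


class SunSpider(CrawlSpider):
  name = 'dg'
  allowed_domains = ['wz.sun0769.com']
  start_urls = ['http://wz.sun0769.com/html/top/report.shtml']
  # rules is a collection of Rule objects, and every rule is applied to each
  # response. If the web server has an anti-crawler mechanism, for example
  # returning fake URLs, the process_links argument of Rule can call a custom
  # function that rewrites the extracted links into real ones.
  rules = (
    # Each request has a unique fingerprint, and each crawl keeps a record of
    # seen fingerprints so that duplicates are dropped.
    # A Rule without a callback follows matched links by default: each matched
    # link is requested, and links matching the rules are extracted again from
    # its response, but no callback processes the response data.
    # With follow=False, Rule(LinkExtractor(allow="page=")) only picks up the
    # matching links on the current page. With follow=True, every matched link
    # is requested and its response is searched again, until no new links
    # remain. Scrapy schedules these requests through a queue rather than
    # Python recursion, so the interpreter's default recursion limit of 1000
    # does not apply; use the DEPTH_LIMIT setting to bound the crawl depth.
    Rule(LinkExtractor(allow=r'page=')),

    # Raw string with escaped dots, so "." only matches a literal dot
    Rule(LinkExtractor(allow=r'http://wz\.sun0769\.com/html/question/\d+/\d+\.shtml'), callback='parse_item'),

  )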

  def parse_item(self, response):
    print(response.url)
    item = DongguanItem()
    item['url'] = response.url
    # The heading text holds the title and, as its last part, the post number
    heading = response.xpath('//div[@class="pagecenter p3"]//strong/text()').extract()[0]
    item['title'] = heading
    item['number'] = heading.split(' ')[-1].split(':')[-1]
    # Posts with images are laid out differently: a post without an image has
    # no div with class="contentext", so use that to decide where the content is
    if len(response.xpath('//div[@class="contentext"]')) == 0:
      item['content'] = ''.join(response.xpath('//div[@class="c1 text14_2"]/text()').extract())
    else:
      item['content'] = ''.join(response.xpath('//div[@class="contentext"]/text()').extract())
    yield item
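
The allow arguments in the rules are regular expressions. To my understanding, LinkExtractor searches for them inside each extracted absolute URL, so an unescaped dot matches any character. The following is a minimal standard-library sketch of that matching behaviour, not Scrapy's own code; the three URLs and the matches helper are made up for illustration.

```python
import re

# Pattern with the dot before "shtml" left unescaped.
loose = re.compile(r'http://wz.sun0769.com/html/question/\d+/\d+.shtml')
# Stricter variant: literal dots are escaped.
strict = re.compile(r'http://wz\.sun0769\.com/html/question/\d+/\d+\.shtml')
# Pattern used for the pagination links.
paging = re.compile(r'page=')

# Hypothetical URLs, invented for this illustration only.
post_url = 'http://wz.sun0769.com/html/question/201908/123456.shtml'
odd_url = 'http://wz.sun0769.com/html/question/201908/123456xshtml'
list_url = 'http://wz.sun0769.com/index.php/question/report?page=30'
urls = (post_url, odd_url, list_url)


def matches(pattern, url):
    """Return True when the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


loose_hits = [matches(loose, u) for u in urls]
strict_hits = [matches(strict, u) for u in urls]
paging_hits = [matches(paging, u) for u in urls]

print(loose_hits, strict_hits, paging_hits)
```

The loose pattern also accepts the malformed second URL, while the escaped pattern accepts only the real post URL.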

Write the pipeline file pipelines.py

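# -*- coding: utf-8 -*-
import json


class DongguanPipeline(object):
  def __init__(self):
    # Open the output file as UTF-8 text so non-ASCII content is written as-is
    self.file = open('dongguan.json', 'w', encoding='utf-8')

  def process_item(self, item, spider):
    # One JSON object per line; ensure_ascii=False keeps Chinese text readable
    content = json.dumps(dict(item), ensure_ascii=False) + '\n'
    self.file.write(content)
    return item

  def close_spider(self, spider):
    # Scrapy calls close_spider(spider) when the spider finishes
    self.file.close()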
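
The pipeline stores one JSON object per line. The following standard-library sketch shows that idea outside Scrapy, assuming Python 3. The sample items, the file name demo.json, and the two helper functions are hypothetical.

```python
import json
import os
import tempfile

# Hypothetical items standing in for what the spider would yield.
items = [
    {'url': 'http://example.com/1', 'title': '標題一', 'number': '1001', 'content': '內容'},
    {'url': 'http://example.com/2', 'title': '標題二', 'number': '1002', 'content': ''},
]


def write_json_lines(path, rows):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, 'w', encoding='utf-8') as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + '\n')


def read_json_lines(path):
    """Read the file back into a list of dicts, skipping blank lines."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), 'demo.json')
write_json_lines(path, items)
loaded = read_json_lines(path)

with open(path, encoding='utf-8') as f:
    raw_text = f.read()
```

Because ensure_ascii=False is combined with a file opened as UTF-8 text, the Chinese titles appear verbatim in the file instead of as \uXXXX escapes.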

Write settings.py

# -*- coding: utf-8 -*-
BOT_NAME = 'dongguan'
SPIDER_MODULES = ['dongguan.spiders']
NEWSPIDER_MODULE = 'dongguan.spiders'
# The log file is saved in the current directory by default; messages at or above LOG_LEVEL (INFO here) are written to it
LOG_FILE = 'dongguan.log'
LOG_LEVEL = 'INFO'
# Crawl depth limit
# DEPTH_LIMIT = 1
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'dongguan (+http://www.yourdomain.com)'
# Obey robots.txt rules
# ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'dongguan.pipelines.DongguanPipeline': 300,
}

Test the spider by running the following command in a terminal (anywhere inside the project directory):

scrapy crawl dg

The Spider version proceeds as follows:

Create the crawler project newdongguan

scrapy startproject newdongguan

Configure items.py

# -*- coding: utf-8 -*-
import scrapy


class NewdongguanItem(scrapy.Item):
  # URL of the post
  url = scrapy.Field()
  # Title of the post
  title = scrapy.Field()
  # Number of the post
  number = scrapy.Field()
  # Content of the post
  content = scrapy.Field()

In the spiders directory, create and write the spider file newsun.py

# -*- coding: utf-8 -*-
import scrapy
from newdongguan.items import NewdongguanItem


class NewsunSpider(scrapy.Spider):
  name = 'ndg'
  # Restricts the crawl to these domains. It is optional, but without it the
  # spider is not limited to any domain and may run out of control.
  allowed_domains = ['wz.sun0769.com']
  offset = 0
  url = 'http://wz.sun0769.com/index.php/question/report?page=' + str(offset)
  start_urls = [url]

  def parse(self, response):
    link_list = response.xpath("//a[@class='news14']/@href").extract()
    for each in link_list:
      # Request every post on the page; deal_link extracts the data and
      # hands it to the pipeline. urljoin leaves absolute URLs unchanged.
      yield scrapy.Request(response.urljoin(each), callback=self.deal_link)
    self.offset += 30
    if self.offset <= 124260:
      url = 'http://wz.sun0769.com/index.php/question/report?page=' + str(self.offset)
      # Request the next list page; its response is handled by parse again
      yield scrapy.Request(url, callback=self.parse)

  # Extract the data from each post page and return it to the pipeline
  def deal_link(self, response):
    item = NewdongguanItem()
    item['url'] = response.url
    # The heading text holds the title and, as its last part, the post number
    heading = response.xpath("//div[@class='pagecenter p3']//strong[@class='tgray14']/text()").extract()[0]
    item['title'] = heading
    item['number'] = heading.split(' ')[-1].split(':')[-1]

    # A post without an image has no div with class='contentext'
    if len(response.xpath("//div[@class='contentext']")) == 0:
      item['content'] = ''.join(response.xpath("//div[@class='c1 text14_2']/text()").extract())
    else:
      item['content'] = ''.join(response.xpath("//div[@class='contentext']/text()").extract())
    yield item
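
The parse method pages through the list by raising the page offset in steps of 30 until it passes 124260. A small sketch of the same arithmetic, independent of Scrapy, shows which list URLs that produces. The names BASE_URL, PAGE_STEP, LAST_OFFSET, and page_urls are chosen here for illustration.

```python
BASE_URL = 'http://wz.sun0769.com/index.php/question/report?page='
PAGE_STEP = 30        # offset increment used by the spider
LAST_OFFSET = 124260  # upper bound hard-coded in the spider


def page_urls(step=PAGE_STEP, last=LAST_OFFSET):
    """Yield every list-page URL from offset 0 up to and including `last`."""
    for offset in range(0, last + 1, step):
        yield BASE_URL + str(offset)


urls = list(page_urls())
print(len(urls), urls[0], urls[-1])
```

With these bounds the spider visits 4143 list pages. The upper bound is a snapshot of the site at the time of writing and would need updating as posts are added.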

Write the pipeline file pipelines.py

# -*- coding: utf-8 -*-
import codecs
import json


class NewdongguanPipeline(object):

  def __init__(self):
    # Open the file with codecs and set the encoding once, so the content
    # does not have to be encoded on every write
    self.file = codecs.open('newdongguan.json', 'w', encoding='utf-8')
    # Earlier approach:
    # self.file = open('newdongguan.json','w')

  def process_item(self, item, spider):
    print(item['title'])
    content = json.dumps(dict(item), ensure_ascii=False) + '\n'
    # Earlier approach:
    # self.file.write(content.encode('utf-8'))
    self.file.write(content)
    return item

  def close_spider(self, spider):
    # Scrapy calls close_spider(spider) when the spider finishes
    self.file.close()
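
This pipeline uses codecs.open so that the encoding is set once when the file is opened. In Python 3 the built-in open accepts the same encoding argument. The sketch below, using a made-up record and temporary file names, checks that both approaches write the same bytes.

```python
import codecs
import json
import os
import tempfile

record = {'title': '帖子標題', 'number': '12345'}  # hypothetical item data
line = json.dumps(record, ensure_ascii=False) + '\n'

tmp_dir = tempfile.mkdtemp()
codecs_path = os.path.join(tmp_dir, 'codecs.json')
builtin_path = os.path.join(tmp_dir, 'builtin.json')

# Variant used by the pipeline: codecs.open with an explicit encoding.
with codecs.open(codecs_path, 'w', encoding='utf-8') as f:
    f.write(line)

# Python 3 alternative: built-in open with the same encoding.
# newline='' disables newline translation so the bytes match on every platform.
with open(builtin_path, 'w', encoding='utf-8', newline='') as f:
    f.write(line)

with open(codecs_path, 'rb') as f:
    codecs_bytes = f.read()
with open(builtin_path, 'rb') as f:
    builtin_bytes = f.read()

print(codecs_bytes == builtin_bytes)
```

Either form avoids calling encode('utf-8') on each line by hand.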

Write settings.py

# -*- coding: utf-8 -*-
BOT_NAME = 'newdongguan'
SPIDER_MODULES = ['newdongguan.spiders']
NEWSPIDER_MODULE = 'newdongguan.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'newdongguan (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'newdongguan.pipelines.NewdongguanPipeline': 300,
}

Test the spider by running the following command in a terminal:

scrapy crawl ndg

