快捷導(dǎo)航

Python大數(shù)據(jù)之從網(wǎng)頁(yè)上爬取數(shù)據(jù)的方法詳解

更新時(shí)間：2019年11月16日 10:43:42 作者：xuehyunyu

這篇文章主要介紹了Python大數(shù)據(jù)之從網(wǎng)頁(yè)上爬取數(shù)據(jù)的方法,結(jié)合實(shí)例形式詳細(xì)分析了Python爬蟲(chóng)爬取網(wǎng)頁(yè)數(shù)據(jù)的相關(guān)操作技巧,需要的朋友可以參考下

本文實(shí)例講述了Python大數(shù)據(jù)之從網(wǎng)頁(yè)上爬取數(shù)據(jù)的方法。分享給大家供大家參考，具體如下：

myspider.py ：

#!/usr/bin/python
# -*- coding:utf-8 -*-
from scrapy.spiders import Spider
from lxml import etree
from jredu.items import JreduItem
class JreduSpider(Spider):
  name = 'tt' #爬蟲(chóng)的名字，必須的，唯一的
  allowed_domains = ['sohu.com']
  start_urls = [
    'http://www.sohu.com'
  ]
  def parse(self, response):
    content = response.body.decode('utf-8')
    dom = etree.HTML(content)
    for ul in dom.xpath("http://div[@class='focus-news-box']/div[@class='list16']/ul"):
      lis = ul.xpath("./li")
      for li in lis:
        item = JreduItem() #定義對(duì)象
        if ul.index(li) == 0:
          strong = li.xpath("./a/strong/text()")
          li.xpath("./a/@href")
          item['title']= strong[0]
          item['href'] = li.xpath("./a/@href")[0]
        else:
          la = li.xpath("./a[last()]/text()")
          item['title'] = la[0]
          item['href'] = li.xpath("./a[last()]/href")[0]
        yield item

items.py ：

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class JreduItem(scrapy.Item):#相當(dāng)于Java里的實(shí)體類(lèi)
  # define the fields for your item here like:
  # name = scrapy.Field()
  title = scrapy.Field()#創(chuàng)建一個(gè)field對(duì)象
  href = scrapy.Field()
  pass

middlewares.py ：

# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
class JreduSpiderMiddleware(object):
  # Not all methods need to be defined. If a method is not defined,
  # scrapy acts as if the spider middleware does not modify the
  # passed objects.
  @classmethod
  def from_crawler(cls, crawler):
    # This method is used by Scrapy to create your spiders.
    s = cls()
    crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
    return s
  def process_spider_input(self, response, spider):
    # Called for each response that goes through the spider
    # middleware and into the spider.
    # Should return None or raise an exception.
    return None
  def process_spider_output(self, response, result, spider):
    # Called with the results returned from the Spider, after
    # it has processed the response.
    # Must return an iterable of Request, dict or Item objects.
    for i in result:
      yield i
  def process_spider_exception(self, response, exception, spider):
    # Called when a spider or process_spider_input() method
    # (from other spider middleware) raises an exception.
    # Should return either None or an iterable of Response, dict
    # or Item objects.
    pass
  def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.
    # Must return only requests (not items).
    for r in start_requests:
      yield r
  def spider_opened(self, spider):
    spider.logger.info('Spider opened: %s' % spider.name)

pipelines.py :

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import json
class JreduPipeline(object):
  def __init__(self):
    self.fill = codecs.open("data.txt",encoding="utf-8",mode="wb");
  def process_item(self, item, spider):
    line = json.dumps(dict(item))+"\n"
    self.fill.write(line)
    return item

settings.py ：

# -*- coding: utf-8 -*-
# Scrapy settings for jredu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#   http://doc.scrapy.org/en/latest/topics/settings.html
#   http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#   http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jredu'
SPIDER_MODULES = ['jredu.spiders']
NEWSPIDER_MODULE = 'jredu.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jredu (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#  'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#  'jredu.middlewares.JreduSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#  'jredu.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#  'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
  'jredu.pipelines.JreduPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

最后需要一個(gè)程序入口的方法：

main.py ：

#!/usr/bin/python
# -*- coding:utf-8 -*-
#爬蟲(chóng)文件的執(zhí)行入口
from scrapy import cmdline
cmdline.execute("scrapy crawl tt".split())

更多關(guān)于Python相關(guān)內(nèi)容可查看本站專(zhuān)題：《Python Socket編程技巧總結(jié)》、《Python正則表達(dá)式用法總結(jié)》、《Python數(shù)據(jù)結(jié)構(gòu)與算法教程》、《Python函數(shù)使用技巧總結(jié)》、《Python字符串操作技巧匯總》、《Python入門(mén)與進(jìn)階經(jīng)典教程》及《Python文件與目錄操作技巧匯總》

希望本文所述對(duì)大家Python程序設(shè)計(jì)有所幫助。

您可能感興趣的文章:

相關(guān)文章

Python基礎(chǔ)教程之利用期物處理并發(fā)
這篇文章主要給大家介紹了關(guān)于Python基礎(chǔ)教程之利用期物處理并發(fā)的相關(guān)資料，文中通過(guò)示例代碼介紹的非常詳細(xì)，對(duì)大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價(jià)值，需要的朋友們下面來(lái)一起學(xué)習(xí)學(xué)習(xí)吧。
2018-03-03
利用python操作SQLite數(shù)據(jù)庫(kù)及文件操作詳解
這篇文章主要給大家介紹了關(guān)于利用python操作SQLite數(shù)據(jù)庫(kù)及文件操作的相關(guān)資料，文中通過(guò)示例代碼介紹的非常詳細(xì)，對(duì)大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價(jià)值，需要的朋友們下面隨著小編來(lái)一起學(xué)習(xí)學(xué)習(xí)吧。
2017-09-09
Jupyter?notebook運(yùn)行后打不開(kāi)網(wǎng)頁(yè)的問(wèn)題解決
本文主要介紹了Jupyter?notebook運(yùn)行后打不開(kāi)網(wǎng)頁(yè)的問(wèn)題解決，文中通過(guò)示例代碼介紹的非常詳細(xì)，具有一定的參考價(jià)值，感興趣的小伙伴們可以參考一下
2022-03-03
這篇文章主要介紹了python實(shí)現(xiàn)的簡(jiǎn)單文本類(lèi)游戲,以?xún)蓚€(gè)實(shí)例形式分析了python操作文本與字符串的相關(guān)技巧,具有一定參考借鑒價(jià)值,需要的朋友可以參考下
2015-04-04

詳解Django項(xiàng)目中模板標(biāo)簽及模板的繼承與引用(網(wǎng)站中快速布置廣告)

這篇文章主要介紹了詳解Django項(xiàng)目中模板標(biāo)簽及模板的繼承與引用【網(wǎng)站中快速布置廣告】，小編覺(jué)得挺不錯(cuò)的，現(xiàn)在分享給大家，也給大家做個(gè)參考。一起跟隨小編過(guò)來(lái)看看吧

2019-03-03

利用Pandas索引和選取數(shù)據(jù)方法詳解

使用Pandas做數(shù)據(jù)分析的時(shí)候，用的最多的功能恐怕就是對(duì)于數(shù)據(jù)集的索引，選組數(shù)據(jù)子集。Pandas庫(kù)提供了很多非常實(shí)用的方法，了解并熟練使用這些方法而不是用for循環(huán)的方法將會(huì)事半功倍。在這一篇文章中，我們將著重介紹這些方法

2021-10-10

python計(jì)算階乘和的方法(1!+2!+3!+...+n!)

今天小編就為大家分享一篇python計(jì)算階乘和的方法(1!+2!+3!+...+n!)，具有很好的參考價(jià)值，希望對(duì)大家有所幫助。一起跟隨小編過(guò)來(lái)看看吧

2019-02-02

python中requests庫(kù)安裝與使用詳解

requests是一個(gè)很實(shí)用的Python HTTP客戶(hù)端庫(kù),爬蟲(chóng)和測(cè)試服務(wù)器響應(yīng)數(shù)據(jù)時(shí)經(jīng)常會(huì)用到,下面這篇文章主要給大家介紹了關(guān)于python中requests庫(kù)安裝與使用的相關(guān)資料,需要的朋友可以參考下

2022-07-07

pytorch之Resize()函數(shù)具體使用詳解

這篇文章主要介紹了pytorch之Resize()函數(shù)具體使用詳解，文中通過(guò)示例代碼介紹的非常詳細(xì)，對(duì)大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價(jià)值，需要的朋友們下面隨著小編來(lái)一起學(xué)習(xí)學(xué)習(xí)吧

2020-02-02

終止python代碼運(yùn)行的3種方式詳析

這篇文章主要給大家介紹了關(guān)于終止python代碼運(yùn)行的3種方式,python是解釋運(yùn)行的程序,程序進(jìn)入死循環(huán)或者其它異常都會(huì)導(dǎo)致程序無(wú)法正常結(jié)束,需要的朋友可以參考下

2023-07-07

欧美bbbwbbbw肥妇,免费乱码人妻系列日韩,一级黄片

Python大數(shù)據(jù)之從網(wǎng)頁(yè)上爬取數(shù)據(jù)的方法詳解

相關(guān)文章

最新評(píng)論

大家感興趣的內(nèi)容

最近更新的內(nèi)容

常用在線(xiàn)小工具