
Python Web Scraping in Practice: Using Scrapy to Crawl Douban Images

Updated: 2021-06-02 11:29:24   Author: 濯君
After writing plenty of crawlers with Python's urllib and BeautifulSoup, I decided to try the well-known Python crawling framework Scrapy. This post walks through using Scrapy to download a Douban celebrity's photos; readers who need this can follow along.

Using Scrapy to crawl all personal photos of a Douban film star

Taking Monica Bellucci (莫妮卡·貝魯奇) as the example.

[Image: Monica Bellucci]

1. First, open a terminal in the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows:

[Screenshot: generated project structure]
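For reference, scrapy startproject generates the standard layout below (the top-level directory name matches the project name):

banciyuan/
    scrapy.cfg            # deploy/project configuration
    banciyuan/            # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders live here
            __init__.py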

2. To run the Scrapy project conveniently from PyCharm, create a main.py:

from scrapy import cmdline

# Equivalent to running `scrapy crawl banciyuan` on the command line.
cmdline.execute("scrapy crawl banciyuan".split())
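Keep main.py in the project root next to scrapy.cfg, and run it with that folder as the working directory, so Scrapy can locate the project's settings.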

Then open Edit Configurations in PyCharm:

[Screenshot: PyCharm Edit Configurations menu]

Configure it as shown below; once that is done, you can run the Scrapy project simply by running main.py.

[Screenshot: run configuration settings]
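In that dialog, the script path typically points at main.py and the working directory at the project root, the folder containing scrapy.cfg; the exact fields depend on your PyCharm version.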

3. Analyze the target HTML page and create the corresponding spider.

[Screenshot: HTML structure of the photos page]
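The pagination XPath used in the spider below depends on the structure of Douban's paginator. As a quick sanity check, here is a minimal sketch using Scrapy's Selector against a hand-written fragment of that markup (the fragment is an assumption reconstructed from the XPath, not copied from the live page):

from scrapy import Selector

# Hand-written approximation of Douban's paginator markup (an assumption;
# the live page may differ in detail).
html = '''
<div class="paginator">
    <span class="thispage">1</span>
    <a href="?start=30">2</a>
    <a href="?start=570">20</a>
    <span class="next"><a href="?start=30">后页&gt;</a></span>
</div>
'''

sel = Selector(text=html)
# The "next page" link sits inside <span class="next">, so a[last()] matches
# the last page-number link rather than the next-page arrow.
print(sel.xpath('//div[@class="paginator"]/a[last()]/text()').get())  # -> '20'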

from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # The last direct <a> child of the paginator is the highest page number;
        # the "next page" link lives inside a <span>, so a[last()] does not match it.
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        # Douban lists 30 photos per page, hence start = i * 30.
        for i in range(int(num)):
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the detail-page link of every photo on this listing page.
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # On a photo's detail page, extract the full-size image URL and the page title.
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
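Before launching the full crawl, you can sanity-check these XPaths interactively with Scrapy's built-in shell; run it from the project root so the project settings (including the User-Agent configured below) are picked up:

scrapy shell "https://movie.douban.com/celebrity/1025156/photos/"
>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
>>> response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()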

4. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    title = scrapy.Field()
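Note that the spider stores src as a one-element list (item['src'] = [src]). Scrapy's stock ImagesPipeline expects a list of URLs in its image_urls field; the custom pipeline below overrides get_media_requests and reads item['src'][0] directly, so the list wrapper is kept mainly for consistency with that convention.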

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy

class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image URL stored on the item, passing the item along in
        # meta so that file_path() can read its title.
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save each image as <first word of the page title>/<original filename>,
        # grouping all of the star's photos into one folder under IMAGES_STORE.
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)

        return path
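With IMAGES_STORE = './images' (set in settings.py below) and a detail-page title of, say, '莫妮卡·贝鲁奇 Monica Bellucci的图片' (a hypothetical example of Douban's title format), file_path returns 莫妮卡·贝鲁奇/<filename>, so every photo lands under ./images/莫妮卡·贝鲁奇/. Also note that ImagesPipeline depends on the Pillow library, so install it first:

pip install Pillow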

settings.py

# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'banciyuan'

SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
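Compared with the generated template, the only active customizations are the browser-like USER_AGENT, ROBOTSTXT_OBEY = False, registering BanciyuanPipeline in ITEM_PIPELINES (priority 1, so it runs before any later-added pipeline), and IMAGES_STORE, which, as a relative path, is resolved against the directory the crawl is started from.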

5. Crawl results

[Screenshot: downloaded images under ./images]

Reference

Source code

This concludes this hands-on article on using Scrapy to crawl Douban images. For more on crawling Douban images with Scrapy, please search 腳本之家's earlier articles, and we hope you will continue to support 腳本之家!
