Crawling weather data with Python Scrapy and exporting it to a CSV file
Crawling the weather for xxx
Target URL: https://tianqi.2345.com/today-60038.htm
Installation
pip install scrapy
The version used here is Scrapy 2.5.
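To confirm which version actually got installed, one quick check from Python:

```python
import scrapy

# Prints the installed Scrapy version, e.g. 2.5.x
print(scrapy.__version__)
```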
Creating a Scrapy spider project
Enter the following command at the command line:
scrapy startproject name
where name is the project name,
e.g. scrapy startproject spider_weather
Then run:
scrapy genspider spider_name domain
e.g. scrapy genspider changshu tianqi.2345.com
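The genspider command drops a skeleton spider into the spiders folder; on Scrapy 2.5 the generated spiders/changshu.py looks roughly like this (the start_urls value it fills in is just the bare domain and gets replaced in the next step):

```python
import scrapy


class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['http://tianqi.2345.com/']

    def parse(self, response):
        # The template leaves the parsing logic up to you
        pass
```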
The resulting folder structure:
- spider_weather
  - spider_weather
    - spiders
      - __init__.py
      - changshu.py
    - __init__.py
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
  - scrapy.cfg
File descriptions
Name | Purpose |
---|---|
scrapy.cfg | Project configuration entry point; it mainly gives the Scrapy command-line tool its basic settings (the settings that actually control crawling live in settings.py) |
items.py | Defines the data model (item templates) for structured data, similar to Django's Model |
pipelines.py | Item-processing behaviour, e.g. persisting the structured data |
settings.py | Configuration file, e.g. crawl depth, concurrency, download delay |
spiders | Spider directory: create spider files here and write the crawling rules in them |
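As an illustration of what items.py is for, a minimal item for this project could declare the four fields the spider collects. The WeatherItem name and fields below are hypothetical; the spider in the next step actually yields plain dicts, which Scrapy accepts just as well:

```python
import scrapy


class WeatherItem(scrapy.Item):
    # One field per value scraped from the forecast page
    date = scrapy.Field()
    state = scrapy.Field()
    temp = scrapy.Field()
    wind = scrapy.Field()
```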
Starting the crawl
1. Write the scraping logic in the spider file you created under the spiders folder, which in this example is spiders/changshu.py.
The code looks like this:
```python
import scrapy


class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
        # Date, weather condition, temperature and wind level for each day.
        # The values are extracted with XPath; if you are new to XPath, it is
        # worth a quick look, the syntax is simple.
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()

        # Yield one record per day
        for date, state, temp, wind in zip(dates, states, temps, winds):
            yield {
                'date': date,
                'state': state,
                'temp': temp,
                'wind': wind
            }
```
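If the crawl comes back empty, the page markup has likely changed since this was written. The same XPath expressions can be tested interactively in scrapy shell before running the full spider; a quick sketch, assuming the selectors above are still current:

```python
# First start an interactive session:
#   scrapy shell https://tianqi.2345.com/today-60038.htm
# then evaluate the expressions used in parse(); note that the trailing space
# inside "seven-day-item " is deliberate and must match the page's HTML.
response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
```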
2. Configure settings.py
Set the User-Agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
Disable robots.txt compliance so the site's robots rules do not block the crawl:
ROBOTSTXT_OBEY = False
The whole file then looks like this:
```python
# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider_weather'

SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#     'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
3. Then run the following from the command line:
scrapy crawl changshu -o weather.csv
Note: this must be run from inside the spider_weather project directory.
The general form is scrapy crawl spider_name -o weather.csv, where the -o argument names the export file.
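As an alternative to the -o flag, Scrapy 2.1+ can also declare the export in settings.py through the FEEDS setting; a sketch for this project, with weather.csv as an example output name:

```python
# settings.py: roughly the same effect as running `scrapy crawl changshu -o weather.csv`
FEEDS = {
    'weather.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}
```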
4. The result looks like this:
Supplement: some field-related issues when Scrapy exports CSV
When exporting with scrapy -o in CSV format, the columns in the output file follow neither the order defined in items.py nor the order written in the spider, which makes some fields awkward to read; in addition, the exported CSV can end up with blank lines between items. This section describes how to solve both problems.
1. Field order:
Create a new file named csv_item_exporter.py inside the project's spiders directory, with the content below (the filename can be changed, but the module path must match the FEED_EXPORTERS setting configured in the next step):
```python
# Imports updated for Scrapy 2.x: scrapy.conf and scrapy.contrib.exporter no longer exist
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        settings = get_project_settings()
        # Read the delimiter and the desired column order from the project
        # settings and pass them on to the stock CsvItemExporter.
        kwargs['delimiter'] = settings.get('CSV_DELIMITER', ',')
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)
```
Then add the following to settings.py:
```python
# Register the custom exporter for the csv output format
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}

# Specify the order of the CSV output fields
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info',
]

# Specify the delimiter
CSV_DELIMITER = ','
```
With this in place, running scrapy crawl spider -o spider.csv produces columns in the order listed in FIELDS_TO_EXPORT.
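For what it's worth, current Scrapy releases also expose a built-in setting that controls the column order without a custom exporter; for the weather project above it would presumably be:

```python
# settings.py: built-in alternative to the custom exporter for ordering the columns
FEED_EXPORT_FIELDS = ['date', 'state', 'temp', 'wind']
```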
2. Blank lines in the exported CSV
You may also notice blank lines between rows of the CSV file. This happens (typically on Windows) because the stream the CSV writer uses is not opened with newline='', so each row terminator picks up an extra carriage return that renders as an empty line.
Solution:
Find the CsvItemExporter class in Scrapy's exporters.py (around line 215) and add newline="" to the io.TextIOWrapper call there; recent Scrapy releases may already include this, in which case no change is needed.
Alternatively, subclass CsvItemExporter and override it rather than editing the installed package.
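Both fixes boil down to the same idea: the csv writer's output stream must be opened with newline='', otherwise on Windows each row terminator gains an extra carriage return that shows up as a blank line. A standalone sketch of that behaviour outside Scrapy, with a made-up file name and sample row:

```python
import csv

# newline='' stops the csv module's '\r\n' row terminator from being
# translated into '\r\r\n' on Windows, which is what appears as blank lines.
with open('demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'state', 'temp', 'wind'])
    writer.writerow(['01/01', 'Sunny', '3°C', 'Level 3'])
```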
Summary
This concludes the article on crawling weather data with Python Scrapy and exporting it to a CSV file. For more on exporting crawled weather data to CSV with Scrapy, search 腳本之家's earlier articles, and please keep supporting 腳本之家!