Crawling weather data with Python Scrapy and exporting it to a CSV file
Crawling the weather for xxx
Target URL: https://tianqi.2345.com/today-60038.htm
Installation
pip install scrapy
The version used here is Scrapy 2.5.
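To confirm which version actually got installed, one quick check from Python:

```python
import scrapy

# Prints the installed Scrapy version, e.g. 2.5.x
print(scrapy.__version__)
```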
Creating a Scrapy spider project
Enter the following command at the command line:
scrapy startproject name
where name is the project name,
e.g. scrapy startproject spider_weather
Then run:
scrapy genspider spider_name domain
e.g. scrapy genspider changshu tianqi.2345.com
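The genspider command drops a skeleton spider into the spiders folder; on Scrapy 2.5 the generated spiders/changshu.py looks roughly like this (the start_urls value it fills in is just the bare domain and gets replaced in the next step):

```python
import scrapy


class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['http://tianqi.2345.com/']

    def parse(self, response):
        # The template leaves the parsing logic up to you
        pass
```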
The resulting folder structure:
- spider_weather
  - spider_weather
    - spiders
      - __init__.py
      - changshu.py
    - __init__.py
    - items.py
    - middlewares.py
    - pipelines.py
    - settings.py
  - scrapy.cfg
File descriptions
Name | Purpose |
---|---|
scrapy.cfg | Project configuration entry point; it mainly gives the Scrapy command-line tool its basic settings (the settings that actually control crawling live in settings.py) |
items.py | Defines the data model (item templates) for structured data, similar to Django's Model |
pipelines.py | Item-processing behaviour, e.g. persisting the structured data |
settings.py | Configuration file, e.g. crawl depth, concurrency, download delay |
spiders | Spider directory: create spider files here and write the crawling rules in them |
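As an illustration of what items.py is for, a minimal item for this project could declare the four fields the spider collects. The WeatherItem name and fields below are hypothetical; the spider in the next step actually yields plain dicts, which Scrapy accepts just as well:

```python
import scrapy


class WeatherItem(scrapy.Item):
    # One field per value scraped from the forecast page
    date = scrapy.Field()
    state = scrapy.Field()
    temp = scrapy.Field()
    wind = scrapy.Field()
```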
Starting the crawl
1. Write the scraping logic in the spider file you created under the spiders folder, which in this example is spiders/changshu.py.
The code looks like this:
```python
import scrapy


class ChangshuSpider(scrapy.Spider):
    name = 'changshu'
    allowed_domains = ['tianqi.2345.com']
    start_urls = ['https://tianqi.2345.com/today-60038.htm']

    def parse(self, response):
        # Date, weather condition, temperature and wind level for each day.
        # The values are extracted with XPath; if you are new to XPath, it is
        # worth a quick look, the syntax is simple.
        dates = response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
        states = response.xpath('//a[@class="seven-day-item "]/i/text()').getall()
        temps = response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
        winds = response.xpath('//a[@class="seven-day-item "]/span[@class="wind-name"]/text()').getall()

        # Yield one record per day
        for date, state, temp, wind in zip(dates, states, temps, winds):
            yield {
                'date': date,
                'state': state,
                'temp': temp,
                'wind': wind
            }
```
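If the crawl comes back empty, the page markup has likely changed since this was written. The same XPath expressions can be tested interactively in scrapy shell before running the full spider; a quick sketch, assuming the selectors above are still current:

```python
# First start an interactive session:
#   scrapy shell https://tianqi.2345.com/today-60038.htm
# then evaluate the expressions used in parse(); note that the trailing space
# inside "seven-day-item " is deliberate and must match the page's HTML.
response.xpath('//a[@class="seven-day-item "]/em/text()').getall()
response.xpath('//a[@class="seven-day-item "]/span[@class="tem-show"]/text()').getall()
```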
2. Configure settings.py
Set the User-Agent:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'
Disable robots.txt compliance so the site's robots rules do not block the crawl:
ROBOTSTXT_OBEY = False
The whole file then looks like this:
```python
# Scrapy settings for spider_weather project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'spider_weather'

SPIDER_MODULES = ['spider_weather.spiders']
NEWSPIDER_MODULE = 'spider_weather.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'spider_weather.middlewares.SpiderWeatherDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# ITEM_PIPELINES = {
#     'spider_weather.pipelines.SpiderWeatherPipeline': 300,
# }

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
3. Then run the following from the command line:
scrapy crawl changshu -o weather.csv
Note: this must be run from inside the spider_weather project directory.
The general form is scrapy crawl spider_name -o weather.csv, where the -o argument names the export file.
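As an alternative to the -o flag, Scrapy 2.1+ can also declare the export in settings.py through the FEEDS setting; a sketch for this project, with weather.csv as an example output name:

```python
# settings.py: roughly the same effect as running `scrapy crawl changshu -o weather.csv`
FEEDS = {
    'weather.csv': {
        'format': 'csv',
        'encoding': 'utf-8',
    },
}
```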
4. The result looks like this:
Supplement: some field-related issues when Scrapy exports CSV
When exporting with scrapy -o in CSV format, the columns in the output file follow neither the order defined in items.py nor the order written in the spider, which makes some fields awkward to read; in addition, the exported CSV can end up with blank lines between items. This section describes how to solve both problems.
1. Field order:
Create a new file named csv_item_exporter.py inside the project's spiders directory, with the content below (the filename can be changed, but the module path must match the FEED_EXPORTERS setting configured in the next step):
```python
# Imports updated for Scrapy 2.x: scrapy.conf and scrapy.contrib.exporter no longer exist
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings


class MyProjectCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        settings = get_project_settings()
        # Read the delimiter and the desired column order from the project
        # settings and pass them on to the stock CsvItemExporter.
        kwargs['delimiter'] = settings.get('CSV_DELIMITER', ',')
        fields_to_export = settings.get('FIELDS_TO_EXPORT', [])
        if fields_to_export:
            kwargs['fields_to_export'] = fields_to_export
        super().__init__(*args, **kwargs)
```
Then add the following to settings.py:
```python
# Register the custom exporter for the csv output format
FEED_EXPORTERS = {
    'csv': 'project_name.spiders.csv_item_exporter.MyProjectCsvItemExporter',
}

# Specify the order of the CSV output fields
FIELDS_TO_EXPORT = [
    'name',
    'title',
    'info',
]

# Specify the delimiter
CSV_DELIMITER = ','
```
With this in place, running scrapy crawl spider -o spider.csv produces columns in the order listed in FIELDS_TO_EXPORT.
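For what it's worth, current Scrapy releases also expose a built-in setting that controls the column order without a custom exporter; for the weather project above it would presumably be:

```python
# settings.py: built-in alternative to the custom exporter for ordering the columns
FEED_EXPORT_FIELDS = ['date', 'state', 'temp', 'wind']
```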
2. Blank lines in the exported CSV
You may also notice blank lines between rows of the CSV file. This happens (typically on Windows) because the stream the CSV writer uses is not opened with newline='', so each row terminator picks up an extra carriage return that renders as an empty line.
Solution:
Find the CsvItemExporter class in Scrapy's exporters.py (around line 215) and add newline="" to the io.TextIOWrapper call there; recent Scrapy releases may already include this, in which case no change is needed.
Alternatively, subclass CsvItemExporter and override it rather than editing the installed package.
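Both fixes boil down to the same idea: the csv writer's output stream must be opened with newline='', otherwise on Windows each row terminator gains an extra carriage return that shows up as a blank line. A standalone sketch of that behaviour outside Scrapy, with a made-up file name and sample row:

```python
import csv

# newline='' stops the csv module's '\r\n' row terminator from being
# translated into '\r\r\n' on Windows, which is what appears as blank lines.
with open('demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'state', 'temp', 'wind'])
    writer.writerow(['01/01', 'Sunny', '3°C', 'Level 3'])
```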
Summary
This concludes the article on crawling weather data with Python Scrapy and exporting it to a CSV file. For more on exporting crawled weather data to CSV with Scrapy, search 腳本之家's earlier articles, and please keep supporting 腳本之家!