Scrapy framework basic commands and settings.py configuration
This article walks through Scrapy's basic commands and the settings.py configuration, shared for readers' reference. The details are as follows:
Basic Scrapy commands
1. Create a crawler project
scrapy startproject [project_name]
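For example, scrapy startproject maitian generates a project skeleton like the following (the exact layout can vary slightly between Scrapy versions):
maitian/
    scrapy.cfg            # deploy configuration file
    maitian/              # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings (covered below)
        spiders/          # directory where spiders live
            __init__.py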
2. Create a spider file
scrapy genspider [spider_name] [domain]
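For example, scrapy genspider zufang bj.maitian.cn (the spider name and domain here are illustrative) creates maitian/spiders/zufang.py with a skeleton like:
import scrapy

class ZufangSpider(scrapy.Spider):
    name = 'zufang'
    allowed_domains = ['bj.maitian.cn']
    start_urls = ['http://bj.maitian.cn/']

    def parse(self, response):
        pass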
3. Run a spider (crawl)
scrapy crawl [spider_name]
# -o exports the scraped data to a file:
scrapy crawl [spider_name] -o zufang.json
scrapy crawl [spider_name] -o zufang.csv
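Note: -o appends to the output file if it already exists. Newer Scrapy releases (2.1+) also accept -O to overwrite the file instead; the export format (JSON, CSV, XML, JSON Lines) is inferred from the file extension.
scrapy crawl [spider_name] -O zufang.json # overwrite instead of append (Scrapy >= 2.1)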
4. Check for errors (check)
scrapy check
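scrapy check runs the contract tests embedded in spider docstrings. A minimal sketch of a spider method with contracts (the spider name, URL, and fields are illustrative):
import scrapy

class ZufangSpider(scrapy.Spider):
    name = 'zufang'

    def parse(self, response):
        """Parse one listing page.

        @url http://www.example.com/listings
        @returns items 1 16
        @scrapes title price
        """
        for row in response.css('div.listing'):
            yield {
                'title': row.css('a::text').get(),
                'price': row.css('span.price::text').get(),
            }

Here @url tells scrapy check which page to fetch, @returns asserts that parse yields between 1 and 16 items, and @scrapes asserts that every item contains the title and price fields.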
5. List all spiders in the project (list)
scrapy list
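For example, in a project that defines two spiders (names here are illustrative), the output is one spider name per line:
$ scrapy list
ershoufang
zufang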
6. Download a page and open it in the browser (view), showing the page as Scrapy sees it
scrapy view http://www.baidu.com
7. Enter an interactive shell (scrapy shell)
scrapy shell https://www.baidu.com
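Inside the shell the downloaded page is bound to response, so selectors can be tested interactively before being written into a spider. A few typical commands (the extra URL is a placeholder):
>>> response.url                            # URL of the fetched page
>>> response.css('title::text').get()       # extract the title with a CSS selector
>>> response.xpath('//title/text()').get()  # the same extraction with XPath
>>> fetch('https://www.example.com')        # download a different page in place
>>> view(response)                          # open the current response in a browser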
8. Run a standalone spider file (scrapy runspider)
scrapy runspider zufang_spider.py
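runspider executes a single, self-contained spider file without requiring a project. A minimal sketch of what zufang_spider.py might contain (the URL and selectors are placeholders, not the author's actual spider):
import scrapy

class ZufangSpider(scrapy.Spider):
    name = 'zufang'
    start_urls = ['http://www.example.com/zufang']

    def parse(self, response):
        # Extract each listing on the page; the selectors are illustrative.
        for item in response.css('div.list_item'):
            yield {
                'title': item.css('h2 a::text').get(),
                'price': item.css('span.price::text').get(),
            }
        # Follow the "next page" link, if one exists.
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)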
Scrapy framework: settings.py configuration
# -*- coding: utf-8 -*-

# Scrapy settings for maitian project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'maitian'

SPIDER_MODULES = ['maitian.spiders']
NEWSPIDER_MODULE = 'maitian.spiders'

# Only a single, fixed user agent can be set here (it cannot be set in batch)
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'maitian (+http://www.yourdomain.com)'

# Obey robots.txt rules (obeyed by default; set to False to ignore them)
ROBOTSTXT_OBEY = False

# Write the log to a file
LOG_FILE = "maitian.log"
# There are five log levels: DEBUG, INFO, WARNING, ERROR, CRITICAL.
# The higher the level, the less log output is produced.
#LOG_LEVEL = "INFO"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay (in seconds) between requests to the same website (default: 0);
# e.g. with a concurrency of 16, send 16 requests, wait 3 seconds, send the next 16
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable the Telnet console used for remote debugging (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'maitian.middlewares.MaitianDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Enable item pipelines in this configuration file.
# Priorities range from 0 to 1000; the lower the value, the higher the priority.
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'maitian.pipelines.MaitianPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
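The settings above apply project-wide. To override a setting for one spider only, Scrapy also supports a custom_settings class attribute; a short sketch (the values are examples, not recommendations):
import scrapy

class ZufangSpider(scrapy.Spider):
    name = 'zufang'
    # Per-spider overrides of the project-wide settings.py values
    custom_settings = {
        'DOWNLOAD_DELAY': 3,
        'LOG_LEVEL': 'INFO',
        'ITEM_PIPELINES': {'maitian.pipelines.MaitianPipeline': 300},
    }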
Hopefully this article is helpful to readers doing Python development with the Scrapy framework.