
Nginx Anti-Crawler Strategy: Blocking Crawlers by User-Agent

Updated: 2020-09-16 10:22:16  Author: Mr.Yong

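The web is full of crawlers. Some help a site get indexed, such as Baidu's spider (Baiduspider). Others ignore robots rules, put load on the server, and bring the site no traffic in return. To keep a site from being scraped by these, we can configure Nginx to reject requests based on the User-Agent (UA) header, which blocks most of the unwanted crawlers.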

Create the anti-crawler policy file:

vim /usr/www/server/nginx/conf/anti_spider.conf

File contents:

# Block scraping tools such as Scrapy (case-insensitive match)
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}
# Block the listed UAs and requests with an empty UA (case-sensitive match)
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|FlightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$") {
    return 403;
}
# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}
# To block a single IP:
#deny 123.45.6.7;
# To block the whole range 123.0.0.1 to 123.255.255.254:
#deny 123.0.0.0/8;
# To block the range 123.45.0.1 to 123.45.255.254:
#deny 123.45.0.0/16;
# To block the range 123.45.6.1 to 123.45.6.254:
#deny 123.45.6.0/24;
# The following IPs are known offenders
#deny 58.95.66.0/24;
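
Note that the first rule matches with `~*` (case-insensitive) while the second uses `~` (case-sensitive), so a UA such as `yyspider` would not be caught by the second rule. The sketch below approximates the second rule offline with `grep -E`, so the pattern can be tried against sample UAs before it is deployed. It is only an illustration: `ua_is_blocked` and `BLOCKED_UA` are hypothetical names, only a few entries from the list are reproduced, and `grep -E` uses POSIX extended regular expressions rather than the PCRE engine nginx uses, although the two should agree for a simple alternation like this one.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of nginx: approximates the case-sensitive
# "~" match from anti_spider.conf with grep -E.
# Assumption: only a subset of the blocklist is reproduced here.
BLOCKED_UA='WinHttp|FeedDemon|AhrefsBot|Python-urllib|YYSpider|MJ12bot|YandexBot|^$'

ua_is_blocked() {
  # Feed the UA as a single line; "^$" then matches an empty UA.
  printf '%s\n' "$1" | grep -Eq "$BLOCKED_UA"
}

# Try a blocked UA, a lowercase variant, an empty UA, and a search engine UA.
for ua in 'YYSpider' 'yyspider' '' 'Baiduspider'; do
  if ua_is_blocked "$ua"; then
    echo "blocked: '$ua'"
  else
    echo "allowed: '$ua'"
  fi
done
```

If the lowercase variant should be rejected as well, switching the second rule to `~*` is one option, at the cost of broader matches for short entries such as `Java` or `oBot`.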

Using the configuration

Include the file in the site's server block:

# Anti-crawler rules
include /usr/www/server/nginx/conf/anti_spider.conf;

Finally, restart nginx.
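
As a rough sketch of that last step, the configuration can be syntax-checked before it is applied, so that a typo in anti_spider.conf does not take the site down. This example uses a reload rather than a full restart, since `nginx -s reload` re-reads the configuration. The function name `reload_nginx` and the `NGINX_BIN` variable are hypothetical, and the example assumes the nginx binary is on the PATH.

```shell
#!/usr/bin/env bash
# Assumption: the nginx binary is on PATH; set NGINX_BIN to its full
# path otherwise.
NGINX_BIN="${NGINX_BIN:-nginx}"

reload_nginx() {
  # -t tests the configuration; the reload signal is sent only if the
  # test succeeds.
  "$NGINX_BIN" -t && "$NGINX_BIN" -s reload
}

# Usage (typically needs root privileges):
#   reload_nginx
```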

Verifying that it works

Simulate YYSpider:

λ curl -X GET -I -A 'YYSpider' https://www.myong.top
HTTP/1.1 200 Connection established
HTTP/2 403
server: marco/2.11
date: Fri, 20 Mar 2020 08:48:50 GMT
content-type: text/html
content-length: 146
x-source: C/403
x-request-id: 3ed800d296a12ebcddc4d61c57500aa2

Simulate Baidu's Baiduspider:

λ curl -X GET -I -A 'BaiduSpider' https://www.myong.top
HTTP/1.1 200 Connection established
HTTP/2 200
server: marco/2.11
date: Fri, 20 Mar 2020 08:49:47 GMT
content-type: text/html
vary: Accept-Encoding
x-source: C/200
last-modified: Wed, 18 Mar 2020 13:16:50 GMT
etag: "5e721f42-150ce"
x-request-id: e82999a78b7d7ea2e9ff18b6f1f4cc84
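
The two checks above cover only the UA list. The other rules (empty UA, request methods other than GET, HEAD, and POST) can be probed the same way. The following is a minimal sketch: `status_for` and `SITE` are hypothetical names, and the default URL is simply the one used in the transcripts above. Keep in mind that the first rule matches `Curl` case-insensitively, so curl's own default UA would also be rejected.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print only the HTTP status code for a given
# User-Agent and request method. SITE is an assumption; use your own URL.
SITE="${SITE:-https://www.myong.top}"

status_for() {
  local ua="$1" method="${2:-GET}"
  # -s hides progress output, -o /dev/null discards the body,
  # -w '%{http_code}\n' prints just the status code.
  curl -s -o /dev/null -X "$method" -A "$ua" -w '%{http_code}\n' "$SITE"
}

# Usage (requires network access):
#   status_for 'YYSpider'             # a UA from the blocklist
#   status_for ''                     # no UA
#   status_for 'Mozilla/5.0' DELETE   # method outside GET/HEAD/POST
```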

Common crawler User-Agents

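User-Agent             Behavior
FeedDemon              content scraping
BOT/0.1 (BOT for JCE)  SQL injection
CrawlDaddy             SQL injection
Java                   content scraping
Jullo                  content scraping
Feedly                 content scraping
UniversalFeedParser    content scraping
ApacheBench            CC attack tool
Swiftbot               useless crawler
YandexBot              useless crawler
AhrefsBot              useless crawler
YisouSpider            useless crawler (since acquired by UC Shenma Search; this spider can be allowed)
jikeSpider             useless crawler
MJ12bot                useless crawler
ZmEu                   phpMyAdmin vulnerability scanning
WinHttp                scraping, CC attacks
EasouSpider            useless crawler
HttpClient             TCP attacks
Microsoft URL Control  scanning
YYSpider               useless crawler
jaunty                 WordPress brute-force scanner
oBot                   useless crawler
Python-urllib          content scraping
Indy Library           scanning
FlightDeckReports Bot  useless crawler
Linguee Bot            useless crawler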

