How to Block Web Crawlers on an Nginx Server

Updated: 2019-03-16 10:35:27  Author: CODETC

This article shows how to block web crawlers on an Nginx server. It should be a useful reference for anyone who needs to do the same.

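Most websites are visited by plenty of crawlers that are not search engines. The majority are content scrapers or scripts written by beginners. Unlike search-engine crawlers, they do not throttle their request rate, so they often consume a large share of server resources and waste bandwidth.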

Nginx can easily filter requests by User-Agent. A simple regular expression in the location block that serves as the URL entry point is enough to reject unwanted crawler requests:

location / {
  if ($http_user_agent ~* "python|curl|java|wget|httpclient|okhttp") {
    return 503;
  }
  # other normal configuration
  ...
}

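Note: $http_user_agent is an Nginx variable that can be referenced directly inside a location block. The ~* operator performs a case-insensitive regular-expression match. As a rough estimate, matching on python alone filters out about 80% of Python crawlers.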
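To see which User-Agent strings the rule above would catch, the match can be approximated outside Nginx. The following is a minimal Python sketch, not part of the original configuration: the function name is_blocked and the sample User-Agent strings are assumptions chosen for illustration, and re.IGNORECASE stands in for the ~* operator. For a plain alternation like this one, Python's re module and the PCRE engine used by Nginx should behave the same.

```python
import re

# Same alternation as the Nginx rule; re.IGNORECASE mirrors the "~*" operator.
BLOCKED_UA = re.compile(r"python|curl|java|wget|httpclient|okhttp", re.IGNORECASE)


def is_blocked(user_agent: str) -> bool:
    """Return True if the pattern occurs anywhere in the User-Agent string.

    The search is unanchored, so a substring match is enough.
    (is_blocked is a hypothetical helper name, not an Nginx concept.)
    """
    return BLOCKED_UA.search(user_agent) is not None


if __name__ == "__main__":
    # Illustrative sample values, not taken from real logs.
    samples = [
        "python-requests/2.31.0",
        "Python-urllib/3.11",
        "curl/7.64.1",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    ]
    for ua in samples:
        verdict = "blocked" if is_blocked(ua) else "allowed"
        print(f"{ua!r}: {verdict}")
```

Because the match only looks at the header text, a crawler that sends a browser-like User-Agent passes straight through, so this rule is a first filter rather than a complete defense.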

Blocking search-engine crawlers in Nginx

server { 
    listen    80; 
    server_name www.xxx.com; 
    #charset koi8-r; 
    #access_log logs/host.access.log main; 
    #location / { 
    #  root  html; 
    #  index index.html index.htm; 
    #} 
    # Return 403 to any request whose User-Agent matches a listed search-engine crawler
    if ($http_user_agent ~* "qihoobot|Baiduspider|Googlebot|Googlebot-Mobile|Googlebot-Image|Mediapartners-Google|Adsbot-Google|Feedfetcher-Google|Yahoo! Slurp|Yahoo! Slurp China|YoudaoBot|Sosospider|Sogou spider|Sogou web spider|MSNBot|ia_archiver|Tomato Bot") {
        return 403;
    }
    # Proxy every other request to the backend application
    location ~ ^/(.*)$ {
        proxy_pass http://localhost:8080;
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        client_max_body_size 10m;
        client_body_buffer_size 128k;
        proxy_connect_timeout 90;
        proxy_send_timeout 90;
        proxy_read_timeout 90;
        proxy_buffer_size 4k;
        proxy_buffers 4 32k;
        proxy_busy_buffers_size 64k;
        proxy_temp_file_write_size 64k;
    }
    #error_page 404       /404.html; 
    # redirect server error pages to the static page /50x.html 
    # 
    error_page  500 502 503 504 /50x.html; 
    location = /50x.html { 
      root  html; 
    } 
    # proxy the PHP scripts to Apache listening on 127.0.0.1:80 
    # 
    #location ~ \.php$ { 
    #  proxy_pass  http://127.0.0.1; 
    #} 
    # pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000 
    # 
    #location ~ \.php$ { 
    #  root      html; 
    #  fastcgi_pass  127.0.0.1:9000; 
    #  fastcgi_index index.php; 
    #  fastcgi_param SCRIPT_FILENAME /scripts$fastcgi_script_name; 
    #  include    fastcgi_params; 
    #} 
    # deny access to .htaccess files, if Apache's document root 
    # concurs with nginx's one 
    # 
    #location ~ /\.ht { 
    #  deny all; 
    #} 
}

You can test the rule with curl.
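The source does not show the curl command itself. A check along the lines of curl -A "Baiduspider" http://www.xxx.com/ (the -A option sets the User-Agent header) should come back with a 403 if the rule is active. The same check can be scripted; the sketch below is a hedged Python equivalent using only the standard library. The helper names build_request and status_for are hypothetical, and www.xxx.com is the placeholder host from the configuration above, so substitute a real server before running it.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request carrying the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def status_for(url: str, user_agent: str, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises HTTPError for 4xx/5xx responses, so the code is
    read from the exception in that case.
    """
    req = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


if __name__ == "__main__":
    # Placeholder host taken from the server_name in the config above.
    url = "http://www.xxx.com/"
    for ua in ("Baiduspider", "Mozilla/5.0"):
        print(ua, "->", status_for(url, ua))
```

With the configuration above in place, the crawler User-Agent would be expected to receive 403, while the browser-like one should get whatever the backend normally returns.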