Using Proxy IPs in a Python Crawler

Updated: 2019-10-27 16:32:47  Author: Steven·簡談

This article explains how to use proxy IPs in a Python crawler, with example code for Urllib, Requests, and Selenium.

When you run a crawler against a site that limits how fast or how often a client may make requests, your IP can easily get banned, which means you cannot continue working for some time. A proxy IP helps here: however the site blocks addresses, you can carry on as long as you can find a fresh proxy IP.

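Many sites currently offer free proxy IPs, although paid ones tend to work better. Besides showing how to use a proxy IP, this article also tries out the proxy IP pool built in an earlier article, "Building a Proxy IP Pool in Python (Part 1): Getting IPs". A proxy IP can be obtained simply by calling the interface the pool exposes, so let's see how to use it.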

The test URL is http://httpbin.org/get. Requesting it returns information about the request, and the origin field is the client's IP. We use that field to check whether the proxy was applied, that is, whether the IP was successfully disguised.
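The origin check can be automated. The following is a minimal sketch, not part of the original article: the helper names origin_addresses and proxy_applied are hypothetical, and it assumes the proxy exits from the same address it listens on, which is not true of every proxy.

```python
import json


def origin_addresses(body):
    """Return the IPs listed in the "origin" field of an httpbin response body."""
    origin = json.loads(body).get('origin', '')
    # httpbin may report several comma-separated addresses
    return [part.strip() for part in origin.split(',') if part.strip()]


def proxy_applied(body, proxy_host):
    """True if httpbin saw the request coming from proxy_host (hypothetical helper)."""
    return proxy_host in origin_addresses(body)
```

For example, proxy_applied(response_text, '108.61.201.231') would report whether the request appeared to come from that proxy.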

Getting an IP

The proxy pool uses Flask to expose an interface for fetching a proxy: http://localhost:5555/random

Requesting this interface and reading the response body gives you an IP.
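Fetching from the pool can be wrapped in a small helper so that a malformed response is caught early. This is a sketch under assumptions: the article does not show the exact response format, so the plain-text host:port shape checked here is a guess, and both function names are hypothetical.

```python
from urllib.request import urlopen

POOL_URL = 'http://localhost:5555/random'  # pool interface named in the article


def parse_proxy_address(raw):
    """Clean up the pool's response; the host:port format is an assumption."""
    address = raw.strip()
    host, sep, port = address.rpartition(':')
    if not (sep and host and port.isdigit()):
        raise ValueError('unexpected proxy address: %r' % raw)
    return address


def get_proxy_address(pool_url=POOL_URL, timeout=5):
    """Ask the pool for one proxy address (hypothetical helper)."""
    with urlopen(pool_url, timeout=timeout) as response:
        return parse_proxy_address(response.read().decode('utf-8'))
```

If your pool returns JSON or a bare IP instead, only parse_proxy_address needs to change.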

Urllib

First, let's look at how to set a proxy with Urllib:

from urllib.error import URLError
import urllib.request
from urllib.request import ProxyHandler, build_opener

# Fetch a proxy address from the proxy pool
ip_response = urllib.request.urlopen("http://localhost:5555/random")
ip = ip_response.read().decode('utf-8').strip()

# Map each URL scheme to the proxy that should handle it
proxy_handler = ProxyHandler({
    'http': 'http://' + ip,
    'https': 'https://' + ip
})
opener = build_opener(proxy_handler)
try:
    # httpbin echoes the request; "origin" shows the IP it saw
    response = opener.open('http://httpbin.org/get')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Output:

{
 "args": {},
 "headers": {
  "Accept-Encoding": "identity",
  "Host": "httpbin.org",
  "User-Agent": "Python-urllib/3.7"
 },
 "origin": "108.61.201.231, 108.61.201.231",
 "url": "https://httpbin.org/get"
}

Urllib sets a proxy with ProxyHandler. Its argument is a dictionary whose keys are protocol names and whose values are proxy URLs; each proxy URL must be prefixed with a scheme, http or https. When the requested link uses http, the http proxy is used; when it uses https, the https proxy is used. The proxies configured here are therefore http://108.61.201.231 and https://108.61.201.231.

Once the ProxyHandler object is created, pass it to build_opener() to create an Opener. The Opener then has the proxy configured, and calling its open() method accesses the link through that proxy.
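Because free proxies fail often, the ProxyHandler and build_opener() steps can be combined with a retry that asks for a fresh proxy after each failure. This is only a sketch: make_opener and fetch_with_retry are hypothetical names, get_address stands for any callable that returns a proxy address, and pointing both keys at an http:// URL assumes the pool hands out plain HTTP proxies.

```python
from urllib.request import ProxyHandler, build_opener


def make_opener(address):
    """Build an Opener that sends http and https requests through one proxy."""
    proxy_url = 'http://' + address  # assumption: a plain HTTP proxy
    return build_opener(ProxyHandler({'http': proxy_url, 'https': proxy_url}))


def fetch_with_retry(url, get_address, attempts=3, timeout=10,
                     opener_factory=make_opener):
    """Try up to `attempts` proxies, returning the first successful body."""
    last_error = None
    for _ in range(attempts):
        opener = opener_factory(get_address())
        try:
            with opener.open(url, timeout=timeout) as response:
                return response.read().decode('utf-8')
        except OSError as error:  # URLError and socket timeouts are OSError subclasses
            last_error = error
    raise RuntimeError('all %d proxies failed' % attempts) from last_error
```

A call might look like fetch_with_retry('http://httpbin.org/get', get_address), where get_address reads one address from the pool interface each time it is called.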

Requests

Setting a proxy in Requests only requires passing the proxies parameter:

import requests

# Fetch a proxy address from the proxy pool
ip_response = requests.get("http://localhost:5555/random")
ip = ip_response.text.strip()

# Map each URL scheme to the proxy that should handle it
proxies = {
    'http': 'http://' + ip,
    'https': 'https://' + ip,
}
try:
    response = requests.get('http://httpbin.org/get', proxies=proxies)
    print(response.text)
except requests.exceptions.ConnectionError as e:
    print('Error', e.args)

Output:

{
 "args": {},
 "headers": {
  "Accept": "*/*",
  "Accept-Encoding": "gzip, deflate",
  "Host": "httpbin.org",
  "User-Agent": "python-requests/2.21.0"
 },
 "origin": "47.90.28.54, 47.90.28.54",
 "url": "https://httpbin.org/get"
}

With Requests you only need to build the proxy dictionary and pass it through the proxies parameter, which is fairly simple.

Selenium

import requests
from selenium import webdriver
import time

# Fetch a proxy address from the pool with the requests library
ip_response = requests.get("http://localhost:5555/random")
ip = ip_response.text.strip()

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=http://' + ip)
# Recent Selenium releases take `options=`; the older `chrome_options=` keyword is deprecated
browser = webdriver.Chrome(options=chrome_options)
browser.get('http://httpbin.org/get')
time.sleep(5)

Output: