Example Code for Batch-Collecting IP Proxies with Python

Updated: 2022-09-29 10:16:46 Author: 小圓-

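This article explains in detail how to use Python to collect IP proxies in bulk and check whether each proxy is usable. The example code is commented step by step and can be used as a reference.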

Development environment

Python 3.8

PyCharm

Modules used

import requests (third-party; install with pip install requests)

import parsel (third-party data parsing module; install with pip install parsel)

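To install a third-party Python module, use either of these:

Press Win + R, type cmd, click OK, then type the install command pip install <module name> (for example, pip install requests) and press Enter
In PyCharm, click Terminal and type the install command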
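
Before running the script, it can help to confirm that both third-party modules are importable. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and not part of the original article.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the names from `names` that the import system cannot find."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # requests and parsel are the two third-party modules this article uses
    missing = missing_modules(['requests', 'parsel'])
    if missing:
        print('Install first: pip install ' + ' '.join(missing))
    else:
        print('All required modules are installed')
```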

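IP proxies: if you collect data from a website too quickly, your IP may be blocked, meaning you cannot access that site for a period of time. Sending requests through proxy IPs is a way around this.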

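Basic workflow (approach)

1. Data source analysis

First work out which URL you need to request to get the data you want…

Use the browser developer tools to capture traffic and trace where the data comes from.

I. Press F12, or right-click and choose Inspect, open the Network tab, and refresh the page

II. Find out where the data (the IP and port) comes from

Use the keyword search in the developer tools to locate the request that contains the data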
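
The keyword search in the developer tools can also be reproduced in code: if an IP you saw on the page appears in the raw response text of a request, that request is the data source. A minimal sketch, assuming the response body is already available as a string; the helper find_keyword_lines and the sample HTML are made up for illustration.

```python
from typing import List


def find_keyword_lines(text: str, keyword: str) -> List[int]:
    """Return the 1-based numbers of the lines in `text` that contain `keyword`."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if keyword in line
    ]


# Made-up response body standing in for response.text
sample_html = """<table>
<tr><td>10.0.0.1</td><td>8080</td></tr>
<tr><td>10.0.0.2</td><td>3128</td></tr>
</table>"""

# A non-empty result means this response contains the IP seen on the page
print(find_keyword_lines(sample_html, '10.0.0.2'))
```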

2. Implementation steps

The four basic steps of a crawler

Send the request: imitate a browser and request the URL found in the analysis, https://free.kuaidaili.com/free/inha/1/

Get the data: read the response returned by the server (shown under Response in the developer tools)

Parse the data: extract the fields we want

Save the data: write the extracted data to a local file
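
The four steps can be laid out as one small function each. The sketch below is an outline only, not the article's script: it uses the standard library so that it runs without a network connection, takes the download step as a parameter (fetch), and parses with a regular expression instead of parsel. The function names, the row pattern, and the sample HTML are all assumptions made for illustration.

```python
import io
import json
import re
from typing import Callable, Dict, List, TextIO

# Assumed row layout: the first two <td> cells of a row hold the IP and the port
ROW_PATTERN = re.compile(
    r'<tr[^>]*>\s*<td[^>]*>\s*(\d{1,3}(?:\.\d{1,3}){3})\s*</td>'
    r'\s*<td[^>]*>\s*(\d{1,5})\s*</td>'
)


def parse_rows(html: str) -> List[Dict[str, str]]:
    """Step 3: extract every (ip, port) pair from the HTML text."""
    return [{'ip': ip, 'port': port} for ip, port in ROW_PATTERN.findall(html)]


def save_rows(rows: List[Dict[str, str]], out: TextIO) -> None:
    """Step 4: write the rows to an open text file, one JSON object per line."""
    for row in rows:
        out.write(json.dumps(row) + '\n')


def crawl(fetch: Callable[[str], str], url: str, out: TextIO) -> List[Dict[str, str]]:
    """Run the four steps for a single URL and return the parsed rows."""
    html = fetch(url)        # steps 1 and 2: send the request, get the response text
    rows = parse_rows(html)  # step 3: parse
    save_rows(rows, out)     # step 4: save
    return rows


def fake_fetch(url: str) -> str:
    """Stand-in for a real download; returns made-up HTML."""
    return ('<tr><td>10.0.0.1</td><td>8080</td></tr>'
            '<tr><td>10.0.0.2</td><td>3128</td></tr>')


buffer = io.StringIO()  # stands in for a file opened in append mode
rows = crawl(fake_fetch, 'https://example.invalid/page/1/', buffer)
print(rows)
print(buffer.getvalue())
```

With a real download, fetch could be a small wrapper such as lambda url: requests.get(url, headers=headers).text, and out a file opened with open(path, mode='a', encoding='utf-8').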

Code

# Import the HTTP request module ---> third-party, install from cmd with: pip install requests
import requests
# Import the data parsing module ---> third-party, install from cmd with: pip install parsel
import parsel
# Import the json module ---> built in, no installation needed
import json

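# 1. Send the request: imitate a browser and request the URL found in the analysis
proxies_list = []    # proxies that passed the availability check
proxies_list_1 = []  # every proxy collected
# Request URL, pages 1 to 10 of the listing analysed above
for page in range(1, 11):
    url = f'https://free.kuaidaili.com/free/inha/{page}/'
    """
    The headers dict disguises the request as coming from a browser.
        Without headers the request is exposed ----> it tells the server: I'm a crawler, come and catch me
        What to add and where to find it ---> copy the User-Agent (ua) from the developer tools
    """
    headers = {
        # User-Agent: the basic identity string of the browser
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
    }
    # Send the request ---> <Response [200]> is the response object; status code 200 means the request succeeded
    response = requests.get(url=url, headers=headers)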
    # 2. Get the data: the server's response body is available as response.text (try print(response.text))
    """
    3. Parse the data: extract the fields we want
        Parsing methods:
            re (regular expressions): extract from string data
            css: extract by tag attributes
            xpath: extract by tag nodes
    """
    # Convert response.text <string data> into a Selector: <Selector xpath=None data='<html>\n<head>\n<meta http-equiv="X-UA-...'>
    selector = parsel.Selector(response.text)
    # Get the tr tags ---> returns a list whose elements are Selector objects
    trs = selector.css('#list table tbody tr')
    # Equivalent XPath query, shown for comparison only (not used below)
    trs_1 = selector.xpath('//*[@id="list"]/table/tbody/tr')
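    # Loop over the tr tags one at a time
    for tr in trs:
        # Extract the IP: td:nth-child(1)::text gets the text of the first td tag
        ip_num = tr.css('td:nth-child(1)::text').get()
        # ip_num_1 = tr.xpath('td[1]/text()').get()
        ip_port = tr.css('td:nth-child(2)::text').get()
        # Skip rows where a cell is missing; concatenating None would raise TypeError
        if not ip_num or not ip_port:
            continue
        ip_num, ip_port = ip_num.strip(), ip_port.strip()
        """
        What does a proxy dict look like?
         proxies_dict = {
                    "http": "http://" + ip:port,
                    "https": "http://" + ip:port,
                }
        """
        # The keys select which requests use the proxy; the proxy itself is
        # addressed with http:// in both entries, matching the structure above
        proxies_dict = {
            "http": "http://" + ip_num + ':' + ip_port,
            "https": "http://" + ip_num + ':' + ip_port,
        }
        proxies_list_1.append(proxies_dict)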
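        # Check whether the proxy works by requesting a site through it
        try:
            response_1 = requests.get(url='https://www.baidu.com/', proxies=proxies_dict, timeout=1)
            if response_1.status_code == 200:
                proxies_list.append(proxies_dict)
                print('Proxy is usable: ', proxies_dict)
                # Save the proxy to a text file, one JSON object per line
                with open('代理.txt', mode='a', encoding='utf-8') as f:
                    f.write(json.dumps(proxies_dict))
                    f.write('\n')
        except requests.exceptions.RequestException:
            # Covers timeouts, refused connections and proxy errors
            print('Current proxy:', proxies_dict, 'request failed or timed out, check not passed')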

print('===' * 50)
print('Total proxies collected:', len(proxies_list_1))
print('Usable proxies: ', len(proxies_list))
print(proxies_list)
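
Once the script has run, the saved proxies can be read back in a later session. A minimal sketch, assuming the file holds one JSON object per line as the script writes it; the helper names load_proxies and pick_proxy are hypothetical, and an in-memory buffer with made-up addresses stands in for 代理.txt so the example runs on its own. Because the script opens the file in append mode, repeated runs can leave duplicate lines, so duplicates are dropped while loading.

```python
import io
import json
import random
from typing import Dict, Iterable, List


def load_proxies(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse one proxy dict per non-empty line, dropping exact duplicates."""
    proxies: List[Dict[str, str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        proxy = json.loads(line)
        if proxy not in proxies:
            proxies.append(proxy)
    return proxies


def pick_proxy(proxies: List[Dict[str, str]]) -> Dict[str, str]:
    """Choose one proxy at random, for example to pass as proxies= to requests.get."""
    return random.choice(proxies)


# Made-up file content; with the real file use: open('代理.txt', encoding='utf-8')
saved = io.StringIO(
    '{"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"}\n'
    '{"http": "http://10.0.0.1:8080", "https": "http://10.0.0.1:8080"}\n'
    '\n'
    '{"http": "http://10.0.0.2:3128", "https": "http://10.0.0.2:3128"}\n'
)
proxies = load_proxies(saved)
print(len(proxies), pick_proxy(proxies))
```

Free proxies tend to stop working quickly, so a proxy loaded this way would normally be checked again before use.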
