Collecting IP proxies from a website with Python and checking whether they work

Updated: 2022-01-23 10:03:04 Author: 松鼠愛吃餅干

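This article shows how to use a Python crawler to collect IP proxies from a website and check whether each proxy is usable. The example code is explained step by step; if you are interested, give it a try.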

Development environment

Python 3.8

PyCharm

Modules used

requests >>> pip install requests

parsel >>> pip install parsel

Proxy IP structure

# ip and port are the address and port strings scraped from the page
proxies_dict = {
    "http": "http://" + ip + ":" + port,
    "https": "http://" + ip + ":" + port,
}
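The structure above can be wrapped in a small helper so that every proxy is built the same way. This is only a minimal sketch: the function name build_proxies and the sample address are made up for illustration and do not come from the original article or from requests.

```python
def build_proxies(ip, port):
    """Return a requests-style proxies mapping for one proxy server.

    build_proxies is a hypothetical helper name, not part of requests.
    """
    proxy = f"{ip}:{port}"
    # Both schemes are routed through the same HTTP proxy, as in the structure above.
    return {
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    }


example = build_proxies("10.0.0.1", "8080")  # made-up address, for illustration only
print(example)
```

The returned dictionary can be passed directly as the proxies argument of requests.get().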

Implementation steps

1. Import the modules

# HTTP request module (third-party): pip install requests
import requests
# Regular expression module (built in)
import re
# HTML parsing module (third-party): pip install parsel; Scrapy's selectors are built on it
import parsel

2. Send the request

Send a request to the target site, https://www.kuaidaili.com/free/

page = 1  # page number to request
url = f'https://www.kuaidaili.com/free/inha/{page}/'  # URL of the page to request
# Call requests.get() on the URL and store the returned response object in response
response = requests.get(url)
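Because the URL contains a page number, several pages can be requested in a loop. The sketch below is one possible way to do that, not the author's code: the names page_urls and fetch_pages, the one-second pause, and the five-second timeout are all assumptions. The function that performs the request is passed in as an argument (for example requests.get), which keeps the sketch easy to try without touching the network.

```python
import time

# URL pattern taken from the article; {page} is the page number.
BASE_URL = 'https://www.kuaidaili.com/free/inha/{page}/'


def page_urls(first_page, last_page):
    """Build the list of page URLs from first_page to last_page inclusive."""
    return [BASE_URL.format(page=page) for page in range(first_page, last_page + 1)]


def fetch_pages(urls, get, delay=1.0):
    """Fetch every URL with the supplied callable and return the page sources.

    `get` is expected to behave like requests.get: it accepts a URL and a
    timeout keyword and returns an object with a .text attribute.
    The delay between requests is an assumed courtesy pause.
    """
    pages = []
    for index, url in enumerate(urls):
        if index:
            time.sleep(delay)  # pause between consecutive requests
        response = get(url, timeout=5)
        pages.append(response.text)
    return pages
```

With requests imported, a call such as fetch_pages(page_urls(1, 3), requests.get) would return the HTML of the first three pages as a list of strings.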

3. Get the data

Get the response data returned by the server (the page source)

print(response.text)

4. Parse the data

Extract the content we want.

Ways to parse the data:

Regular expressions: extract content directly from the string
XPath: extract content by tag node
CSS selectors: extract content by tag attributes

Use whichever is most convenient, or simply whichever you prefer.

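Extracting data with regular expressions

To extract data with a regular expression, call re.findall() from the re module.

When in doubt, use .*? in the pattern. It matches any character except the newline \n; passing re.S makes it match newlines as well.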

ip_list = re.findall('<td data-title="IP">(.*?)</td>', response.text, re.S)
port_list = re.findall('<td data-title="PORT">(.*?)</td>', response.text, re.S)
print(ip_list)
print(port_list)
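The effect of re.S is easiest to see on a cell whose content spans several lines. The HTML fragment below is invented for this demonstration and is not copied from the target site; only the pattern is the one used above.

```python
import re

# Made-up fragment: the IP cell is spread over three lines.
sample_html = '''
<tr>
  <td data-title="IP">
    10.0.0.1
  </td>
  <td data-title="PORT">8080</td>
</tr>
'''

pattern = '<td data-title="IP">(.*?)</td>'

# Without re.S the dot cannot cross a newline, so the multi-line cell is missed.
without_flag = re.findall(pattern, sample_html)
# With re.S the dot also matches newlines and the cell is captured.
with_flag = re.findall(pattern, sample_html, re.S)
# The captured text still carries the surrounding whitespace, so strip it.
ips = [ip.strip() for ip in with_flag]

print(without_flag)
print(ips)
```

Stripping the captured text is worth doing whenever re.S is used, since the match then includes any line breaks and indentation inside the tag.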

CSS selectors

To extract data with CSS selectors, the downloaded HTML string (response.text) first has to be converted into a Selector object.

# CSS path and XPath of the IP cells:
# #list > table > tbody > tr > td:nth-child(1)
# //*[@id="list"]/table/tbody/tr/td[1]
selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
ip_list = selector.css('#list tbody tr td:nth-child(1)::text').getall()
port_list = selector.css('#list tbody tr td:nth-child(2)::text').getall()
print(ip_list)
print(port_list)

Extracting data with XPath

selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
ip_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[1]/text()').getall()
port_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[2]/text()').getall()

Extracting the IPs

for ip, port in zip(ip_list, port_list):
    # print(ip, port)
    proxy = ip + ':' + port
    proxies_dict = {
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    }
    print(proxies_dict)
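Scraped values can contain stray whitespace, malformed entries, or duplicates, so it may help to clean them while pairing them up. The following is a sketch under that assumption: build_proxy_list is a hypothetical name, the validation rules are my own choice rather than something the article prescribes, and the sample values are made up.

```python
import ipaddress


def build_proxy_list(ip_list, port_list):
    """Pair IPs with ports and return a list of proxies dictionaries.

    Entries with an invalid IP address, an invalid port, or a repeated
    ip:port pair are skipped.
    """
    proxies = []
    seen = set()
    for ip, port in zip(ip_list, port_list):
        ip, port = ip.strip(), port.strip()
        try:
            ipaddress.ip_address(ip)  # raises ValueError for malformed addresses
        except ValueError:
            continue
        if not port.isdecimal() or not 0 < int(port) < 65536:
            continue
        proxy = ip + ':' + port
        if proxy in seen:
            continue
        seen.add(proxy)
        proxies.append({
            "http": "http://" + proxy,
            "https": "http://" + proxy,
        })
    return proxies


# Made-up sample values: one padded entry, one malformed IP, one duplicate.
cleaned = build_proxy_list([' 10.0.0.1 ', 'not-an-ip', '10.0.0.1'],
                           ['8080', '80', '8080'])
print(cleaned)
```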

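5. Check the proxy quality

# This block runs inside the for loop above, once for each proxies_dict.
# lis and lis_1 must be created as empty lists before the loop:
# lis holds every proxy collected, lis_1 holds the proxies that pass the check.
lis.append(proxies_dict)
try:
    response = requests.get(url=url, proxies=proxies_dict, timeout=1)
    if response.status_code == 200:
        print('Current proxy IP: ', proxies_dict, 'is usable')
        lis_1.append(proxies_dict)
except requests.exceptions.RequestException:
    print('Current proxy IP: ', proxies_dict, 'request failed or timed out, check failed')

# After the loop has finished:
print('Number of proxy IPs collected: ', len(lis))
print('Number of usable proxy IPs: ', len(lis_1))
print('Usable proxy IPs: ', lis_1)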
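Checking proxies one at a time is slow, because every dead proxy costs a full timeout. One possible alternative, sketched below, runs the checks in a thread pool. The names is_proxy_usable and filter_usable, the three-second timeout, and the pool size are assumptions, not part of the original code. The request function is passed in as get (for example requests.get); the sketch catches OSError, which is the base class of requests.exceptions.RequestException.

```python
from concurrent.futures import ThreadPoolExecutor


def is_proxy_usable(proxies_dict, test_url, get, timeout=3):
    """Return True if a request through the proxy answers with status 200.

    `get` is expected to behave like requests.get.
    """
    try:
        response = get(test_url, proxies=proxies_dict, timeout=timeout)
    except OSError:
        # Covers connection errors and timeouts raised by requests.
        return False
    return response.status_code == 200


def filter_usable(candidates, test_url, get, max_workers=10):
    """Check all candidate proxies concurrently and keep the usable ones."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the candidates.
        results = list(pool.map(
            lambda proxies: is_proxy_usable(proxies, test_url, get),
            candidates,
        ))
    return [proxies for proxies, ok in zip(candidates, results) if ok]
```

With requests imported and the collected proxies stored in lis, a call such as filter_usable(lis, url, requests.get) would return the same kind of list that lis_1 holds in the code above.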