A Detailed Guide to Multithreaded Meme Scraping in Python

Updated: 2021-11-25 16:37:54  Author: 魔王不會哭

This article shows how to scrape meme images with multithreaded Python, walking through working example code step by step. It may be a useful reference for study or work.

Course highlights

Systematic analysis of the target web page

Methods for parsing HTML tag data

Saving large numbers of images in one step

Environment

Python 3.8

PyCharm

Modules used

requests >>> pip install requests

parsel >>> pip install parsel

time: standard-library module, used to record the running time

Workflow

Part 1. Work out where the data we want can be obtained

Memes >>> image URL and image name

Using the browser's developer tools >>>

Part 2. Implementation steps

1. Send the request

Determine the URL to request

Determine the request method: GET or POST

Request header parameters: anti-hotlinking (the Referer header), cookie …

2. Get the data

Get the data returned by the server

response.text returns the body as text

response.json() returns the body parsed from JSON (for example, a dict)

response.content returns the raw bytes; saving images, audio, video, or other binary file formats always uses the raw bytes
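The relationship between these three accessors can be seen without making a network call. The sketch below is only an illustration built on the standard library: the payload is invented for the example, the bytes stand in for what response.content would hold, decoding them mimics response.text, and json.loads mimics response.json().

```python
import json

# Hypothetical payload: stands in for the raw bytes a server might return.
raw_bytes = '{"title": "doge", "pages": 10}'.encode('utf-8')

# Roughly what response.content holds: undecoded bytes, suitable for writing
# to a file opened in 'wb' mode (images, audio, video).
content = raw_bytes

# Roughly what response.text holds: the bytes decoded into a str.
text = content.decode('utf-8')

# Roughly what response.json() returns: the text parsed into Python objects.
data = json.loads(text)

print(type(content).__name__, type(text).__name__, type(data).__name__)
```

For image downloads only the bytes form matters, which is why the code later in this article reads .content before writing the file.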

3. Parse the data

Extract the data we want

I. Data that can be used directly, without parsing

II. JSON data: look up values by key

III. re regular expressions

IV. CSS selectors

V. XPath
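Option III can be sketched with the standard library alone. The HTML fragment below is invented for illustration (it only imitates the class and attribute names used later in this article), and the pattern assumes the attributes always appear in this order, which is one reason a real parser such as parsel tends to be more robust.

```python
import re

# Invented HTML fragment that imitates the attributes used later in this article.
html = '''
<img class="ui image lazy" data-original="https://example.com/a/one.jpg" title="first meme">
<img class="ui image lazy" data-original="https://example.com/b/two.gif" title="second meme">
'''

# Capture the image URL and the title from each <img> tag.
# Assumption: data-original always comes before title inside the tag.
pattern = re.compile(r'<img[^>]*?data-original="(.*?)"[^>]*?title="(.*?)"')

# findall returns one (url, title) tuple per matching tag.
pairs = pattern.findall(html)
for img_url, title in pairs:
    print(title, img_url)
```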

4. Save the data

Text

CSV

Database

Local folder

Import modules

import requests  # HTTP request module; third-party: pip install requests
import parsel  # data-parsing module; third-party: pip install parsel
import re  # regular-expression module
import time  # time module
import concurrent.futures  # thread and process pools

Single-threaded crawl of 10 pages

1. Send the request

start_time = time.time()

for page in range(1, 11):
    url = f'https://fabiaoqing.com/biaoqing/lists/page/{page}.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    # <Response [200]> is the response object; status code 200 means the request succeeded

2. Get the data: the text data / page source

    # The Elements panel of the developer tools may show tags that are absent from
    # the data actually returned to our request, so extraction must be based on
    # what the server returns.
    # parsel is the parsing module used here; it also exposes XPath parsing.
    # print(response.text)

3. Parse the data

    # bs4 parses somewhat more slowly; to pull values straight out of a raw
    # string, regular expressions are the only option.
    selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
    title_list = selector.css('.ui.image.lazy::attr(title)').getall()
    img_list = selector.css('.ui.image.lazy::attr(data-original)').getall()
    # Walk the two lists in parallel with a for loop, one pair at a time
    for title, img_url in zip(title_list, img_list):

4. Save the data

        # split() splits the string; index the resulting list to pick a piece
        # img_name_1 = img_url[-3:]  # alternative: slice the string
        # Indexes run from 0 left to right, and from -1 right to left
        # print(title, img_url)
        title = re.sub(r'[\\/:*?"<>|\n]', '_', title)  # replace characters not allowed in file names
        # Note: a title that is too long will raise an error when the file is opened
        img_name = img_url.split('.')[-1]  # file extension: the last piece after splitting on '.'
        img_content = requests.get(url=img_url).content  # raw bytes of the image
        with open('img/' + title + '.' + img_name, mode='wb') as f:  # the img folder must already exist
            f.write(img_content)
        print(title)

end_time = time.time()
print('Elapsed time (s): ', int(end_time - start_time))
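The file-name handling in step 4 can be isolated and checked without any network access. The sketch below is an illustration, not part of the original program: build_filename and its max_len parameter are hypothetical names, and truncating the title is just one possible way to avoid the "name too long" error noted in the comments.

```python
import re

def build_filename(title, img_url, max_len=100):
    """Hypothetical helper: turn a title and an image URL into a safe file name."""
    # Replace characters that Windows does not allow in file names.
    safe_title = re.sub(r'[\\/:*?"<>|\n]', '_', title)
    # Truncate long titles; max_len=100 is an arbitrary choice for this sketch.
    safe_title = safe_title[:max_len]
    # The extension is the last piece after splitting the URL on '.'.
    extension = img_url.split('.')[-1]
    return safe_title + '.' + extension

print(build_filename('what?/why:now', 'https://example.com/img/abc.jpg'))
# what__why_now.jpg
```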

Multithreaded crawl of 10 pages

def get_response(html_url):
    """Send the request and return the response object."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response

def get_img_info(html_url):
    """Get the image URLs and the image names."""
    response = get_response(html_url)
    selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
    title_list = selector.css('.ui.image.lazy::attr(title)').getall()
    img_list = selector.css('.ui.image.lazy::attr(data-original)').getall()
    zip_data = zip(title_list, img_list)
    return zip_data

def save(title, img_url):
    """Save one image to the img folder (the folder must already exist)."""
    title = re.sub(r'[\\/:*?"<>|\n]', '_', title)  # replace characters not allowed in file names
    # Note: a title that is too long will raise an error when the file is opened
    img_name = img_url.split('.')[-1]  # file extension: the last piece after splitting on '.'
    img_content = requests.get(url=img_url).content  # raw bytes of the image
    with open('img/' + title + '.' + img_name, mode='wb') as f:
        f.write(img_content)
    print(title)
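get_img_info returns a zip object, which has two properties worth knowing: it can be iterated only once, and it silently stops at the shorter input if the two selectors return different numbers of items. A minimal sketch of a stricter pairing step follows; pair_titles_and_urls is a hypothetical helper, not something the original code defines.

```python
def pair_titles_and_urls(title_list, img_list):
    """Hypothetical helper: pair titles with URLs, refusing mismatched lists."""
    if len(title_list) != len(img_list):
        raise ValueError(
            f'{len(title_list)} titles but {len(img_list)} image URLs'
        )
    # A list can be iterated many times; a bare zip object only once.
    return list(zip(title_list, img_list))

pairs = pair_titles_and_urls(['a', 'b'], ['1.jpg', '2.gif'])
print(pairs)
```

Whether to raise or merely log on a mismatch is a design choice; raising makes a changed page layout visible immediately instead of producing mislabeled files.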

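Multi-process crawl of 10 pages

def main(html_url):
    """Crawl one list page and save every image on it."""
    zip_data = get_img_info(html_url)
    for title, img_url in zip_data:
        save(title, img_url)

if __name__ == '__main__':
    start_time = time.time()
    # ThreadPoolExecutor runs the pages in threads; replacing it with
    # concurrent.futures.ProcessPoolExecutor(max_workers=10) gives the
    # multi-process variant with the same submit/shutdown calls.
    exe = concurrent.futures.ThreadPoolExecutor(max_workers=10)
    for page in range(1, 11):
        # 1. Send the request
        url = f'https://fabiaoqing.com/biaoqing/lists/page/{page}.html'
        exe.submit(main, url)
    exe.shutdown()  # wait for all submitted tasks to finish
    end_time = time.time()
    use_time = int(end_time - start_time)
    print('Elapsed time (s): ', use_time)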
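Why a thread pool helps here can be shown without touching the network. In the sketch below, fake_download is a hypothetical stand-in for main(url) that sleeps instead of waiting on a server, and the 0.05-second delay is an assumed value chosen only to keep the demonstration short. Because the threads spend their time waiting rather than computing, ten of them finish in roughly the time one task takes.

```python
import concurrent.futures
import time

def fake_download(page):
    """Stand-in for main(url): sleeps instead of doing network I/O."""
    time.sleep(0.05)  # assumed delay, simulating time spent waiting on the server
    return page

def run_sequential(pages):
    """Process the pages one after another and time the run."""
    start = time.time()
    results = [fake_download(page) for page in pages]
    return results, time.time() - start

def run_threaded(pages, max_workers=10):
    """Process the pages in a thread pool and time the run."""
    start = time.time()
    # Leaving the with-block waits for all tasks, like exe.shutdown().
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as exe:
        results = list(exe.map(fake_download, pages))  # map keeps input order
    return results, time.time() - start

pages = list(range(1, 11))
seq_results, seq_time = run_sequential(pages)
thr_results, thr_time = run_threaded(pages)
print(f'sequential: {seq_time:.2f}s, threaded: {thr_time:.2f}s')
```

The sketch uses exe.map rather than exe.submit so that the results come back in page order; the article's submit-based loop is equivalent when return values are not needed.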

Single-threaded crawl of 10 pages: 61 seconds

Multithreaded crawl of 10 pages: 19 seconds >>> 13

Multi-process crawl of 10 pages: 21 seconds >>> 18
