
Asynchronous Crawlers in Python: A Detailed Guide

Updated: 2023-08-09 10:53:12 Author: 南岸青梔*

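This article introduces asynchronous crawlers in Python. Asynchronous IO means starting a blocking IO operation without waiting for it to finish: the program carries on with other work while the IO is in progress and is notified once the IO completes. Readers who need this can use the article as a reference.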
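
The definition above can be illustrated with a minimal sketch using the standard `asyncio` module. This is only an illustration, not the method used in the rest of the article (which relies on thread pools); `fake_io` is a hypothetical helper, and `asyncio.sleep` stands in for a real network request.

```python
import asyncio

async def fake_io(label, delay, log):
    # asyncio.sleep stands in for a slow IO call such as a network request
    await asyncio.sleep(delay)
    log.append(f'{label} done')
    return label

async def main():
    log = []
    # Start the IO operation without waiting for it to finish
    task = asyncio.create_task(fake_io('download', 0.2, log))
    # Do other work while the IO is still in flight
    log.append('doing other work')
    # Awaiting the task is how the caller learns that the IO has completed
    result = await task
    return log, result

log, result = asyncio.run(main())
print(log)
```

The "other work" entry is logged first because the task only begins running once `main` reaches its first `await`.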

Python asynchronous crawlers

Basic concepts

Goal: use asynchronous execution in a crawler to scrape data with high performance.

Ways to build an asynchronous crawler:

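Multithreading or multiprocessing (not recommended):
- Advantage: each blocking operation gets its own thread or process, so blocking operations run asynchronously.
- Drawback: threads and processes cannot be created without limit.
Thread pools or process pools (use in moderation):
- Advantage: the system creates and destroys threads or processes less often, which noticeably lowers overhead.
- Drawback: the number of threads or processes in a pool has an upper limit.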
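
The trade-off between the two approaches can be sketched as follows. This is an illustrative comparison under simple assumptions, not code from the original article: `work`, `thread_per_task`, and `pooled` are hypothetical helper names, and `time.sleep` stands in for a blocking request. The sketch counts how many distinct threads end up doing the work in each approach.

```python
import threading
import time
from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

def work(item, seen, lock):
    # Simulate a blocking operation, then record which thread handled it
    time.sleep(0.05)
    with lock:
        seen.add(threading.current_thread().name)

def thread_per_task(items):
    # One new thread per item: the thread count grows with the workload
    seen, lock = set(), threading.Lock()
    threads = [threading.Thread(target=work, args=(i, seen, lock)) for i in items]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(seen)

def pooled(items, size):
    # A fixed set of worker threads is reused for every item
    seen, lock = set(), threading.Lock()
    with Pool(size) as pool:
        pool.map(lambda i: work(i, seen, lock), items)
    return len(seen)

print('threads used, one per task:', thread_per_task(range(20)))
print('threads used, pool of 4:', pooled(range(20), 4))
```

With one thread per task the number of threads equals the number of tasks, while the pool never uses more than its configured size, which is both its benefit (bounded overhead) and its limit (bounded concurrency).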

Basic usage of a thread pool

# import time
# # Run serially in a single thread
# start_time = time.time()
# def get_page(name):
#     print('Downloading:', name)
#     time.sleep(2)
#     print('Download finished:', name)
#
# name_list = ['haha','lala','duoduo','anan']
#
# for name in name_list:
#     get_page(name)
#
# end_time = time.time()
# print(end_time-start_time)
import time
from multiprocessing.dummy import Pool
# Run with a thread pool
start_time = time.time()
def get_page(name):
    print('Downloading:', name)
    time.sleep(2)
    print('Download finished:', name)
name_list = ['haha','lala','duoduo','anan']
# Create a pool of 4 worker threads
pool = Pool(4)
# Call get_page on every item of name_list; blocks until all calls return
pool.map(get_page,name_list)
end_time = time.time()
print(end_time-start_time)
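
To compare the two versions side by side in a single run, a sketch along the following lines may help. It is an assumption-laden variant of the code above rather than the author's script: the simulated delay is shortened to 0.2 seconds, `get_page` returns its argument so the results can be inspected, and `timed` and `run_pool` are hypothetical helpers.

```python
import time
from multiprocessing.dummy import Pool

def get_page(name, delay=0.2):
    # time.sleep stands in for the blocking download
    time.sleep(delay)
    return name

def timed(run):
    # Call run() and return its result together with the elapsed seconds
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start

name_list = ['haha', 'lala', 'duoduo', 'anan']

# Serial: each call waits for the previous one to finish
serial_result, serial_time = timed(lambda: [get_page(n) for n in name_list])

def run_pool():
    # Pool: up to 4 calls sleep at the same time
    with Pool(4) as pool:
        return pool.map(get_page, name_list)

pool_result, pool_time = timed(run_pool)
print(f'serial: {serial_time:.2f}s, pool: {pool_time:.2f}s')
```

`pool.map` returns the results in the same order as the input list, so both versions produce the same list; only the elapsed time differs.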

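Results

Single-threaded serial execution (the four 2-second tasks run one after another, so the script takes about 8 seconds)

[Image: screenshot of the run]

Thread pool (the four tasks run concurrently on 4 threads, so the script takes about 2 seconds)

[Image: screenshot of the run]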

Target URL: https://www.pearvideo.com/category_6

Code

import random
import requests
from lxml import etree
from multiprocessing.dummy import Pool
urls = [] # list of dicts, each holding a video name and its URL
# Request the fake (placeholder) video URL
def get_videoadd(detail_url,video_id):
    ajks_url = 'https://www.pearvideo.com/videoStatus.jsp'
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Referer': detail_url
    }
    params = {
        'contId': video_id,
        'mrd': str(random.random())
    }
    video_json = requests.post(headers=header,url=ajks_url,params=params).json()
    return video_json['videoInfo']['videos']['srcUrl']
# Download the video data and save it to disk
def get_videoData(dic):
    right_url = dic['url']
    print(dic['name'],'start!')
    video_data = requests.get(url=right_url,headers=headers).content
    with open(dic['name'],'wb') as fp:
        fp.write(video_data)
    print(dic['name'],'over!')
if __name__ == '__main__':
    url = 'https://www.pearvideo.com/category_6'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    page_text = requests.get(url=url,headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="listvideoListUl"]/li')
    for li in li_list:
        detail_url = 'https://www.pearvideo.com/'+li.xpath('./div/a/@href')[0]
        name = li.xpath('./div/a/div[2]/text()')[0]+'.mp4'
        # Parse the video ID from the detail-page URL
        video_id = detail_url.split('/')[-1].split('_')[-1]
        false_url = get_videoadd(detail_url,video_id)
        temp = false_url.split('/')[-1].split('-')[0]
        # Build the correct URL
        right_url = false_url.replace(temp,'cont-'+str(video_id))
        dic = {
            'name':name,
            'url':right_url
        }
        urls.append(dic)
    # Download the videos with a thread pool
    pool = Pool(4)
    pool.map(get_videoData,urls)
    # Close the pool so that no further tasks can be submitted
    pool.close()
    # Wait for the worker threads to exit
    pool.join()
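
The ID parsing and URL rewriting steps in the loop above can be tried in isolation. The sketch below repeats the same string operations on made-up inputs: both URLs are hypothetical examples chosen only to show the transformation, the real response format of the site is not shown in this article, and `parse_video_id` and `fix_video_url` are hypothetical helper names.

```python
def parse_video_id(detail_url):
    # Take the last path segment and keep what follows the last underscore
    return detail_url.split('/')[-1].split('_')[-1]

def fix_video_url(false_url, video_id):
    # The first dash-separated piece of the file name is swapped for 'cont-<id>'
    file_name = false_url.split('/')[-1]
    temp = file_name.split('-')[0]
    return false_url.replace(temp, 'cont-' + str(video_id))

# Hypothetical inputs, for illustration only
detail_url = 'https://www.pearvideo.com/video_1234567'
false_url = 'https://video.example.com/mp4/20230809/1691549592000-15654321-hd.mp4'

video_id = parse_video_id(detail_url)
right_url = fix_video_url(false_url, video_id)
print(video_id)
print(right_url)
```

Note that `str.replace` substitutes every occurrence of `temp` in the URL, as in the original code, so this approach assumes the replaced piece does not appear anywhere else in the address.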

Results

[Image: screenshot of the run]