Scheduled Scraping of Douyin Hot Searches with Python

Updated: 2022-03-15 15:05:28 Author: Python丁小杰

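This article introduces a new tool for slacking off at work, built with Python, that scrapes the Douyin hot search list on a schedule. The implementation steps are explained in detail, so give it a try if you are interested.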

Douyin Hot Search List

Link: https://tophub.today/n/K7GdaMgdQy

The full list contains 50 entries. This walkthrough scrapes four fields for each one: rank, heat, title, and link.

Scraping with requests

requests is the simplest approach. Because this page has no anti-scraping measures, a plain GET request for the page is enough.

import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}
url = 'https://tophub.today/n/K7GdaMgdQy'
page_text = requests.get(url=url, headers=headers).text
page_text

As you can see, a few lines of code are all it takes to fetch the data.
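For anything beyond a quick notebook check, it may help to wrap the request in a small function with a timeout and a status check. The sketch below is one possible shape, not the author's code: `fetch_html`, `DEFAULT_HEADERS`, the `session` parameter, and the 10-second timeout are names and values introduced here purely for illustration.

```python
import requests

# Hypothetical default headers; the User-Agent mirrors the one used above.
DEFAULT_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}


def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, raising an exception on HTTP errors or timeouts."""
    # Accepting a session keeps the function easy to test and reuses connections.
    session = session or requests.Session()
    response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

It could then be called as `page_text = fetch_html('https://tophub.today/n/K7GdaMgdQy')`. Failing loudly on a bad status code matters for a job that runs unattended, since an error page would otherwise be parsed as if it were the list.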

Scraping with selenium

Configure selenium to run as a headless browser, open the target URL, and read the page source.

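from selenium import webdriver

# Run Chrome without opening a visible window
option = webdriver.ChromeOptions()
option.add_argument('--headless')

driver = webdriver.Chrome(options=option)

url = 'https://tophub.today/n/K7GdaMgdQy'
driver.get(url)

page_text = driver.page_source
driver.quit()  # release the browser process once the HTML is captured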

Both approaches fetch the data successfully, but requests is more concise and runs faster overall. If the page data is not loaded dynamically, requests is the more convenient choice.
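One rough way to decide between the two is to check whether the HTML returned by a plain request already contains the table rows. The helper below is a hypothetical sketch: `has_static_rows` and the generic `//table//tr` XPath are assumptions made for illustration, and an empty result only suggests (it does not prove) that the rows are rendered by JavaScript.

```python
from lxml import etree


def has_static_rows(html, row_xpath='//table//tr'):
    """Return True if the raw HTML already contains at least one matching row."""
    if not html or not html.strip():
        return False  # nothing to parse
    tree = etree.HTML(html)
    if tree is None:  # defensive: no tree could be built from the input
        return False
    return len(tree.xpath(row_xpath)) > 0
```

If this returns `True` for the text fetched with requests, the headless browser is probably unnecessary for that page.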

Parsing the data

Now parse the scraped HTML with the lxml library and save the result to Excel.

from lxml import etree

tree = etree.HTML(page_text)

# Each <tr> in the list table is one hot-search entry
tr_list = tree.xpath(
    '//*[@id="page"]/div[2]/div[2]/div[1]/div[2]/div/div[1]/table/tbody/tr')

# Collect rows in a list first; DataFrame.append was removed in pandas 2.0
rows = []
for index, tr in enumerate(tr_list):
    hot = tr.xpath('./td[3]/text()')[0]
    title = tr.xpath('./td[2]/a/text()')[0]
    article_url = tr.xpath('./td[2]/a/@href')[0]
    rows.append({
        'rank': index + 1,
        'heat': hot,
        'title': title,
        'link': article_url})
df = pd.DataFrame(rows, columns=['rank', 'heat', 'title', 'link'])
# The hrefs are relative, so prefix the site root
df['link'] = 'https://tophub.today' + df['link']
df
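The parsing step can also be packaged as a function and tried offline against a small sample. The following is a minimal sketch under stated assumptions: `parse_hot_list`, `BASE_URL`, and `SAMPLE_HTML` are hypothetical names, the sample markup only imitates the column layout used above (`td[2]` holds the linked title, `td[3]` the heat), and the shorter `//table//tr` XPath is swapped in for the long absolute path.

```python
import pandas as pd
from lxml import etree

BASE_URL = 'https://tophub.today'

# Hypothetical sample markup; the real page may be structured differently.
SAMPLE_HTML = '''
<table>
  <tr><td>1.</td><td><a href="/l?e=abc">First topic</a></td><td>11312000</td></tr>
  <tr><td>2.</td><td><a href="/l?e=def">Second topic</a></td><td>9876000</td></tr>
  <tr><td>3.</td><td>Row without a link</td><td></td></tr>
</table>
'''


def parse_hot_list(html):
    """Return a DataFrame with rank, heat, title and absolute link per entry."""
    tree = etree.HTML(html)
    rows = []
    for tr in tree.xpath('//table//tr'):
        title = tr.xpath('./td[2]/a/text()')
        href = tr.xpath('./td[2]/a/@href')
        hot = tr.xpath('./td[3]/text()')
        if not title or not href:
            continue  # skip rows that do not match the expected layout
        rows.append({
            'rank': len(rows) + 1,           # position among the rows kept so far
            'heat': hot[0] if hot else '',   # tolerate a missing heat cell
            'title': title[0],
            'link': BASE_URL + href[0],      # hrefs are assumed to be relative
        })
    return pd.DataFrame(rows, columns=['rank', 'heat', 'title', 'link'])


df = parse_hot_list(SAMPLE_HTML)
```

Checking for empty XPath results avoids the `IndexError` that `[0]` raises when a row lacks a link. To write the result to Excel, something like `df.to_excel('douyin_hot.xlsx', index=False)` should work; the filename is only an example, and pandas needs the openpyxl package installed for `.xlsx` output.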

Results