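Building a Weibo Crawler with Selenium (Pre-login, Expanding Full Text, Pagination)

Updated: 2021-04-13 08:37:31  Author: 小粒子学code

This article shows how to build a Weibo crawler with Selenium, covering pre-login, expanding full text, and pagination. The example code is walked through step by step and may be a useful reference for study or work.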

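Preface

This is my first article on CSDN. After two years, I can finally crawl Weibo freely! This article covers Weibo pre-login, detecting the "展开全文" (expand full text) link and crawling the complete text, and pagination. I am new to crawling, so some terms may be used incorrectly; corrections are welcome.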

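I. Distinguishing Dynamic and Static Crawlers

1. Static pages
A static page is plain HTML, with no backend database, no embedded programs, and no interactivity; it is small and loads quickly. Crawling a static page takes only four steps: send the request, get the response content, parse the content, and save the data.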
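
The four steps can be sketched with the standard library alone. This is only a minimal illustration and not the method used later in this article: the TitleParser class, the parse_titles and crawl_static functions, and the choice of <h2> as the target tag are all hypothetical.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside every <h2> tag (a hypothetical target field)."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.titles[-1] += data


def parse_titles(html):
    """Step 3: parse the HTML and return the stripped <h2> texts."""
    parser = TitleParser()
    parser.feed(html)
    return [t.strip() for t in parser.titles]


def crawl_static(url, out_path):
    """Steps 1-4 for a static page: request, read the response, parse, save."""
    with urlopen(url) as resp:                       # 1. send the request
        html = resp.read().decode("utf-8")           # 2. get the response content
    titles = parse_titles(html)                      # 3. parse the content
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([t] for t in titles)  # 4. save the data
    return titles
```

Because everything of interest is already in the returned HTML, no browser is needed for this kind of page.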

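2. Dynamic pages
The data on a dynamic page changes over time and with user interaction, so it does not appear directly in the page source; it is usually loaded separately, often as JSON. Crawling a dynamic page therefore takes one more step than a static page: the page has to be rendered to obtain the data.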

3. How to tell static and dynamic pages apart
After the page loads, right-click and choose "View page source". If the vast majority of the fields shown on the page appear in the source, the page is static; otherwise it is dynamic.
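
The same check can be roughed out in code. The sketch below assumes you have already copied a few values you can see on the rendered page; the function names and the 0.8 threshold (standing in for "the vast majority") are my own assumptions.

```python
def fraction_in_source(page_source, visible_fields):
    """Return the share of visible fields that literally occur in the HTML source."""
    if not visible_fields:
        return 0.0
    hits = sum(1 for field in visible_fields if field in page_source)
    return hits / len(visible_fields)


def looks_static(page_source, visible_fields, threshold=0.8):
    """Heuristic: treat the page as static when most visible fields are in the source.

    The default threshold of 0.8 is an arbitrary assumption.
    """
    return fraction_in_source(page_source, visible_fields) >= threshold


static_html = "<ul><li>Alice</li><li>Bob</li></ul>"
dynamic_html = '<div id="app"></div><script src="bundle.js"></script>'
fields = ["Alice", "Bob"]
print(looks_static(static_html, fields))
print(looks_static(dynamic_html, fields))
```

This is only a heuristic: a page can embed its data in a script tag as JSON, in which case the text is in the source but still needs extra parsing.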


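II. Two Ways to Crawl Dynamic Pages

1. Reverse-engineering the requests
This approach applies when JavaScript triggers requests for resources whose URLs return data in JSON format. The main steps are to find the URL of the resource you need, then request that URL to get the data. (Not covered in detail here.)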
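
Although this article does not use this approach, the two steps can be sketched roughly as follows. The JSON layout (the "data", "items", "user", and "text" keys) is made up for illustration, and the real URL has to be found in the Network panel of the browser's developer tools.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url):
    """Request the resource URL found in the browser's Network panel and decode it."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_posts(payload):
    """Pull (user, text) pairs out of a decoded payload.

    The "data" / "items" / "user" / "text" keys are a hypothetical layout.
    """
    items = payload.get("data", {}).get("items", [])
    return [(item["user"], item["text"]) for item in items]


# A hand-written payload standing in for a real response
sample = json.loads('{"data": {"items": [{"user": "u1", "text": "hello"}]}}')
print(extract_posts(sample))
```

In practice such endpoints often also require cookies or extra headers, which is one reason the Selenium route below can be easier to get working.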

2. Crawling with the Selenium library
Selenium drives a real browser through a browser driver and simulates the actions of a real user. This article uses this method.

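III. Installing Selenium and Downloading the Browser Driver

1. Install the Selenium library with pip (pip install selenium).
2. Download the browser driver (ChromeDriver) that matches your Chrome version.
Step 1: Check the Chrome version (for example, open chrome://version in the address bar).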


Step 2: Download the driver for that version from http://npm.taobao.org/mirrors/chromedriver/
Step 3: Unzip the file and put it in the same folder as python.exe.
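
Putting the driver next to python.exe works when that folder is on the system PATH, which depends on how Python was installed. A quick way to check whether the driver can be found is sketched below; the find_chromedriver helper is my own name, not part of Selenium.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver not found on PATH; add its folder to PATH or move the file")
else:
    print("chromedriver found at", path)
```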


IV. Opening the Page and Pre-login

1. Import the selenium packages

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
import pandas as pd

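2. Open the page

driver = webdriver.Chrome()
print('Preparing to log in to Weibo.cn...')
# Send the request
driver.get("https://login.sina.com.cn/signup/signin.php")
# Create a wait object with a 5-second timeout (it does not pause by itself)
wait = WebDriverWait(driver,5)
# Important: pause for 1 minute for pre-login; enter the account, password, and verification code in the browser during this time
time.sleep(60)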

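3. Run the code interactively. After the two snippets above have run, a browser window controlled by Selenium opens. Complete the login in this window (user name, password, SMS verification, and so on).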


4. Once pre-login is complete, the browser lands on your personal home page.
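
A fixed 60-second pause is either too long or too short. One alternative is to poll the browser's address until it has left the login page. The helper below is a sketch under the assumption that the URL no longer contains "signin" after a successful login; I have not verified that against the site, and the wait_for_login name is my own.

```python
import time


def wait_for_login(driver, login_url_part="signin", timeout=120, poll=1.0):
    """Block until the browser has left the login page or `timeout` seconds pass.

    Returns True once `login_url_part` no longer appears in driver.current_url,
    False on timeout. The "signin" marker is an assumption based on the login URL.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if login_url_part not in driver.current_url:
            return True
        time.sleep(poll)
    return False
```

It would be called as wait_for_login(driver) in place of time.sleep(60); it only reads driver.current_url, so it works with any Selenium driver object.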


V. Keyword Search

1. Locate the keyword input box on the home page and type the search term, for example "努力学习" (study hard)

# Locate the keyword search box with a CSS selector
s_input = driver.find_element(By.CSS_SELECTOR, '#search_input')
# Type the search term into the box
s_input.send_keys("努力学习")
# Locate the search button
confirm_btn = driver.find_element(By.CSS_SELECTOR, '#search_submit')
# Click it
confirm_btn.click()

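2. After the code above has run, a new window opens and the browser goes from the personal home page to the Weibo search page. The driver, however, is still attached to the home page window, so it has to be switched to the search page manually.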


3. Switch with the switch_to.window() method

# Switch the driver to the new window (the second handle)
driver.switch_to.window(driver.window_handles[1])
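
Index 1 works when exactly one new window has opened. A slightly more defensive variant is to remember the handles that existed before the click and switch to whichever one is new. The switch_to_new_window helper below is a hypothetical sketch; it relies only on driver.window_handles and driver.switch_to.window().

```python
def switch_to_new_window(driver, old_handles):
    """Switch to the first window handle that is not in `old_handles`.

    Returns the new handle, or None if no new window has opened yet.
    """
    for handle in driver.window_handles:
        if handle not in old_handles:
            driver.switch_to.window(handle)
            return handle
    return None
```

To use it, store old = set(driver.window_handles) before clicking the search button, then call switch_to_new_window(driver, old) after the click.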

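VI. Detecting "展开全文" (Expand Full Text) and Crawling the Data

1. Find the CSS selector of each element so that it can be located (the key point is that the selector should identify the element uniquely). In Chrome you can right-click an element, choose Inspect, then Copy > Copy selector.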


2. Locate the elements with the selectors and read the data

comment = []
username = []

# Grab the nodes: each post is one node (user info, text, date, and so on); if a page has 20 posts, len(nodes) is 20
nodes = driver.find_elements(By.CSS_SELECTOR, 'div.card > div.card-feed > div.content')

# Loop over the nodes
for node in nodes:
    # Check whether the node has an "展开全文" (expand full text) link; find_elements returns an empty list if it does not
    unfold = node.find_elements(By.CSS_SELECTOR, "p>a[action-type='fl_unfold']")

    # If the link exists and its text starts with "展开全文c", click it and read the full text; otherwise read the short text
    # (both conditions are needed because this selector also matches other elements, so it is not unique)
    if unfold and unfold[0].text.startswith('展开全文c'):
        unfold[0].click()
        comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content_full"]').text)
    else:
        comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content"]').text)
    username.append(node.find_element(By.CSS_SELECTOR, "div.info>div:nth-child(2)>a").text)

VII. Pagination

1. Use a for loop to turn the pages; the key is to find the "下一页" (next page) button and click it

for page in range(49):
    print(page)
    # Locate the next-page button
    nextpage_button = driver.find_element(By.LINK_TEXT, '下一页')
    # Click it
    driver.execute_script("arguments[0].click();", nextpage_button)
    # Wait up to 5 seconds for the posts to be present, then process them as before
    wait = WebDriverWait(driver,5)
    nodes1 = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.card > div.card-feed > div.content')))
    for node in nodes1:
        unfold = node.find_elements(By.CSS_SELECTOR, "p>a[action-type='fl_unfold']")
        if unfold and unfold[0].text.startswith('展开全文c'):
            unfold[0].click()
            comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content_full"]').text)
        else:
            comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content"]').text)
        username.append(node.find_element(By.CSS_SELECTOR, "div.info>div:nth-child(2)>a").text)
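
The loop body above repeats the first-page code almost line for line. One possible way to remove the duplication is to move the per-page work into a function and to stop when there is no next-page link. This is a sketch, not the original author's code: scrape_page and go_to_next_page are hypothetical names, and the locator strings "css selector" and "link text" are the values behind Selenium's By.CSS_SELECTOR and By.LINK_TEXT, written out so the snippet has no imports.

```python
CSS = "css selector"      # same string value as selenium's By.CSS_SELECTOR
LINK_TEXT = "link text"   # same string value as selenium's By.LINK_TEXT

CARD = "div.card > div.card-feed > div.content"
UNFOLD = "p>a[action-type='fl_unfold']"
FULL = 'p[node-type="feed_list_content_full"]'
SHORT = 'p[node-type="feed_list_content"]'
USER = "div.info>div:nth-child(2)>a"


def scrape_page(driver):
    """Return a list of (username, text) tuples for the posts on the current page."""
    rows = []
    for node in driver.find_elements(CSS, CARD):
        links = node.find_elements(CSS, UNFOLD)
        if links and links[0].text.startswith("展开全文c"):
            links[0].click()                      # expand, then read the full text
            text = node.find_element(CSS, FULL).text
        else:
            text = node.find_element(CSS, SHORT).text
        rows.append((node.find_element(CSS, USER).text, text))
    return rows


def go_to_next_page(driver):
    """Click the next-page link; return False when the link is absent (last page)."""
    buttons = driver.find_elements(LINK_TEXT, "下一页")
    if not buttons:
        return False
    driver.execute_script("arguments[0].click();", buttons[0])
    return True
```

Keeping each user name and text together in one tuple also guarantees that the two columns stay aligned when they are saved later.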

VIII. Saving the Data

1. Store the fields in a DataFrame

data = pd.DataFrame({'username':username,'comment':comment})


2. Export to Excel (writing .xlsx files with pandas requires the openpyxl package)

data.to_excel("weibo.xlsx")
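
If pandas or openpyxl is not available, a CSV file that Excel can open is a possible alternative. The sketch below uses only the standard library; save_rows is a hypothetical helper, and the length check mirrors the fact that the two lists must line up row by row.

```python
import csv


def save_rows(path, username, comment):
    """Write the two lists side by side to a CSV file and return the row count.

    Raises ValueError when the lists differ in length, since the rows would
    no longer line up.
    """
    if len(username) != len(comment):
        raise ValueError("username and comment must have the same length")
    # utf-8-sig writes a byte-order mark so that Excel detects the encoding of Chinese text
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(["username", "comment"])
        writer.writerows(zip(username, comment))
    return len(username)
```

It would be called as save_rows("weibo.csv", username, comment) once the crawl has finished.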

IX. Complete Code

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
import time
import pandas as pd

'''Open the page and pre-login'''
driver = webdriver.Chrome()
print('Preparing to log in to Weibo.cn...')
# Send the request
driver.get("https://login.sina.com.cn/signup/signin.php")
# Create a wait object with a 5-second timeout (it does not pause by itself)
wait = WebDriverWait(driver,5)
# Important: pause for 1 minute for pre-login; enter the account, password, and verification code in the browser during this time
time.sleep(60)

'''Type the keyword into the search box and run the search'''
# Locate the keyword search box with a CSS selector
s_input = driver.find_element(By.CSS_SELECTOR, '#search_input')
# Type the search term into the box
s_input.send_keys("努力学习")
# Locate the search button
confirm_btn = driver.find_element(By.CSS_SELECTOR, '#search_submit')
# Click it
confirm_btn.click()

# Switch the driver to the new window (the second handle)
driver.switch_to.window(driver.window_handles[1])

'''Crawl the first page'''
comment = []
username = []

# Grab the nodes: each post is one node (user info, text, date, and so on); if a page has 20 posts, len(nodes) is 20
nodes = driver.find_elements(By.CSS_SELECTOR, 'div.card > div.card-feed > div.content')

# Loop over the nodes
for node in nodes:
    # Check whether the node has an "展开全文" (expand full text) link; find_elements returns an empty list if it does not
    unfold = node.find_elements(By.CSS_SELECTOR, "p>a[action-type='fl_unfold']")

    # If the link exists and its text starts with "展开全文c", click it and read the full text; otherwise read the short text
    # (both conditions are needed because this selector also matches other elements, so it is not unique)
    if unfold and unfold[0].text.startswith('展开全文c'):
        unfold[0].click()
        comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content_full"]').text)
    else:
        comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content"]').text)
    username.append(node.find_element(By.CSS_SELECTOR, "div.info>div:nth-child(2)>a").text)

'''Loop over the remaining pages'''
for page in range(49):
    print(page)
    # Locate the next-page button
    nextpage_button = driver.find_element(By.LINK_TEXT, '下一页')
    # Click it
    driver.execute_script("arguments[0].click();", nextpage_button)
    # Wait up to 5 seconds for the posts to be present, then process them as before
    wait = WebDriverWait(driver,5)
    nodes1 = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'div.card > div.card-feed > div.content')))
    for node in nodes1:
        unfold = node.find_elements(By.CSS_SELECTOR, "p>a[action-type='fl_unfold']")
        if unfold and unfold[0].text.startswith('展开全文c'):
            unfold[0].click()
            comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content_full"]').text)
        else:
            comment.append(node.find_element(By.CSS_SELECTOR, 'p[node-type="feed_list_content"]').text)
        username.append(node.find_element(By.CSS_SELECTOR, "div.info>div:nth-child(2)>a").text)

'''Save the data'''
data = pd.DataFrame({'username':username,'comment':comment})
data.to_excel("weibo.xlsx")
