How to Use the BeautifulSoup Library in Python

Updated: 2023-10-27 09:47:47 Author: Lyshark

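The BeautifulSoup library extracts data from HTML and XML files. It converts a complex HTML document into a tree structure and provides simple methods for searching the nodes of that tree, which makes it easy to traverse and modify the document's content. It is widely used in web crawlers and data-extraction applications.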

To use the library, first install it with pip:

Install the package: pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple

21.8.1 Locating Links by Attribute

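HTML attributes make it easy to extract specific elements from a specific page. The code below wraps this in two functions: get_page_attrs fetches a URL and parses it in one step, while search_page parses HTML text that has already been downloaded, so the same page can be queried several times. In both functions, passing attribute as the return type extracts the value of the named attribute, and passing text extracts the element's own text.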

import requests
from bs4 import BeautifulSoup

header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98"}

# Parameter 1: URL of the page to parse
# Parameter 2: CSS selector that locates the target elements
# Parameter 3: name of the tag attribute to extract
# Parameter 4: request timeout in seconds
# Parameter 5: return type ("attribute" returns attribute values, "text" returns element text)
def get_page_attrs(url,regx,attrs,timeout,type):
    respon_page = []
    try:
        respon = requests.get(url=url, headers=header, timeout=timeout)
        # Anything other than HTTP 200 is treated as a failure
        if respon.status_code != 200:
            return None
        soup = BeautifulSoup(respon.text, "html.parser")
        for item in soup.select(regx):
            if type == "attribute":
                # Skip elements that lack the requested attribute
                # instead of raising KeyError and losing every result
                if item.has_attr(attrs):
                    respon_page.append(str(item.attrs[attrs]))
            elif type == "text":
                respon_page.append(str(item.get_text()))
        return respon_page
    except Exception:
        # Network errors, timeouts and selector errors all yield None
        return None

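# Search a page that has already been downloaded (can be called repeatedly)
# Parameter 1: HTML text to parse
# Parameter 2: CSS selector that locates the target elements
# Parameter 3: name of the tag attribute to extract
# Parameter 4: return type ("attribute" returns attribute values, "text" returns element text)
def search_page(data,regx,attrs,type):
    respon_page = []
    if data is not None:
        soup = BeautifulSoup(data, "html.parser")
        for item in soup.select(regx):
            if type == "attribute":
                # Skip elements that lack the requested attribute
                if item.has_attr(attrs):
                    respon_page.append(str(item.attrs[attrs]))
            elif type == "text":
                respon_page.append(str(item.get_text()))
    return respon_page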

With these two wrapper functions, locating elements on a given web page becomes straightforward. First, we use a CSS selector to locate the image links in an article:

if __name__ == "__main__":
    # Locate images with a CSS selector
    ref = get_page_attrs("https://www.cnblogs.com/LyShark/p/15914868.html",
                   "#cnblogs_post_body > p > img",
                   "src",
                   5,
                   "attribute"
                   )
    print(ref)

When this code runs, it selects every image matched by #cnblogs_post_body > p > img on the given page and, because the return type is attribute, returns the value of each image's src attribute.
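The selector logic can also be tried offline, without any network request. The following is a minimal sketch, assuming a small hand-written HTML string; SAMPLE_HTML and extract are illustrative names and are not part of the code above. It shows that the child combinator > only matches images sitting directly inside a p, and that elements lacking the requested attribute are skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded article page
SAMPLE_HTML = """
<div id="cnblogs_post_body">
  <p>Figure one <img src="/images/a.png" alt="first"></p>
  <p>Figure two <img src="/images/b.png"></p>
  <p>No image here</p>
  <div><img src="/images/nested.png"></div>
</div>
"""

def extract(html, selector, attr=None):
    """Return attribute values when attr is given, otherwise element text."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for element in soup.select(selector):
        if attr is None:
            results.append(element.get_text(strip=True))
        elif element.has_attr(attr):
            # Elements without the attribute are silently skipped
            results.append(element[attr])
    return results

# Only <img> tags that are direct children of a direct-child <p> match
image_links = extract(SAMPLE_HTML, "#cnblogs_post_body > p > img", "src")
# Only one of the matched images carries an alt attribute
alt_texts = extract(SAMPLE_HTML, "#cnblogs_post_body > p > img", "alt")
# With no attribute name, the text of each matched element is returned
paragraphs = extract(SAMPLE_HTML, "#cnblogs_post_body > p")

print(image_links)
print(alt_texts)
print(paragraphs)
```

The image nested inside the inner div is not returned, because it is not a child of a p element.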

Next, we use the same function to locate the list of articles. The approach is the same, except that the third argument (the attribute name) becomes href. The code below locates the article list in two different ways:

if __name__ == "__main__":
    # Locate the article list; either selector works
    ref = get_page_attrs("https://www.cnblogs.com/lyshark",
                   "#mainContent > div > div > div.postTitle > a",
                   "href",
                   5,
                   "attribute"
                   )
    print(ref)
    ref = get_page_attrs("https://www.cnblogs.com/lyshark",
                   "div[class='day'] div[class='postCon'] div a",
                   "href",
                   5,
                   "attribute"
                   )
    print(ref)

After running, the code prints the address of every article on the home page of the lyshark blog, as shown in the figure below.

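To locate article content instead, set the third argument (the attribute name) to an empty string and the fifth argument (the return type) to text, which means only the text inside the matched elements is extracted.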

if __name__ == "__main__":
    # Locate the text of each article summary
    ref = get_page_attrs("https://www.cnblogs.com/lyshark",
                   "div[class='day'] div[class='postCon'] div[class='c_b_p_desc']",
                   "",
                   5,
                   "text"
                   )
    # get_page_attrs returns None on failure, so fall back to an empty list
    for index in ref or []:
        print(index)

Running this snippet extracts the summary text of every article on the home page, as shown in the figure below.

When several elements have to be located on the same page, use search_page instead. In the code below we look for two elements on one page, so the page is downloaded once and searched twice:

if __name__ == "__main__":
    respon = requests.get(url="https://yiyuan.9939.com/yyk_47122/", headers=header, timeout=5)
    ref = search_page(respon.text,
                      "body > div.hos_top > div > div.info > div.detail.word-break > h1 > a",
                      "",
                      "text"
                      )
    print(ref)
    ref = search_page(respon.text,
                      "body > div.hos_top > div > div.info > div.detail.word-break > div.tel > span",
                      "",
                      "text"
                      )
    print(ref)

After running, the code issues a single request and then prints the two elements one after the other, as shown in the figure below.
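Note that search_page builds a new BeautifulSoup object on every call. When many selectors are applied to the same page, one possible variation is to parse once and run all the selectors against the same tree. The sketch below assumes a hand-written HTML fragment whose class names mirror the selectors above; select_many and SAMPLE_HTML are hypothetical names introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the class names mirror the selectors used above
SAMPLE_HTML = """
<div class="hos_top"><div class="info">
  <div class="detail word-break">
    <h1><a href="/hospital/1">Example Hospital</a></h1>
    <div class="tel"><span>010-0000-0000</span></div>
  </div>
</div></div>
"""

def select_many(html, selectors):
    """Parse html once and return {name: [text, ...]} for each named selector."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        name: [element.get_text(strip=True) for element in soup.select(css)]
        for name, css in selectors.items()
    }

fields = select_many(SAMPLE_HTML, {
    "name": "div.detail.word-break > h1 > a",
    "phone": "div.detail.word-break > div.tel > span",
})
print(fields)
```

Returning a dictionary keyed by field name keeps each result paired with the selector that produced it.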

21.8.2 Querying All Tags

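The find_all function searches an HTML or XML document for every element that matches the given tag name and attributes, and returns the matches as a list. It is intended for precise filtering and collects all matching data on the page in a single call.

Its basic syntax is: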

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

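name: a tag name or a list of tag names to search for; if it is True or None, all tag elements are matched
attrs: a dictionary of attribute names and values; only elements carrying those attributes with those values are matched
recursive: a boolean that controls whether descendants are searched recursively; the default is True
text: a string or regular expression matched against the element's text content (newer Beautiful Soup releases name this argument string)
limit: an integer that caps the number of matches returned
kwargs: additional keyword arguments, each treated as an attribute name and value to match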
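The effect of these parameters is easiest to see on a small document. The following is a minimal sketch on a hand-written HTML string (SAMPLE_HTML and all variable names are assumptions made for illustration); it exercises name, attrs, the class_ keyword, recursive and limit.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to exercise the parameters
SAMPLE_HTML = """
<div id="menu">
  <a id="home" class="nav" href="/">Home</a>
  <a class="nav external" href="https://example.com">Example</a>
  <p>Intro <a href="/deep">Deep link</a></p>
  <span class="nav">Not a link</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
menu = soup.find("div", id="menu")

# name: every <a> tag in the document
all_links = [a["href"] for a in soup.find_all("a")]
# keyword argument: class_ matches any one of an element's classes
nav_links = [a["href"] for a in soup.find_all("a", class_="nav")]
# attrs: the same kind of filter expressed as a dictionary
by_attrs = [a["href"] for a in soup.find_all("a", attrs={"id": "home"})]
# recursive=False: only direct children of the menu div are considered
direct = [a["href"] for a in menu.find_all("a", recursive=False)]
# limit: stop after the first match
first_only = [a["href"] for a in soup.find_all("a", limit=1)]
# name as a list: match several tag names at once
names = [tag.name for tag in soup.find_all(["p", "span"])]

print(all_links, nav_links, by_attrs, direct, first_only, names)
```

The link inside the p element appears in all_links but not in direct, which is the difference recursive makes.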

As an example, we print a list of CVE vulnerabilities. find_all queries every a tag on the page and returns a list; by processing each element of that list, we print the link text, the URL and the corresponding CVE number in turn.

import re
import requests
from bs4 import BeautifulSoup

header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98"}

# Find every a tag whose class is c_b_p_desc_readmore and extract its href
# print(bs.find_all('a',class_='c_b_p_desc_readmore')[0]['href'])
# Find every a tag whose id is blog_nav_admin and whose class is menu, then extract its href
# print(bs.find_all('a',id='blog_nav_admin',class_='menu')[0]['href'])
# print(bs.find_all('a',id='blog_nav_admin',class_='menu')[0].attrs['href'])
if __name__ == "__main__":
    url = "https://cassandra.cerias.purdue.edu/CVE_changes/today.html"
    ret = requests.get(url=url, headers=header, timeout=5)
    soup = BeautifulSoup(ret.text, 'html.parser')
    for index in soup.find_all('a'):
        href = index.get('href')
        text = index.get_text()
        cve_number = re.findall("[0-9]{1,}-.*", text)
        # Skip links whose text contains no CVE-style number,
        # otherwise cve_number[0] raises IndexError
        if not cve_number:
            continue
        print("Number: {:20} URL: {} CVE-{}".format(text, href, cve_number[0]))

Readers can run the code above to match all the CVE numbers on the current page, as shown in the figure below.
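The matching step can be checked without fetching the live page. Below is a minimal sketch, assuming a hand-written list of links and a stricter pattern (four digits, a hyphen, then four or more digits) than the one used above; SAMPLE_HTML, CVE_PATTERN and collect_cves are illustrative names, and the real page layout may differ.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical listing; the real page may use a different layout
SAMPLE_HTML = """
<a href="https://example.org/cve?name=2023-0001">2023-0001</a>
<a href="https://example.org/cve?name=2023-12345">2023-12345</a>
<a href="/about.html">About this page</a>
"""

# Assumed shape of a CVE number without its "CVE-" prefix
CVE_PATTERN = re.compile(r"\d{4}-\d{4,}")

def collect_cves(html):
    """Return (cve_id, href) pairs for links whose text holds a CVE number."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for link in soup.find_all("a"):
        match = CVE_PATTERN.search(link.get_text())
        if match is None:
            # Navigation links and other non-CVE anchors are ignored
            continue
        found.append(("CVE-" + match.group(0), link.get("href")))
    return found

cves = collect_cves(SAMPLE_HTML)
for cve_id, href in cves:
    print(cve_id, href)
```

Testing the match before indexing into it is what keeps unrelated links from stopping the loop.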

21.8.3 Returning Strings as a List

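In BeautifulSoup4, stripped_strings is a generator that iterates over all the text content inside an HTML tag. It strips leading and trailing whitespace and line breaks from each piece of text, skips pieces that are empty after stripping, and yields only plain strings. It is useful for handling multi-line text and stray whitespace in HTML documents, and it can also return all the strings under an element as a list.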

import requests
from bs4 import BeautifulSoup

header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98"}

if __name__ == "__main__":
    ret = requests.get(url="https://www.cnblogs.com/lyshark", headers=header, timeout=3)
    # decode() already returns str, so no extra str() call is needed
    text = ret.content.decode('utf-8')
    bs = BeautifulSoup(text, "html.parser")
    spans = bs.select('#mainContent > div > div > div.postTitle > a > span')
    for i in spans:
        # Collect the stripped strings and return them as a list
        string_ = list(i.stripped_strings)
        print(string_)

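After running, the code obtains the string content of each selected element and converts it to a list with list, as shown in the figure below.

Combining find_all with the stripped_strings property, we can write a simple weather scraper that shows how the property is used in practice: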
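To make the stripping behaviour concrete, the sketch below compares strings with stripped_strings on a small hand-written fragment. SAMPLE_HTML and the variable names are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with deliberately messy whitespace
SAMPLE_HTML = """
<div class="postTitle">
  <a href="/p/1">
    <span>
        First post
    </span>
    <em> (pinned) </em>
  </a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
link = soup.select_one("div.postTitle > a")

# strings yields every text node, including whitespace-only ones
raw_strings = list(link.strings)
# stripped_strings trims each node and drops the ones left empty
clean_strings = list(link.stripped_strings)
# The cleaned pieces can be joined back into a single line
joined = " ".join(link.stripped_strings)

print(len(raw_strings), clean_strings, joined)
```

Because stripped_strings is a generator, it has to be wrapped in list before it can be indexed or measured with len.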

from bs4 import BeautifulSoup
import requests

head = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
ret = requests.get(url="http://www.weather.com.cn/textFC/beijing.shtml", headers=head, timeout=3)
text = ret.content.decode('utf-8')

bs = BeautifulSoup(text,"html.parser")

# Keep one conMidtab block (index 1, as in the original call) so that
# the row search below is scoped to it; the original discarded this result
conMidtab = bs.find_all('div',class_='conMidtab')[1]

# Find the tr tags inside conMidtab, keeping rows from the third one onward
tr = conMidtab.find_all('tr')[2:]

for i in tr:
    # Find every td tag in the current row
    td = i.find_all('td')
    # Skip rows that are too short to hold a temperature column
    if len(td) < 5:
        continue
    # The first td holds the city name
    city_td = td[0]
    # Collect all descendant strings, dropping empty ones automatically
    city = list(city_td.stripped_strings)[0]
    # The fifth td from the end holds the temperature
    temp = td[-5]
    temperature = list(temp.stripped_strings)[0]
    print('City:{}   Temperature:{}'.format(city,temperature))

Taking Beijing's weather as the example, running the code retrieves the temperature data for every district of Beijing, as shown in the figure below.
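The row-by-row pattern can be rehearsed on a local table before pointing it at a live site. The following is a minimal sketch, assuming a simplified four-column table; the column layout, SAMPLE_HTML and parse_rows are hypothetical and do not describe the real weather page.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the column layout is an assumption
SAMPLE_HTML = """
<div class="conMidtab">
  <table>
    <tr><td>City</td><td>Weather</td><td>High</td><td>Low</td></tr>
    <tr><td><a href="#">Haidian</a></td><td>Sunny</td><td> 18 </td><td>7</td></tr>
    <tr><td><a href="#">Chaoyang</a></td><td>Cloudy</td><td>17</td><td>8</td></tr>
    <tr><td colspan="4">footer note</td></tr>
  </table>
</div>
"""

def parse_rows(html):
    """Return one dict per data row, scoped to the conMidtab block."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", class_="conMidtab")
    rows = []
    # Skip the single header row of this sample table
    for tr in block.find_all("tr")[1:]:
        cells = [" ".join(td.stripped_strings) for td in tr.find_all("td")]
        if len(cells) < 4:
            # Footer or malformed rows do not carry temperature columns
            continue
        rows.append({
            "city": cells[0],
            "high": int(cells[2]),
            "low": int(cells[3]),
        })
    return rows

weather = parse_rows(SAMPLE_HTML)
for row in weather:
    print("City:{city}   High:{high}   Low:{low}".format(**row))
```

Calling find_all on the block rather than on the whole document keeps rows from unrelated tables out of the result.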