Fetching and Parsing Web Page Content with Python

Updated: 2025-05-22 10:50:58 Author: KevinQ

This article shows how to fetch a web page with Python and parse its content, with example code explained step by step.

The two main libraries used for web scraping in Python are requests and bs4: requests sends the HTTP request, and bs4 parses the elements of the returned page.

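Take Ruan Yifeng's blog as an example. My favorite part of his Tech Enthusiast Weekly (科技爱好者周刊) is the "言论" (Quotes) section. Using issue 253 of the weekly, let's see whether we can extract that section.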

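import requests
from bs4 import BeautifulSoup

url = "http://www.ruanyifeng.com/blog/2023/05/weekly-issue-253.html"
# A timeout keeps the script from hanging if the server does not answer.
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.content, "html.parser")

# The blog is written in Simplified Chinese; "言论" is the "Quotes" heading.
first_tag = soup.find("h2", string="言论")
if first_tag is None:
    raise SystemExit("Could not find the Quotes heading on the page")

# Walk the siblings after the heading until the next <h2> (or the end).
next_sibling = first_tag.find_next_sibling(True)  # True: match tags only
content1 = ""
while next_sibling is not None and next_sibling.name != "h2":
    content1 += next_sibling.get_text()
    # content1 += str(next_sibling)  # use this instead to keep the HTML markup
    content1 += "\n\n"
    next_sibling = next_sibling.find_next_sibling(True)
print(content1)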

Running the script prints the text of each element in the Quotes section, with a blank line between elements.
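The same sibling-walking technique can be tried offline. The following is a minimal sketch, assuming a small hand-written HTML snippet in place of the real page; the helper name `section_text` and the sample markup are hypothetical and do not come from the blog.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
SAMPLE_HTML = """
<div>
  <h2>Articles</h2>
  <p>An article link.</p>
  <h2>Quotes</h2>
  <p>First quote.</p>
  <p>Second quote.</p>
  <h2>History</h2>
  <p>An older issue.</p>
</div>
"""


def section_text(soup, heading):
    """Collect the text of every sibling tag between the <h2> whose
    text equals `heading` and the next <h2> (or the end of the parent)."""
    start = soup.find("h2", string=heading)
    if start is None:
        return []  # heading not present on this page
    parts = []
    sibling = start.find_next_sibling(True)  # True: match tags only
    # Stop at the next <h2>, or when the siblings run out (None).
    while sibling is not None and sibling.name != "h2":
        parts.append(sibling.get_text(strip=True))
        sibling = sibling.find_next_sibling(True)
    return parts


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
quotes = section_text(soup, "Quotes")
print("\n\n".join(quotes))
```

Returning a list instead of one concatenated string leaves the choice of separator to the caller, and the two `None` checks cover a missing heading and a section that runs to the end of its parent.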

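The key functions used here are the one that finds a tag and the one that gets the tag following a given tag:

find and find_all

The functions are defined as follows (in recent bs4 releases the text parameter is named string; the old name is still accepted):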

def find(self, name=None, attrs={}, recursive=True, text=None,
         **kwargs):
    """Look in the children of this PageElement and find the first
    PageElement that matches the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A PageElement.
    :rtype: bs4.element.PageElement
    """
    r = None
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
    if l:
        r = l[0]
    return r

def find_all(self, name=None, attrs={}, recursive=True, text=None,
             limit=None, **kwargs):
    """Look in the children of this PageElement and find all
    PageElements that match the given criteria.

    All find_* methods take a common set of arguments. See the online
    documentation for detailed explanations.

    :param name: A filter on tag name.
    :param attrs: A dictionary of filters on attribute values.
    :param recursive: If this is True, find_all() will perform a
        recursive search of this PageElement's children. Otherwise,
        only the direct children will be considered.
    :param limit: Stop looking after finding this many results.
    :kwargs: A dictionary of filters on attribute values.
    :return: A ResultSet of PageElements.
    :rtype: bs4.element.ResultSet
    """
    generator = self.descendants
    if not recursive:
        generator = self.children
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
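To see what the `recursive` and `limit` parameters described in the docstrings do in practice, here is a small sketch on a hand-written snippet. The markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup: two <p> tags are direct children of <div id="box">,
# a third <p> is nested one level deeper inside a <section>.
html = """
<div id="box">
  <p>one</p>
  <section><p>nested</p></section>
  <p>two</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", id="box")

# Default (recursive=True): search all descendants of the <div>.
all_p = box.find_all("p")

# recursive=False: only the direct children of the <div> are considered.
direct_p = box.find_all("p", recursive=False)

# limit=1 stops after the first match; the result is still a list.
first_only = box.find_all("p", limit=1)

print([p.get_text() for p in all_p])
print([p.get_text() for p in direct_p])
print([p.get_text() for p in first_only])
```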

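find returns a single element (the first match, or None if nothing matches), while find_all returns a list of every match. Examples make the difference clearer.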
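A minimal sketch of the difference between the two return types, using a made-up two-paragraph snippet:

```python
from bs4 import BeautifulSoup

html = "<p>first</p><p>second</p>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")            # a single Tag: the first match
all_p = soup.find_all("p")          # a ResultSet (a list subclass) of Tags
missing = soup.find("table")        # no match: find returns None
no_tables = soup.find_all("table")  # no match: find_all returns an empty list

print(first_p.get_text())
print(len(all_p))
print(missing)
print(len(no_tables))
```

Because find can return None, it is worth checking the result before calling methods on it; an empty result from find_all can simply be iterated without any check.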

The filters you can pass in include:

1. String: a tag name such as h2, p, b or a, which searches for <h2>, <p>, <b> or <a> tags respectively. For example:

soup.find_all('b')
# [<b>bold text here</b>]

2. Regular expression: matched against the tag name.

# Import the regular expression module
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# Finds every tag whose name starts with "b", such as body and b

3. List: returns everything that matches any item in the list.

soup.find_all(["a", "b"])
# Output: [<b>bold</b>,
#  <a class="ddd" href="http://xxx">xxx</a>]

4. True: matches every tag in the document, but none of the text strings.
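A short sketch of the True filter on a made-up snippet, showing that only tags are returned:

```python
from bs4 import BeautifulSoup

html = "<div><p>Hello <b>world</b></p></div>"  # made-up sample markup
soup = BeautifulSoup(html, "html.parser")

# True matches every tag; the text nodes "Hello " and "world" are skipped.
tag_names = [tag.name for tag in soup.find_all(True)]
print(tag_names)
```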

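5. Function: the function takes a single argument, the tag, and returns True or False. Every tag for which it returns True is included in the result.

# Check that a tag has a class attribute but no id attribute
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
# Usage
soup.find_all(has_class_but_no_id)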

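6. Function for a specific attribute: passed as a keyword argument, the function receives the attribute's value instead of the whole tag.

# Find all <a> tags whose href does not contain "https"
def not_https(href):
    # href is None for tags without an href attribute, so those are skipped
    return href and not re.compile("https").search(href)
soup.find_all("a", href=not_https)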

With approaches 5 and 6 above you can build quite complex tag filter functions and filter for exactly the tags you need.
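As a sketch of how a tag-level check and an attribute-level check can be combined, the following filters a made-up snippet for plain-http links that carry a class but no id. The markup, the example.com host and the helper names `is_plain_http` and `plain_http_link_without_id` are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup.
html = """
<a class="nav" href="http://example.com/a">plain http, class only</a>
<a class="nav" id="home" href="http://example.com/b">has an id</a>
<a class="nav" href="https://example.com/c">https link</a>
<a href="/relative">no class</a>
<span class="nav">not a link</span>
"""
soup = BeautifulSoup(html, "html.parser")


def has_class_but_no_id(tag):
    # Tag-level check: receives the whole Tag object.
    return tag.has_attr("class") and not tag.has_attr("id")


def is_plain_http(href):
    # Attribute-level check: receives the attribute value (or None).
    return href is not None and re.match(r"http://", href) is not None


def plain_http_link_without_id(tag):
    # Combine both checks in a single tag-level filter.
    return (
        tag.name == "a"
        and has_class_but_no_id(tag)
        and is_plain_http(tag.get("href"))
    )


matches = soup.find_all(plain_http_link_without_id)
print([a["href"] for a in matches])
```

Keeping each check in its own small function makes the conditions easy to test separately before they are combined into one filter.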

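Other related search functions:

find_next_sibling: returns the first sibling tag after the current node
find_previous_sibling: returns the first sibling tag before the current node
find_next: returns the first tag after the current node in the document
find_previous: returns the first tag before the current node in the document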
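The four functions can be compared on a small example. This is a sketch on made-up markup; each call passes an explicit tag name so the result does not depend on how text nodes are handled.

```python
from bs4 import BeautifulSoup

# Made-up sample markup.
html = "<div><h2>Title</h2><p>one <b>bold</b></p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")
first_p = soup.find("p")

# Siblings share the same parent (the <div> here).
next_sib = first_p.find_next_sibling("p")       # <p>two</p>
prev_sib = first_p.find_previous_sibling("h2")  # <h2>Title</h2>

# find_next / find_previous walk the document in parse order,
# so they can step into or out of other elements.
next_tag = first_p.find_next("b")        # the <b> nested inside first_p
prev_tag = first_p.find_previous("div")  # the enclosing <div>

print(next_sib.get_text(), prev_sib.get_text())
print(next_tag.get_text(), prev_tag.name)
```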

More details can be found in the official bs4 documentation.
