Python Web Scraping: A Basic Tutorial on Using BeautifulSoup

Updated: 2022-03-29 09:38:17  Author: hacker707

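Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying a parse tree. This article introduces the basics of using BeautifulSoup for Python web scraping; readers who need it can use it as a reference.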

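Installing bs4

Beautiful Soup 4 is published on PyPI as beautifulsoup4 and imported as bs4. The examples in this tutorial use the lxml parser, so install lxml as well:

pip install lxml

pip install beautifulsoup4

Usage: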

from bs4 import BeautifulSoup

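Comparing lxml and bs4

# lxml: parse the HTML string, then query the tree with XPath
from lxml import etree
tree = etree.HTML(html_doc)
tree.xpath('//title/text()')  # example XPath expression

# bs4: parse the same HTML string, here with the lxml parser
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')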
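To make the comparison concrete, here is a minimal sketch that reads the same two values through both libraries. The sample markup and the variable names are made up for illustration, and it assumes lxml is installed.

```python
from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical sample document, not taken from the article
html_doc = "<html><head><title>Demo page</title></head><body><p class='intro'>Hello</p></body></html>"

# lxml: build an element tree and query it with XPath expressions
tree = etree.HTML(html_doc)
title_via_xpath = tree.xpath("//title/text()")[0]
intro_via_xpath = tree.xpath("//p[@class='intro']/text()")[0]

# bs4: build a soup object, then navigate by tag name or search with find()
soup = BeautifulSoup(html_doc, "lxml")
title_via_soup = soup.title.string
intro_via_soup = soup.find("p", class_="intro").string

print(title_via_xpath, "|", title_via_soup)
print(intro_via_xpath, "|", intro_via_soup)
```

Both approaches reach the same text; XPath packs the path into one expression, while bs4 expresses it as attribute access and method calls.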

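Note:

If you create the soup object without passing 'lxml' (or features="lxml"), Beautiful Soup warns that no parser was explicitly specified and falls back to the best parser installed on the system.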
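The behavior can be observed with the standard warnings module. The following is a rough sketch; the markup string and variable names are illustrative, and it names html.parser explicitly in the second call so that it runs without lxml.

```python
import warnings
from bs4 import BeautifulSoup

markup = "<p>hello</p>"  # illustrative markup

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is not suppressed
    BeautifulSoup(markup)            # no parser named: bs4 picks one and warns
    warned_without_parser = len(caught)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    BeautifulSoup(markup, "html.parser")  # parser named explicitly
    warned_with_parser = len(caught)

print("warnings without a parser:", warned_without_parser)
print("warnings with a parser:", warned_with_parser)
```

Naming the parser also keeps results stable across machines, since the "best available" parser depends on what is installed.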

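Quick start with bs4

Parser comparison (for reference)

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, 'html.parser')	Built into Python; moderate speed	Less lenient with malformed documents (in versions before Python 2.7.3 or 3.2.2)
lxml HTML parser	BeautifulSoup(markup, 'lxml')	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml')	Fast; the only XML parser supported	Requires a C library
html5lib	BeautifulSoup(markup, 'html5lib')	Most lenient; parses documents the way a browser does; produces valid HTML5	Slow; requires an external Python package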
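The difference in leniency shows up when the same broken markup is fed to each parser. This is a small sketch under the assumption that only html.parser is guaranteed to be present; lxml and html5lib are tried and skipped if they are not installed. The broken string is an illustrative input.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # a stray closing tag with no matching opening tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None  # that parser is not installed on this system

for parser, output in results.items():
    print(parser, "->", output)
```

Each parser repairs the document in its own way, so a selector that works with one parser may not match with another.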

Object types

Tag: a tag
BeautifulSoup: the bs object for the whole document
NavigableString: a navigable string (the text inside a tag)
Comment: a comment

from bs4 import BeautifulSoup

# Create a string that simulates an HTML document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

<span><!--example comment content--></span>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.title))  # <class 'bs4.element.Tag'>
print(type(soup))  # <class 'bs4.BeautifulSoup'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
print(type(soup.span.string))  # <class 'bs4.element.Comment'>
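One practical consequence of these types: Comment is a subclass of NavigableString, so .string can hand back comment text where ordinary text was expected. A minimal sketch of telling them apart follows; the helper visible_string is a hypothetical name introduced here, and html.parser is used so the snippet needs nothing beyond bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

soup = BeautifulSoup("<p>visible text</p><span><!--hidden note--></span>", "html.parser")

text_node = soup.p.string        # ordinary text
comment_node = soup.span.string  # text of an HTML comment

print(isinstance(soup.p, Tag))
print(isinstance(text_node, NavigableString), isinstance(text_node, Comment))
print(isinstance(comment_node, NavigableString), isinstance(comment_node, Comment))


def visible_string(tag):
    """Return tag.string as str, or None if it is missing or is a comment."""
    node = tag.string
    if node is None or isinstance(node, Comment):
        return None
    return str(node)


print(visible_string(soup.p))
print(visible_string(soup.span))
```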

Basic usage of bs4

Getting tag content

from bs4 import BeautifulSoup

# Create a string that simulates an HTML document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag:\n', soup.head)  # print the head tag
print('body tag:\n', soup.body)  # print the body tag
print('html tag:\n', soup.html)  # print the html tag
print('p tag:\n', soup.p)  # print the first p tag

Note: printing soup.p shows only the first p tag. To get every p tag, use find_all

print('all p tags:\n', soup.find_all('p'))

Here the tag name is passed to find_all as a string; find_all also accepts other filter types, such as a list of names
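The relationship between attribute access, find, and find_all can be sketched as follows. The three-paragraph document and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical document with three p tags
html_doc = "<body><p class='title'>one</p><p class='story'>two</p><p class='story'>three</p></body>"
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p                                  # first match only
same_as_find = soup.find("p")                     # find() also returns the first match
all_p = soup.find_all("p")                        # list of every match
first_two = soup.find_all("p", limit=2)           # stop after two matches
stories = soup.find_all("p", class_="story")      # filter by CSS class

print(first_p)
print(len(all_p), len(first_two))
print([p.string for p in stories])
```

soup.p is shorthand for soup.find('p'), which is why it never returns more than one tag.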

Getting a tag's name

Use the name attribute to get a tag's name

from bs4 import BeautifulSoup

# Create a string that simulates an HTML document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag name:\n', soup.head.name)  # print the name of the head tag
print('body tag name:\n', soup.body.name)  # print the name of the body tag
print('html tag name:\n', soup.html.name)  # print the name of the html tag
print('p tag name:\n', soup.p.name)  # find_all('p') returns a list, which has no .name, so use a single tag
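Because find_all returns a list of tags, names have to be read from each element in turn. A short sketch, using a made-up document; passing True to find_all matches every tag.

```python
from bs4 import BeautifulSoup

# Hypothetical document
html_doc = "<html><head><title>T</title></head><body><p>a</p><p>b</p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Read .name from each tag in the result list
p_names = [p.name for p in soup.find_all("p")]

# find_all(True) matches every tag, in document order
all_names = [tag.name for tag in soup.find_all(True)]

print(p_names)
print(all_names)
```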

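To match more than one kind of tag, pass a list filter rather than a string filter

Passing two strings does not search for two tags: the second positional argument is treated as a CSS class, so this call looks for title tags with class "p" and returns an empty list

print(soup.find_all('title', 'p'))

[]

Use a list filter to get several kinds of tags at once

print(soup.find_all(['title', 'p']))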

[<title>The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
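The three call shapes can be put side by side on a smaller document. This is an illustrative sketch with made-up markup and variable names.

```python
from bs4 import BeautifulSoup

# Hypothetical document
html_doc = (
    "<html><head><title>Story</title></head>"
    "<body><p class='title'>Heading</p><p class='story'>Text</p></body></html>"
)
soup = BeautifulSoup(html_doc, "html.parser")

no_match = soup.find_all("title", "p")   # title tags with class "p": none exist
by_class = soup.find_all("p", "title")   # p tags with class "title"
by_list = soup.find_all(["title", "p"])  # any tag named title or p

print(no_match)
print(by_class)
print([tag.name for tag in by_list])
```

Writing class_="title" as a keyword argument instead of a bare second string makes the intent harder to misread.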

Getting the href attribute of a tags

from bs4 import BeautifulSoup

# Create a string that simulates an HTML document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
a_list = soup.find_all('a')
# Loop over the list and read the attribute value
for a in a_list:
    # Method 1: get() returns None if the attribute is missing
    print(a.get('href'))
    # Method 2: attrs is a dict of all attributes; index it for the one you want (KeyError if missing)
    print(a.attrs['href'])
    # Method 3: index the tag directly (KeyError if the attribute is missing)
    print(a['href'])
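When some links may lack an href, the difference between the access styles matters. The sketch below uses a made-up two-link document in which the second a tag has no href; the "(no link)" default is an arbitrary placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute
html_doc = '<a href="http://example.com/elsie" id="link1">Elsie</a><a id="link2">Lacie</a>'
soup = BeautifulSoup(html_doc, "html.parser")

# get() with a default never raises
hrefs = [a.get("href", "(no link)") for a in soup.find_all("a")]

# href=True keeps only the tags that actually have the attribute
with_links = soup.find_all("a", href=True)

# Indexing raises KeyError when the attribute is missing
missing = []
for a in soup.find_all("a"):
    try:
        a["href"]
    except KeyError:
        missing.append(a["id"])

print(hrefs)
print(len(with_links))
print(missing)
```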

Extra: use prettify() to indent the markup so that the nesting of nodes is easier to analyze

print(soup.prettify())

Output without prettify()

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
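For contrast, a small sketch that prints the same fragment both ways. The one-paragraph markup is illustrative, and the exact indentation may vary slightly between Beautiful Soup versions.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p class='title'><b>The Dormouse's story</b></p>", "html.parser")

compact = str(soup)       # everything on one line
pretty = soup.prettify()  # one node per line, indented by nesting depth

print(compact)
print(pretty)
```

prettify() is meant for reading the structure; it adds whitespace, so it is better not to parse or store its output as the canonical document.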

Traversing the document tree

from bs4 import BeautifulSoup

# Create a string that simulates an HTML document
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
# contents returns a list of all direct children: [<title>The Dormouse's story</title>]
print(head.contents)
# children returns an iterator over the direct children, e.g. <list_iterator object at 0x00000264BADC2748>
print(head.children)
# Any iterator can be looped over
for h in head.children:
    print(h)
html = soup.html  # newlines between tags also count as child nodes
# descendants returns a generator that walks children, grandchildren, and so on, e.g. <generator object Tag.descendants at 0x0000018C15BFF4C8>
print(html.descendants)
# Any generator can be looped over
for h in html.descendants:
    print(h)

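'''
Key attributes to master:
string: gets the text inside a tag
strings: returns a generator that yields the text of every nested tag
stripped_strings: like strings, but strips surrounding whitespace and skips whitespace-only strings
'''
print(soup.title.string)
print(soup.html.string)  # None, because html has more than one child
# Returns a generator object, e.g. <generator object Tag._all_strings at 0x000001AAFF9EF4C8>
# soup.html.strings yields every piece of text inside the html tag
print(soup.html.strings)
for h in soup.html.strings:
    print(h)
# stripped_strings removes the extra whitespace
# Returns a generator object, e.g. <generator object PageElement.stripped_strings at 0x000001E31284F4C8>
print(soup.html.stripped_strings)
for h in soup.html.stripped_strings:
    print(h)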
'''
parent: gets the direct parent node
parents: gets all ancestor nodes
'''
title = soup.title
# parent finds the direct parent node
print(title.parent)
# parents yields every ancestor node
# Returns a generator object, e.g. <generator object PageElement.parents at 0x000001F02049F4C8>
print(title.parents)
for p in title.parents:
    print(p)
# The parent of html is the whole document
print(soup.html.parent)
# <class 'bs4.BeautifulSoup'>
print(type(soup.html.parent))
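The traversal attributes above are easier to compare on a tiny document where every result can be read at a glance. The following is a sketch with made-up markup and variable names, using html.parser so that no wrapper tags are added.

```python
from bs4 import BeautifulSoup

# Hypothetical document; note the extra spaces inside the p tag
html_doc = "<html><head><title>Story</title></head><body><p>  Once <b>upon</b> a time  </p></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Downward: direct children versus all descendants
head = soup.head
child_count = len(head.contents)                # only <title>
descendant_count = len(list(head.descendants))  # <title> plus its text node

# Text: .string works only when a tag has exactly one child
p = soup.p
single = soup.title.string
ambiguous = p.string                      # None: p holds text, <b>, and more text
raw_parts = list(p.strings)               # whitespace kept
clean_parts = list(p.stripped_strings)    # whitespace stripped

# Upward: walk from title to the document root
ancestor_names = [parent.name for parent in soup.title.parents]

print(child_count, descendant_count)
print(single, ambiguous)
print(raw_parts)
print(clean_parts)
print(ancestor_names)
```

The last ancestor is the BeautifulSoup object itself, whose name is the special value [document].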

Practice exercise

Get all job titles

Approach

In the sample HTML table, each job title is the text of the a tag inside a tr row. So we only need to loop over the tr nodes and read the text of the a tag in each one.

Implementation

from bs4 import BeautifulSoup

html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">職位名稱</td>
            <td>職位類別</td>
            <td>人數</td>
            <td>地點</td>
            <td>發布時間</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云區塊鏈高級研發工程師（深圳）</a></td>
            <td>技術類</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高級后臺開發</a></td>
            <td>技術類</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-騰訊音樂運營開發工程師（深圳）</a></td>
            <td>技術類</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-騰訊音樂業務運維工程師（深圳）</a></td>
            <td>技術類</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高級研發工程師（深圳）</a></td>
            <td>技術類</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高級圖像算法研發工程師（深圳）</a></td>
            <td>技術類</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高級AI開發工程師（深圳）</a></td>
            <td>技術類</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后臺開發工程師</a></td>
            <td>技術類</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后臺開發工程師</a></td>
            <td>技術類</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高級業務運維工程師（深圳）</a></td>
            <td>技術類</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Use find_all() to get every tr node (the first tr is the table header, so skip it)
tr_list = soup.find_all('tr')[1:]
# Loop over tr_list and print the text of the a tag in each row
for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list[0].string)

The output is as follows:

22989-金融云區塊鏈高級研發工程師（深圳）
22989-金融云高級后臺開發
SNG16-騰訊音樂運營開發工程師（深圳）
SNG16-騰訊音樂業務運維工程師（深圳）
TEG03-高級研發工程師（深圳）
TEG03-高級圖像算法研發工程師（深圳）
TEG11-高級AI開發工程師（深圳）
15851-后臺開發工程師
15851-后臺開發工程師
SNG11-高級業務運維工程師（深圳）
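The same loop can be extended to collect every column of a row, not just the title. Below is a rough sketch on a shortened, hypothetical stand-in for the job table (two data rows with invented English values); the dictionary keys are names chosen here for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the job table
html = """
<table class="tablelist">
  <tr class="h"><td>Title</td><td>Category</td><td>Openings</td><td>Location</td><td>Date</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td><td>Tech</td><td>2</td><td>Shenzhen</td><td>2017-11-25</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Ops Engineer</a></td><td>Tech</td><td>1</td><td>Shenzhen</td><td>2017-11-24</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

jobs = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    cells = tr.find_all("td")
    link = tr.find("a")
    if link is None or len(cells) < 5:
        continue  # skip rows that do not have the expected shape
    jobs.append({
        "title": link.get_text(strip=True),
        "href": link.get("href"),
        "category": cells[1].get_text(strip=True),
        "openings": int(cells[2].get_text(strip=True)),
        "location": cells[3].get_text(strip=True),
        "date": cells[4].get_text(strip=True),
    })

for job in jobs:
    print(job)
```

Checking for a missing a tag before indexing avoids the IndexError that a_list[0] would raise on a row without a link.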

Summary

This concludes the introduction to the basic use of BeautifulSoup for Python web scraping.
