Parsing HTML with BeautifulSoup in Python

Updated: 2020-01-17 09:32:33 Author: 東凌閣

This article walks through parsing HTML with BeautifulSoup in Python, using worked code examples throughout. It is intended as a practical reference for study or day-to-day work.

Abstract

Beautiful Soup is a Python library for extracting data from HTML or XML files. It parses HTML or XML data into Python objects so that the data can be processed conveniently with Python code.

Environment

CentOS 7.5
Python 2.7
BeautifulSoup 4

Using Beautiful Soup

The basic job of Beautiful Soup is to find and edit HTML tags.

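Basic concepts: object types

Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. These objects fall into four types: Tag, NavigableString, BeautifulSoup, and Comment.

Object type	Description
BeautifulSoup	The entire content of the document
Tag	An HTML tag
NavigableString	The text contained in a tag
Comment	A special kind of NavigableString, used when the text inside a tag is a comment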
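
To make the four types concrete, here is a minimal sketch. The sample markup is made up for illustration, and the built-in "html.parser" is assumed so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical sample markup, chosen only for illustration.
markup = "<p>Hello</p><b><!--a hidden note--></b>"
soup = BeautifulSoup(markup, "html.parser")

text_node = soup.p.string     # the text inside <p>
comment_node = soup.b.string  # the comment inside <b>

print(type(soup).__name__)          # BeautifulSoup
print(type(soup.p).__name__)        # Tag
print(type(text_node).__name__)     # NavigableString
print(type(comment_node).__name__)  # Comment

# Comment is a subclass of NavigableString, so this prints True.
print(isinstance(comment_node, NavigableString))
```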

Installation and import

# Beautiful Soup
pip install beautifulsoup4

# Parsers
pip install lxml
pip install html5lib

# Initialization
from bs4 import BeautifulSoup

# Option 1: pass an open file object
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'lxml')

# Option 2: pass the markup as a string
resp = "<html>data</html>"
soup = BeautifulSoup(resp, 'lxml')

# soup is a BeautifulSoup object
print(type(soup))

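Searching and filtering tags

Basic methods

There are two basic search methods, find_all() and find(). find_all() returns a list of every tag that matches the filter, while find() returns only the first match (or None if nothing matches).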

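import re

soup = BeautifulSoup(resp, 'lxml')

# Return the first Tag named "a"
soup.find("a")

# Return a list of all matching tags
soup.find_all("a")

# Calling the soup object directly is shorthand for find_all()
soup("a")

# Find all tags whose names start with "b"
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)

# Find all tags whose names appear in the list
soup.find_all(["a", "p"])

# Find <p> tags whose class attribute is "title"
soup.find_all("p", "title")

# Find tags whose id attribute is "link2"
soup.find_all(id="link2")

# Find tags that have an id attribute at all
soup.find_all(id=True)

# Combine filters: href matches "elsie" and id is "link1"
soup.find_all(href=re.compile("elsie"), id='link1')

# Filter on attributes whose names cannot be used as keyword arguments
soup.find_all(attrs={"data-foo": "value"})

# Find the first string that contains "sisters"
soup.find(string=re.compile("sisters"))

# Limit the number of results
soup.find_all("a", limit=2)

# Custom matching function applied to each tag
def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')
soup.find_all(has_class_but_no_id)

# Custom matching function applied to a single attribute only
def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)

# find_all() on a tag searches all of that tag's descendants.
# To search only the tag's direct children, pass recursive=False.
soup.find_all("title", recursive=False)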
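
The difference between find() and find_all(), and the effect of recursive=False, can be seen in a small self-contained sketch. The markup below is a hypothetical sample, and "html.parser" is assumed so that no html or body wrapper tags are added around it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = """
<p class="title"><b>Story</b></p>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.find("a")      # the first match only
all_links = soup.find_all("a")   # every match, as a list
missing = soup.find("table")     # None, because nothing matches

# Tag names that start with "b"
b_tags = [t.name for t in soup.find_all(re.compile("^b"))]

# recursive=False looks only at direct children of the object searched.
direct_links = soup.find_all("a", recursive=False)  # both <a> tags are top-level
direct_b = soup.find_all("b", recursive=False)      # <b> is nested inside <p>

print(first_link["id"], len(all_links), missing, b_tags)
```

Because find() returns None when nothing matches, it is usually worth checking the result before accessing attributes on it.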

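Extended methods

find_parents()	All ancestor nodes
find_parent()	The first (nearest) ancestor node
find_next_siblings()	All following sibling nodes
find_next_sibling()	The first following sibling node
find_previous_siblings()	All preceding sibling nodes
find_previous_sibling()	The first preceding sibling node
find_all_next()	All following elements
find_next()	The first following element
find_all_previous()	All preceding elements
find_previous()	The first preceding element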
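
These methods take the same filters as find() and find_all(), but search in a different direction from the starting tag. A short sketch, assuming a made-up document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = ('<div id="box"><p id="first">One</p>'
        '<p id="second">Two</p><p id="third">Three</p></div>')
soup = BeautifulSoup(html, "html.parser")
second = soup.find(id="second")

parent = second.find_parent("div")            # nearest enclosing <div>
next_p = second.find_next_sibling("p")        # first <p> sibling after it
prev_p = second.find_previous_sibling("p")    # first <p> sibling before it
later = second.find_all_next("p")             # every <p> later in the document

print(parent["id"], next_p["id"], prev_p["id"], [p["id"] for p in later])
```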

CSS selectors

Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

html_doc = """
<html>
<head>
 <title>The Dormouse's story</title>
</head>
<body>
 <p class="title"><b>The Dormouse's story</b></p>

 <p class="story">
  Once upon a time there were three little sisters; and their names were
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  and
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
  and they lived at the bottom of a well.
 </p>

 <p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')

# All <a> tags
soup.select("a")

# Descendant lookup, level by level
soup.select("body a")
soup.select("html head title")

# Direct children of a tag
soup.select("head > title")
soup.select("p > #link1")

# All following siblings of the matched tag
soup.select("#link1 ~ .sister")

# The sibling immediately after the matched tag
soup.select("#link1 + .sister")

# By class name
soup.select(".sister")
soup.select("[class~=sister]")

# By ID
soup.select("#link1")
soup.select("a#link1")

# By any of several IDs
soup.select("#link1,#link2")

# By the presence of an attribute
soup.select('a[href]')

# By attribute value (prefix, suffix, substring)
soup.select('a[href^="http://example.com/"]')
soup.select('a[href$="tillie"]')
soup.select('a[href*=".com/el"]')

# Return at most one result (still a list)
soup.select(".sister", limit=1)

# Return only the first match (a Tag rather than a list)
soup.select_one(".sister")
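
A compact, runnable version of a few of these selectors is sketched below. The markup and href values are illustrative assumptions, and "html.parser" is used to avoid depending on an external parser.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
html = """
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
  <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# select() always returns a list of Tag objects.
ids = [a["id"] for a in soup.select("p.story > a.sister")]
after_first = [a["id"] for a in soup.select("#link1 ~ .sister")]

# select_one() returns a single Tag, or None when nothing matches.
ends_with = soup.select_one('a[href$="tillie"]')
nothing = soup.select_one("a.brother")

print(ids, after_first, ends_with.get_text(), nothing)
```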

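Tag object methods

Tag attributes

soup = BeautifulSoup('<p class="body strikeout" id="1">Extremely bold</p><p class="body strikeout" id="2">Extremely bold2</p>', 'lxml')
# Get all <p> tag objects
tags = soup.find_all("p")
# Get the first <p> tag object
tag = soup.p
# Show the object's type
type(tag)
# Tag name
tag.name
# Tag attributes, as a dictionary
tag.attrs
# Value of the class attribute
tag['class']
# Text contained in the tag, as a NavigableString
tag.string

# Iterate over every string inside the tag
for string in tag.strings:
  print(repr(string))

# Iterate over every string inside the tag, with surrounding whitespace stripped and blank strings skipped
for string in tag.stripped_strings:
  print(repr(string))

# Get all text inside the tag, including text in descendant tags, as one Unicode string
tag.get_text()
## Joined with "|"
tag.get_text("|")
## Joined with "|", with whitespace stripped and empty strings omitted
tag.get_text("|", strip=True)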
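
The difference between .string, .stripped_strings, and get_text() shows up once a tag contains nested markup. A minimal sketch, using a made-up paragraph and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a <p> with text and a nested <b>.
html = '<p class="body strikeout" id="1"> Extremely <b>bold</b> text </p>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.p

classes = tag["class"]   # class is multi-valued, so this is a list
tag_id = tag["id"]       # id is single-valued, so this is a string
single = tag.string      # None: the tag has more than one child

pieces = list(tag.stripped_strings)
joined = tag.get_text("|", strip=True)

print(classes, tag_id, single)
print(pieces, joined)
```
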

Getting child nodes

tag.contents # a list of the tag's direct children
tag.children # an iterator over the tag's direct children
for child in tag.children:
  print(child)

tag.descendants # recursively yields all descendant nodes
for child in tag.descendants:
  print(child)
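
To see how direct children differ from descendants, consider the following sketch. The nested markup is hypothetical and parsed with "html.parser".

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup, for illustration only.
soup = BeautifulSoup("<div><p>One <b>two</b></p><p>Three</p></div>", "html.parser")
div = soup.div

# Direct children only: the two <p> tags.
child_names = [child.name for child in div.children]

# Descendants include nested tags and every text node.
all_nodes = list(div.descendants)
tag_names = [node.name for node in all_nodes if isinstance(node, Tag)]

print(len(div.contents), child_names)
print(len(all_nodes), tag_names)
```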

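Getting parent nodes

tag.parent # the tag's direct parent
tag.parents # iterates over all ancestors, from the nearest one up to the document itself

# The last ancestor is the BeautifulSoup object, whose name is "[document]"
for parent in tag.parents:
  print(parent.name)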

Getting sibling nodes

# The next sibling
tag.next_sibling

# All siblings after the current tag
tag.next_siblings
for sibling in tag.next_siblings:
  print(repr(sibling))

# The previous sibling
tag.previous_sibling

# All siblings before the current tag
tag.previous_siblings
for sibling in tag.previous_siblings:
  print(repr(sibling))
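
One detail that is easy to miss: whitespace between tags is a text node, so it counts as a sibling too. The sketch below uses a made-up list and "html.parser" to show this, along with find_next_sibling() as a way to skip over text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with newlines between the tags.
soup = BeautifulSoup("<ul>\n<li>a</li>\n<li>b</li>\n</ul>", "html.parser")
first = soup.li

raw_next = first.next_sibling              # the newline text node
next_item = first.find_next_sibling("li")  # the next <li> tag

print(repr(raw_next))
print(next_item.string)
```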

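Traversing elements

Beautiful Soup treats each tag and string as an "element". Elements are ordered from top to bottom as they appear in the HTML, and the traversal attributes below step through them one at a time.

# The element right after the current tag
tag.next_element

# All elements after the current tag
for element in tag.next_elements:
  print(repr(element))

# The element right before the current tag
tag.previous_element
# All elements before the current tag
for element in tag.previous_elements:
  print(repr(element))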
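
The following sketch contrasts next_element with next_sibling, since the two are easy to confuse. The markup is a hypothetical sample parsed with "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first = soup.p

# next_sibling stays on the same level of the tree: the second <p>.
sibling = first.next_sibling

# next_element follows parse order: the text inside the first <p>.
element = first.next_element

# Everything parsed after the first <p> tag, in document order.
following = [str(el) for el in first.next_elements]

print(sibling, repr(element), following)
```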

Modifying tag attributes

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b

# Rename the tag and set or add attributes
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1

# Replace the tag's text
tag.string = "New link text."
print(tag)

Modifying tag content (NavigableString)

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'lxml')
tag = soup.b
tag.string = "New link text."

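Appending tag content (NavigableString)

from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
tag.append("Bar")
tag.contents

# Alternatively, build the NavigableString explicitly

new_string = NavigableString("Bar")
tag.append(new_string)
print(tag)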

Adding a comment (Comment)

A comment is a special kind of NavigableString object, so it can also be added with the append() method.

from bs4 import Comment
soup = BeautifulSoup("<a>Foo</a>", 'lxml')
tag = soup.a
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)
print(tag)

Adding a tag (Tag)

There are two ways to add a tag: inside a given tag (the append method), or at a given position (the insert, insert_before, and insert_after methods).

The append method

soup = BeautifulSoup("<b></b>", 'lxml')
tag = soup.b
new_tag = soup.new_tag("a", rel="external nofollow")
new_tag.string = "Link text."
tag.append(new_tag)
print(tag)

The insert method inserts an object (Tag or NavigableString) at a given position in the current tag's list of children.

html = '<b><a rel="external nofollow">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.a
tag.contents
tag.insert(1, "but did not endorse ")
tag.contents

The insert_before() and insert_after() methods add an element as a sibling immediately before or after the current tag.

html = '<b><a rel="external nofollow">I linked to <i>example.com</i></a></b>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.insert_before(tag)
# The new <i> tag now sits just before <b>
print(soup)
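
Since the example above only exercises insert_before(), here is a small sketch that uses both methods together. The markup is made up for illustration and parsed with "html.parser" so the output contains no wrapper tags.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
soup = BeautifulSoup("<p><b>stop</b></p>", "html.parser")

new_tag = soup.new_tag("i")
new_tag.string = "Don't "

soup.b.insert_before(new_tag)  # becomes the sibling just before <b>
soup.b.insert_after(" now")    # plain strings are accepted as well

result = str(soup.p)
print(result)  # <p><i>Don't </i><b>stop</b> now</p>
```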

wrap() wraps the given element in a tag and returns the new wrapper; unwrap() does the opposite and replaces a tag with its contents.

# Wrapping
soup = BeautifulSoup("<p>I wish I was bold.</p>", 'lxml')
soup.p.string.wrap(soup.new_tag("b"))
# Output: <b>I wish I was bold.</b>

soup.p.wrap(soup.new_tag("div"))
# Output: <div><p><b>I wish I was bold.</b></p></div>

# Unwrapping
markup = '<a rel="external nofollow">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
a_tag = soup.a

a_tag.i.unwrap()
a_tag
# Output: <a rel="external nofollow">I linked to example.com</a>

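Removing tags

html = '<b><a rel="external nofollow">I linked to <i>example.com</i></a></b>'
# Each example below re-parses html, because the previous call has already changed the tree.

# Remove all children of the current tag
soup = BeautifulSoup(html, 'lxml')
soup.b.clear()

# Remove the current tag and all of its children from the soup, and return the removed tag
soup = BeautifulSoup(html, 'lxml')
b_tag = soup.b.extract()
b_tag
soup

# Remove the current tag and all of its children from the soup and destroy them; nothing is returned
soup = BeautifulSoup(html, 'lxml')
soup.b.decompose()

# Replace the current tag with another element
soup = BeautifulSoup(html, 'lxml')
tag = soup.i
new_tag = soup.new_tag("p")
new_tag.string = "Don't"
tag.replace_with(new_tag)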
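
The practical difference between extract() and decompose() is whether the removed subtree stays usable. A minimal sketch, assuming a made-up document and the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, for illustration only.
soup = BeautifulSoup("<div><p>keep</p><span>drop</span></div>", "html.parser")

# extract() detaches the tag and hands it back as a standalone tree.
span = soup.span.extract()
after_extract = str(soup)

# decompose() removes the tag and returns nothing.
returned = soup.p.decompose()
after_decompose = str(soup)

print(after_extract)    # <div><p>keep</p></div>
print(span)             # <span>drop</span>
print(after_decompose)  # <div></div>
```

Use extract() when the removed tag will be inspected or re-inserted elsewhere, and decompose() when it is simply being thrown away.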

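Other methods

Output

# Pretty-printed output
tag.prettify()
tag.prettify("latin-1")

After Beautiful Soup parses a document, everything is converted to Unicode, including HTML entities, which become the corresponding Unicode characters. When the document is converted back to a string, those characters are encoded as UTF-8, so the original HTML entities are not restored.
When working with Unicode, Beautiful Soup can also convert smart quotes into HTML or XML entities.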
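
The entity behavior described above can be observed directly. This sketch uses made-up markup and "html.parser"; the formatter="html" argument on the last line is the documented way to ask for named entities on output, and its exact result is left unstated here because it can vary between versions.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing HTML entities.
soup = BeautifulSoup("<p>&ldquo;Hi&rdquo; &amp; bye</p>", "html.parser")

text = soup.p.string   # entities are now ordinary Unicode characters
as_string = str(soup)  # by default only &, < and > are escaped on output

print(repr(text))
print(as_string)

# Request named entities again when serializing.
print(soup.p.decode(formatter="html"))
```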

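Document encoding

After parsing, every document is converted to Unicode. Beautiful Soup uses its "automatic encoding detection" sub-library to identify the document's encoding and convert it to Unicode.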

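soup = BeautifulSoup(html, 'lxml')
soup.original_encoding

# The document encoding can also be specified manually
# (from_encoding only takes effect when the markup is a bytestring)
soup = BeautifulSoup(html, 'lxml', from_encoding="iso-8859-8")
soup.original_encoding

# To make automatic detection more efficient, some encodings can be ruled out in advance
soup = BeautifulSoup(markup, 'lxml', exclude_encodings=["ISO-8859-7"])

When Beautiful Soup outputs a document, the default output encoding is UTF-8, regardless of the encoding of the input document.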
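
A round trip makes the input and output encodings easier to follow. The sketch below assumes Latin-1 encoded bytes as a hypothetical input and uses "html.parser"; the value printed for original_encoding is not shown because its exact spelling may depend on the Beautiful Soup version.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as Latin-1 rather than UTF-8.
data = "<p>caf\u00e9</p>".encode("latin-1")

soup = BeautifulSoup(data, "html.parser", from_encoding="latin-1")
print(soup.original_encoding)  # the encoding used to decode the input

text = soup.p.string                     # a Unicode string
utf8_bytes = soup.encode("utf-8")        # serialize the document as UTF-8
latin1_bytes = soup.p.encode("latin-1")  # or request a different encoding

print(text, utf8_bytes, latin1_bytes)
```
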

Document parsers

Beautiful Soup currently supports the "lxml", "html5lib", and "html.parser" parsers.

soup = BeautifulSoup("<a><b /></a>", "lxml")
soup
# Output: <html><body><a><b></b></a></body></html>
soup = BeautifulSoup("<a></p>", "lxml")
soup
# Output: <html><body><a></a></body></html>
soup = BeautifulSoup("<a></p>", "html5lib")
soup
# Output: <html><head></head><body><a><p></p></a></body></html>
soup = BeautifulSoup("<a></p>", "html.parser")
soup
# Output: <a></a>
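
Because lxml and html5lib are optional installs, a script that compares parsers may want to tolerate a missing one. The following is only a sketch of that idea: it tries each parser on the same broken markup and records None for any parser that is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup("<a></p>", parser))
    except FeatureNotFound:
        # Raised when the requested parser library is not available.
        results[parser] = None

for parser, output in results.items():
    print(parser, "->", output)
```

Naming the parser explicitly, as done throughout this article, keeps the result independent of which parsers happen to be installed on a given machine.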

References
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh
