An Introduction to the Python Scraping Library BeautifulSoup, with Simple Usage Examples
1. Introduction
BeautifulSoup is a flexible, convenient web-page parsing library. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.
Parsers commonly used with Python
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; reasonable speed; tolerant of malformed documents | Much less tolerant in Python versions before 2.7.3 / 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; tolerant of malformed documents | Requires the lxml C library |
| lxml XML parser | `BeautifulSoup(markup, "xml")` | Very fast; the only parser that supports XML | Requires the lxml C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Most tolerant; parses documents the way a browser does; produces valid HTML5 | Very slow; requires an external Python dependency |
2. Quick Start
Given an HTML document, create a BeautifulSoup object:
```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
```
Print the whole document, formatted:

```python
print(soup.prettify())
```

```
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1" rel="external nofollow">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2" rel="external nofollow">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3" rel="external nofollow">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
```
Browse the structured data:

```python
print(soup.title)              # the <title> tag and its contents
print(soup.title.name)         # the tag's name
print(soup.title.string)       # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)                  # the first <p>
print(soup.p['class'])         # the first <p>'s class
print(soup.a)                  # the first <a>
print(soup.find_all('a'))      # every <a>
print(soup.find(id="link3"))   # the tag whose id is 'link3'
```

```
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1" rel="external nofollow">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1" rel="external nofollow">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2" rel="external nofollow">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3" rel="external nofollow">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3" rel="external nofollow">Tillie</a>
```
```python
for link in soup.find_all('a'):
    print(link.get('href'))
```

```
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
```
Get all of the text content:

```python
print(soup.get_text())
```

```
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
```
Automatic tag completion and formatting

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())              # format the markup; unclosed tags are completed
print(soup.title.string)            # the contents of the <title> tag
```
Tag selectors
Selecting elements

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)                   # select the <title> tag
print(type(soup.title))             # check its type
print(soup.head)
```
Getting a tag's name

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)
```
Getting a tag's attributes

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # the value of the <p> tag's name attribute
print(soup.p['name'])        # another, more direct spelling
```
Getting a tag's content

```python
print(soup.p.string)
```
Nested tag selection

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)
```
Child and descendant nodes

```python
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # the tag's child nodes, as a list
```
Another approach, .children:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)              # an iterator over the tag's child nodes
for i, child in enumerate(soup.p.children):  # i is the index, child the node
    print(i, child)
```
The output is the same as above, with an index added. Note that you have to loop to see the children's contents, because .children is only an iterator, not a list.
Getting descendant nodes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)           # an iterator over the tag's descendant nodes
for i, child in enumerate(soup.p.descendants):  # i is the index, child the node
    print(i, child)
```
Parent and ancestor nodes
parent

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)                # the tag's parent node
```
parents

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')      # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # the tag's ancestor nodes
```
Sibling nodes

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')                # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))      # siblings after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the tag
```
Standard selectors
`find_all(name, attrs, recursive, text, **kwargs)`
Finds elements in the document by tag name, attributes, or text content.
name

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # every <ul> tag
print(type(soup.find_all('ul')[0]))  # check the element type
```
The next example finds all the li tags inside each ul tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
```
attrs (attributes)
Finding elements by attribute:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))      # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))
```
Both calls find the same content, because both attributes sit on the same tag.
Some attributes can be passed as keyword arguments directly:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))       # id is special and can be used directly
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
```
text
Selecting by text content:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # nodes whose text is 'Foo' -- returns strings, not tags
```

So text is handy for matching content, but less convenient for locating the tags that contain it.
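One common workaround is to step from the matched string back up to its enclosing tag via .parent. A minimal sketch (invented two-item markup; using the stdlib html.parser so nothing extra needs installing):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# text= matches NavigableString objects, so the results are strings, not tags
matches = soup.find_all(text='Foo')
print(matches)           # ['Foo']

# to reach the tag that contains the text, step up with .parent
node = soup.find(text='Foo')
print(node.parent.name)  # li
```

In newer bs4 releases the same parameter is also spelled string=; text= still works.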
Methods
find
find works exactly like find_all, but returns only the first matching element instead of a list.
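The difference can be sketched like this (invented markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li')       # first match: a single Tag, or None on a miss
all_li = soup.find_all('li')  # every match: a list, possibly empty

print(first.get_text())    # Foo
print(len(all_li))         # 2
print(soup.find('table'))  # None -- a miss raises no exception
```

Because find() returns None on a miss, check the result before chaining attribute access on it.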
find_parents(), find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
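For example (invented nested markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><ul id="list"><li id="item">Foo</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

li = soup.find('li')
print(li.find_parent().get('id'))  # list -- the direct parent (<ul>)
for p in li.find_parents():        # walks outward: ul, then div, then the document itself
    print(p.name)
```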
find_next_siblings(), find_next_sibling()
find_next_siblings() returns all of the siblings that follow the tag; find_next_sibling() returns only the first one.
find_previous_siblings(), find_previous_sibling()
find_previous_siblings() returns all of the siblings that precede the tag; find_previous_sibling() returns only the first one.
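A small sketch of the four sibling methods together (invented three-item list; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<ul><li>A</li><li>B</li><li>C</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

middle = soup.find_all('li')[1]  # the <li>B</li> tag
print(middle.find_next_sibling().get_text())      # C
print(middle.find_previous_sibling().get_text())  # A
print(len(middle.find_next_siblings()))           # 1 -- only C follows B
```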
find_all_next(), find_next()
find_all_next() returns every matching node after the tag in the document; find_next() returns the first one.
find_all_previous(), find_previous()
find_all_previous() returns every matching node before the tag in the document; find_previous() returns the first one.
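Unlike the sibling methods, these walk the whole document in parse order, so they cross tag boundaries. A minimal sketch (invented markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

h4 = soup.find('h4')
# <li> is not a sibling of <h4>, but it does come after it in the document
print(h4.find_next('li').get_text())                   # Foo
print([t.get_text() for t in h4.find_all_next('li')])  # ['Foo', 'Bar']

bar = soup.find_all('li')[1]
print(bar.find_previous('h4').get_text())              # Hello
```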
CSS selectors
Pass a CSS selector to select() to make a selection.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . means class; the space separates ancestor and descendant
print(soup.select('ul li'))                  # <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))       # '#' means id: class="element" elements under id="list-2"
print(type(soup.select('ul')[0]))            # the node type
```
Nested selection works the same way:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
```
Getting attributes

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # [ ] gets the attribute directly
    print(ul.attrs['id'])  # an equivalent spelling
```
Getting content

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
```

get_text() returns the element's text.
Summary
Use the lxml parser by default, falling back to html.parser when necessary.
Tag selectors filter weakly but run fast; use find() and find_all() to match a single result or many.
If you are comfortable with CSS selectors, select() is recommended.
Remember the common methods for reading attributes and text values.
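As a recap of that last point, the common access patterns side by side (invented one-item markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

ul = soup.ul
print(ul['id'])        # list-1 -- dict-style attribute access
print(ul.attrs['id'])  # list-1 -- equivalent, via the attrs dict
print(ul['class'])     # ['list'] -- class is multi-valued, so it comes back as a list
print(ul.li.string)    # Foo -- the direct string of a tag
print(ul.get_text())   # Foo -- all text in the subtree
```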