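Python Web Scraping: A Basic BeautifulSoup Tutorial
Installing bs4
The examples in this tutorial use the lxml parser, so install lxml first and then BeautifulSoup4. The library is published on PyPI as beautifulsoup4; pip install bs4 also works, because bs4 is a placeholder package that pulls in beautifulsoup4.
pip install lxml
pip install beautifulsoup4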
Usage:
from bs4 import BeautifulSoup
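A quick way to confirm that the installation worked is to import the package and parse a one-line document. The sketch below is only an illustration: it assumes the built-in 'html.parser' so that it also runs where lxml is not installed, and the variable names are made up for this example.

```python
import bs4
from bs4 import BeautifulSoup

# Parse a trivial document with the built-in parser (no lxml needed here).
soup = BeautifulSoup("<p>hello</p>", "html.parser")

installed_version = bs4.__version__
print("bs4 version:", installed_version)
print(soup.p.string)
```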
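Comparing lxml and bs4
# lxml: build a tree, then query it with XPath
from lxml import etree
tree = etree.HTML(html_doc)
tree.xpath('//a')  # pass an XPath expression
# bs4: build a soup object, then query it with find_all() and friends
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')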
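Note:
If you create the soup object without passing 'lxml' (or features="lxml"), BeautifulSoup has to guess which parser to use and emits a warning saying that no parser was explicitly specified (GuessedAtParserWarning in recent versions).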
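The warning can be observed with the standard warnings module. This is a minimal sketch, assuming the built-in 'html.parser' for the explicit case so that lxml is not required; the names snippet, guessed and explicit are illustrative only.

```python
import warnings
from bs4 import BeautifulSoup

snippet = "<p>hello</p>"

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")          # record every warning, even repeats
    BeautifulSoup(snippet)                   # no parser named: bs4 has to guess
    guessed = len(caught)
    BeautifulSoup(snippet, "html.parser")    # parser named explicitly
    explicit = len(caught) - guessed

print("warnings without a parser:", guessed)
print("warnings with a parser:", explicit)
```

Naming the parser also keeps results reproducible, since the parser that gets guessed depends on what happens to be installed on the machine.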
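Quick start with bs4
Parser comparison (for reference only)
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python; moderate speed | Less tolerant of malformed documents (in Python versions before 2.7.3 or 3.2.2) |
lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast; tolerant of malformed documents | Requires a C library to be installed |
lxml XML parser | BeautifulSoup(markup, 'lxml-xml') or BeautifulSoup(markup, 'xml') | Fast; the only supported XML parser | Requires a C library to be installed |
html5lib | BeautifulSoup(markup, 'html5lib') | Most tolerant; parses documents the way a browser does; produces HTML5-format documents | Slow; requires an external Python dependency |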
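The difference in fault tolerance is easiest to see on deliberately broken markup. The sketch below feeds the same invalid fragment to each parser from the table and records what comes back. It assumes nothing beyond bs4 itself: parsers that are not installed are skipped by catching FeatureNotFound, and the results dictionary is a name invented for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"   # a stray closing tag with no matching <p>
results = {}

for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None   # this parser is not installed

for parser, output in results.items():
    print(f"{parser:12} -> {output}")
```

Each parser repairs the fragment in its own way, which is why scraping code that relies on the exact tree shape should always name its parser.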
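Object types
Tag: a tag
BeautifulSoup: the bs object, which represents the whole document
NavigableString: a navigable string (the text inside a tag)
Comment: a comment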
from bs4 import BeautifulSoup

# A string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
<span><!--example comment text--></span>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print(type(soup.title))         # <class 'bs4.element.Tag'>
print(type(soup))               # <class 'bs4.BeautifulSoup'>
print(type(soup.title.string))  # <class 'bs4.element.NavigableString'>
print(type(soup.span.string))   # <class 'bs4.element.Comment'>
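Because Comment is a subclass of NavigableString, comment text can slip into scraped output unless it is filtered out. The following sketch separates the two with isinstance checks. It assumes the built-in 'html.parser' and a made-up one-line document; visible and comments are illustrative names.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

soup = BeautifulSoup("<p>Hello<!-- hidden note -->world</p>", "html.parser")

visible, comments = [], []
for node in soup.p.children:
    # Test for Comment first: every Comment is also a NavigableString.
    if isinstance(node, Comment):
        comments.append(str(node))
    elif isinstance(node, NavigableString):
        visible.append(str(node))

print(visible)
print(comments)
```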
Basic usage of bs4
Getting tag contents
from bs4 import BeautifulSoup

# A string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag:\n', soup.head)  # print the head tag
print('body tag:\n', soup.body)  # print the body tag
print('html tag:\n', soup.html)  # print the html tag
print('p tag:\n', soup.p)        # print the p tag (first match only)
Note: printing soup.p shows only the first p tag. To get every p tag, use find_all:
print('p tags:\n', soup.find_all('p'))
Here the tag name is passed to find_all as a string, i.e. 'p' in quotes.
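The relationship between attribute access, find and find_all can be checked on a small document. This is a sketch under simple assumptions: a made-up three-paragraph fragment and the built-in 'html.parser'. The limit argument caps the number of matches that find_all returns.

```python
from bs4 import BeautifulSoup

html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")

first = soup.find("p")                    # first match, same element as soup.p
all_p = soup.find_all("p")                # list-like result with every match
first_two = soup.find_all("p", limit=2)   # stop after two matches
missing = soup.find("table")              # find returns None when nothing matches

texts = [str(p.string) for p in all_p]
print(first, texts, len(first_two), missing)
```

Since find returns None on a miss while find_all returns an empty list, code that chains further lookups should check the result of find before using it.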
Getting tag names
Use the name attribute to get a tag's name
from bs4 import BeautifulSoup

# A string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
print('head tag name:\n', soup.head.name)  # print the name of the head tag
print('body tag name:\n', soup.body.name)  # print the name of the body tag
print('html tag name:\n', soup.html.name)  # print the name of the html tag
# find_all returns a list of tags, so read name from each element, not from the list
print('p tag names:\n', [p.name for p in soup.find_all('p')])
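To find the contents of two different tags, pass a list filter rather than separate string arguments.
Passing two strings returns an empty list, because the second positional argument is treated as a CSS class: the call below looks for title tags whose class is "p".
print(soup.find_all('title', 'p'))
[]
Use a list filter to get the contents of several tags:
print(soup.find_all(['title', 'p']))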
[<title>The Dormouse's story</title>, <p class="title"><b>The Dormouse's story</b></p>, <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>, <p class="story">...</p>]
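The filter types can be compared side by side on a smaller document. The sketch below assumes a made-up page and the built-in 'html.parser'. It shows a list filter, a tag name combined with a class, the two-string call that matches nothing, and a regular-expression filter on tag names.

```python
import re
from bs4 import BeautifulSoup

html = """
<html><head><title>Demo</title></head>
<body>
<p class="title"><b>Heading</b></p>
<p class="story">First</p>
<p class="story">Second</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# List filter: any tag whose name is in the list
by_list = [t.name for t in soup.find_all(["title", "p"])]
# Name plus second positional argument: p tags with class "story"
by_class = [str(t.string) for t in soup.find_all("p", "story")]
# Same call shape as find_all('title', 'p'): title tags with class "p"
no_match = soup.find_all("title", "p")
# Regular expression: tag names that start with "b"
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

print(by_list, by_class, no_match, by_regex)
```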
Getting the href attribute of a tags
from bs4 import BeautifulSoup

# A string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# Create the soup object
soup = BeautifulSoup(html_doc, 'lxml')
a_list = soup.find_all('a')
# Iterate over the list and read the attribute value
for a in a_list:
    # Method 1: get() returns None if the attribute does not exist
    print(a.get('href'))
    # Method 2: attrs holds all attributes as a dict; index it for the one you want
    print(a.attrs['href'])
    # Method 3: index the tag directly; a missing attribute raises KeyError
    print(a['href'])
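The three access styles differ mainly in how they treat a missing attribute. The following sketch uses a made-up pair of links, one of them without an href, and the built-in 'html.parser'. It also shows that class is returned as a list, because an element can carry several classes.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister external" href="http://example.com/elsie">Elsie</a>'
    '<a id="nolink">Lacie</a>'
)
soup = BeautifulSoup(html, "html.parser")
with_href, without_href = soup.find_all("a")

# get() accepts a default value for missing attributes
hrefs = [a.get("href", "N/A") for a in (with_href, without_href)]
# has_attr() tests for presence without reading the value
flags = [a.has_attr("href") for a in (with_href, without_href)]
# class is multi-valued, so it comes back as a list
classes = with_href["class"]

try:
    without_href["href"]        # direct indexing on a missing attribute
    raised = False
except KeyError:
    raised = True

print(hrefs, flags, classes, raised)
```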
Extra: prettify() pretty-prints the document, which makes the node hierarchy more obvious and easier to analyze
print(soup.prettify())
For comparison, the output of print(soup) without prettify():
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
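The effect of prettify() is easier to see on a fragment small enough to read at a glance. This is a minimal sketch with a made-up one-line document and the built-in 'html.parser'. prettify() puts each tag and each string on its own line, indented by nesting depth.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>Hello <b>world</b></p></div>", "html.parser")

compact = str(soup)        # markup on a single line, as parsed
pretty = soup.prettify()   # one node per line, indented by depth

print(compact)
print(pretty)
```

Note that prettify() changes whitespace inside the document, so it is meant for reading the structure, not for producing markup to serve back to a browser.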
Traversing the document tree
from bs4 import BeautifulSoup

# A string of sample HTML
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

soup = BeautifulSoup(html_doc, 'lxml')
head = soup.head
# contents returns a list of all direct children: [<title>The Dormouse's story</title>]
print(head.contents)
# children returns an iterator over the direct children: <list_iterator object at 0x00000264BADC2748>
print(head.children)
# Any iterator can be looped over
for h in head.children:
    print(h)

html = soup.html
# Newlines are matched as child nodes too
# descendants returns a generator that walks every descendant: <generator object Tag.descendants at 0x0000018C15BFF4C8>
print(html.descendants)
# Any generator can be looped over
for h in html.descendants:
    print(h)

'''
Key points to master:
string            gets the content inside a tag
strings           returns a generator that yields the content of multiple tags
stripped_strings  basically the same as strings, but strips surrounding whitespace and skips whitespace-only strings
'''
print(soup.title.string)
# None, because the html tag has more than one child
print(soup.html.string)
# Returns a generator object: <generator object Tag._all_strings at 0x000001AAFF9EF4C8>
# soup.html.strings yields all the text contained in the html tag
print(soup.html.strings)
for h in soup.html.strings:
    print(h)
# stripped_strings removes the extra whitespace
# Returns a generator object: <generator object PageElement.stripped_strings at 0x000001E31284F4C8>
print(soup.html.stripped_strings)
for h in soup.html.stripped_strings:
    print(h)

'''
parent   gets the direct parent node
parents  gets all ancestor nodes
'''
title = soup.title
# parent finds the direct parent node
print(title.parent)
# parents gets all ancestor nodes
# Returns a generator object: <generator object PageElement.parents at 0x000001F02049F4C8>
print(title.parents)
for p in title.parents:
    print(p)
# The parent of html is the whole document
print(soup.html.parent)
# <class 'bs4.BeautifulSoup'>
print(type(soup.html.parent))
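A common stumbling block with the properties above is that string returns None as soon as a tag has more than one child. The sketch below contrasts string, strings, stripped_strings and get_text on a made-up two-paragraph fragment, using the built-in 'html.parser'. It also lists the ancestors of a tag by name.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p> One </p><p>Two</p></div>", "html.parser")
div = soup.div

single = div.p.string                    # the p tag has one child, so this is its text
multiple = div.string                    # the div has two children, so this is None
raw = list(div.strings)                  # every string, whitespace kept
clean = list(div.stripped_strings)       # every string, whitespace stripped
joined = div.get_text(" ", strip=True)   # stripped strings joined by a separator
ancestors = [p.name for p in div.p.parents]

print(repr(single), multiple, raw, clean, joined, ancestors)
```

With 'html.parser' no html or body wrapper is added to a fragment, so the ancestor chain here ends at the soup object itself, whose name is '[document]'.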
Practice example
Get all position names
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
<tbody>
<tr class="h">
<td class="l" width="374">Position name</td>
<td>Category</td> <td>Headcount</td> <td>Location</td> <td>Published</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-Financial Cloud Senior Blockchain R&amp;D Engineer (Shenzhen)</a></td>
<td>Technical</td> <td>1</td> <td>Shenzhen</td> <td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-Financial Cloud Senior Backend Developer</a></td>
<td>Technical</td> <td>2</td> <td>Shenzhen</td> <td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-Tencent Music Operations Development Engineer (Shenzhen)</a></td>
<td>Technical</td> <td>2</td> <td>Shenzhen</td> <td>2017-11-25</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-Tencent Music Business Operations and Maintenance Engineer (Shenzhen)</a></td>
<td>Technical</td> <td>1</td> <td>Shenzhen</td> <td>2017-11-25</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-Senior R&amp;D Engineer (Shenzhen)</a></td>
<td>Technical</td> <td>1</td> <td>Shenzhen</td> <td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-Senior Image Algorithm R&amp;D Engineer (Shenzhen)</a></td>
<td>Technical</td> <td>1</td> <td>Shenzhen</td> <td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-Senior AI Development Engineer (Shenzhen)</a></td>
<td>Technical</td> <td>4</td> <td>Shenzhen</td> <td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-Backend Development Engineer</a></td>
<td>Technical</td> <td>1</td> <td>Shenzhen</td> <td>2017-11-24</td>
</tr>
<tr class="even">
<td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-Backend Development Engineer</a></td>
<td>Technical</td> <td>1</td> <td>Shenzhen</td> <td>2017-11-24</td>
</tr>
<tr class="odd">
<td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-Senior Business Operations and Maintenance Engineer (Shenzhen)</a></td>
<td>Technical</td> <td>1</td> <td>Shenzhen</td> <td>2017-11-24</td>
</tr>
</tbody>
</table>
"""
Approach
The data we want sits in the a tag inside each tr node. So we only need to iterate over all the tr nodes and take the text of the a tag from each one.
Implementation
from bs4 import BeautifulSoup

# html is the table markup defined above

# Create the soup object
soup = BeautifulSoup(html, 'lxml')
# Use find_all() to find every tr node (the first tr is the table header, so skip it)
tr_list = soup.find_all('tr')[1:]
# Iterate over tr_list and print the text of the a tag in each row
for tr in tr_list:
    a_list = tr.find_all('a')
    print(a_list[0].string)
The output is as follows:
22989-Financial Cloud Senior Blockchain R&D Engineer (Shenzhen)
22989-Financial Cloud Senior Backend Developer
SNG16-Tencent Music Operations Development Engineer (Shenzhen)
SNG16-Tencent Music Business Operations and Maintenance Engineer (Shenzhen)
TEG03-Senior R&D Engineer (Shenzhen)
TEG03-Senior Image Algorithm R&D Engineer (Shenzhen)
TEG11-Senior AI Development Engineer (Shenzhen)
15851-Backend Development Engineer
15851-Backend Development Engineer
SNG11-Senior Business Operations and Maintenance Engineer (Shenzhen)
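The same row-by-row approach extends naturally from one column to the whole table. The sketch below is one possible variation, not part of the original exercise: it uses a shortened, made-up table and the built-in 'html.parser', takes the column names from the header row, and builds one dictionary per data row, adding the link target under a hypothetical "href" key.

```python
from bs4 import BeautifulSoup

sample = """
<table>
<tr class="h"><td>Position</td><td>Category</td><td>Headcount</td></tr>
<tr class="even"><td><a href="detail.php?id=1">Backend Engineer</a></td><td>Technical</td><td>2</td></tr>
<tr class="odd"><td><a href="detail.php?id=2">Data Engineer</a></td><td>Technical</td><td>1</td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

# The first row is the header; the remaining rows hold the data
header_row, *data_rows = soup.find_all("tr")
headers = [td.get_text(strip=True) for td in header_row.find_all("td")]

jobs = []
for tr in data_rows:
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    record = dict(zip(headers, cells))   # pair each column name with its cell
    record["href"] = tr.a["href"]        # link of the first a tag in the row
    jobs.append(record)

for job in jobs:
    print(job)
```

Keeping each row as a dictionary makes it straightforward to hand the result to csv.DictWriter or json.dumps later.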