An Introduction to the Python Scraping Library BeautifulSoup, with Simple Usage Examples
1. Introduction
BeautifulSoup is a flexible, convenient web-page parsing library. It is efficient, supports several parsers, and lets you extract information from a page without writing regular expressions.
Parsers commonly used with Python
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | `BeautifulSoup(markup, "html.parser")` | Built into Python; reasonable speed; tolerant of malformed documents | Much less tolerant in Python versions before 2.7.3 / 3.2.2 |
| lxml HTML parser | `BeautifulSoup(markup, "lxml")` | Very fast; tolerant of malformed documents | Requires the lxml C library |
| lxml XML parser | `BeautifulSoup(markup, "xml")` | Very fast; the only parser that supports XML | Requires the lxml C library |
| html5lib | `BeautifulSoup(markup, "html5lib")` | Most tolerant; parses documents the way a browser does; produces valid HTML5 | Very slow; requires an external Python dependency |
2. Quick Start
Given an HTML document, create a BeautifulSoup object:
```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" rel="external nofollow" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" rel="external nofollow" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" rel="external nofollow" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'lxml')
```
Print the whole document, formatted:

```python
print(soup.prettify())
```

```
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1" rel="external nofollow">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2" rel="external nofollow">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3" rel="external nofollow">
    Tillie
   </a>
   ;
   and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
```
Browse the structured data:

```python
print(soup.title)              # the <title> tag and its contents
print(soup.title.name)         # the tag's name
print(soup.title.string)       # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)                  # the first <p>
print(soup.p['class'])         # the first <p>'s class
print(soup.a)                  # the first <a>
print(soup.find_all('a'))      # every <a>
print(soup.find(id="link3"))   # the tag whose id is 'link3'
```

```
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1" rel="external nofollow">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1" rel="external nofollow">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2" rel="external nofollow">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3" rel="external nofollow">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3" rel="external nofollow">Tillie</a>
```
```python
for link in soup.find_all('a'):
    print(link.get('href'))
```

```
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
```
Get all of the text content:

```python
print(soup.get_text())
```

```
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
```
Automatic tag completion and formatting

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.prettify())              # format the markup; unclosed tags are completed
print(soup.title.string)            # the contents of the <title> tag
```
Tag selectors
Selecting elements

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title)                   # select the <title> tag
print(type(soup.title))             # check its type
print(soup.head)
```
Getting a tag's name

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.title.name)
```
Getting a tag's attributes

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.attrs['name'])  # the value of the <p> tag's name attribute
print(soup.p['name'])        # another, more direct spelling
```
Getting a tag's content

```python
print(soup.p.string)
```
Nested tag selection

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.head.title.string)
```
Child and descendant nodes

```python
from bs4 import BeautifulSoup

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.contents)  # the tag's child nodes, as a list
```
Another approach, .children:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.children)              # an iterator over the tag's child nodes
for i, child in enumerate(soup.p.children):  # i is the index, child the node
    print(i, child)
```
The output is the same as above, with an index added. Note that you have to loop to see the children's contents, because .children is only an iterator, not a list.
Getting descendant nodes:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.p.descendants)           # an iterator over the tag's descendant nodes
for i, child in enumerate(soup.p.descendants):  # i is the index, child the node
    print(i, child)
```
Parent and ancestor nodes
parent

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # pass in the parser: lxml
print(soup.a.parent)                # the tag's parent node
```
parents

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')      # pass in the parser: lxml
print(list(enumerate(soup.a.parents)))  # the tag's ancestor nodes
```
Sibling nodes

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')                # pass in the parser: lxml
print(list(enumerate(soup.a.next_siblings)))      # siblings after the tag
print(list(enumerate(soup.a.previous_siblings)))  # siblings before the tag
```
Standard selectors
`find_all(name, attrs, recursive, text, **kwargs)`
Finds elements in the document by tag name, attributes, or text content.
name

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))           # every <ul> tag
print(type(soup.find_all('ul')[0]))  # check the element type
```
The next example finds all the li tags inside each ul tag:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
```
attrs (attributes)
Finding elements by attribute:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))      # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))
```
Both calls find the same content, because both attributes sit on the same tag.
Some attributes can be passed as keyword arguments directly:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))       # id is special and can be used directly
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_
```
text
Selecting by text content:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))  # nodes whose text is 'Foo' -- returns strings, not tags
```

So text is handy for matching content, but less convenient for locating the tags that contain it.
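One common workaround is to step from the matched string back up to its enclosing tag via .parent. A minimal sketch (invented two-item markup; using the stdlib html.parser so nothing extra needs installing):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# text= matches NavigableString objects, so the results are strings, not tags
matches = soup.find_all(text='Foo')
print(matches)           # ['Foo']

# to reach the tag that contains the text, step up with .parent
node = soup.find(text='Foo')
print(node.parent.name)  # li
```

In newer bs4 releases the same parameter is also spelled string=; text= still works.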
Methods
find
find works exactly like find_all, but returns only the first matching element instead of a list.
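The difference can be sketched like this (invented markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="element">Foo</li><li class="element">Bar</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('li')       # first match: a single Tag, or None on a miss
all_li = soup.find_all('li')  # every match: a list, possibly empty

print(first.get_text())    # Foo
print(len(all_li))         # 2
print(soup.find('table'))  # None -- a miss raises no exception
```

Because find() returns None on a miss, check the result before chaining attribute access on it.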
find_parents(), find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent.
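For example (invented nested markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<div id="outer"><ul id="list"><li id="item">Foo</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

li = soup.find('li')
print(li.find_parent().get('id'))  # list -- the direct parent (<ul>)
for p in li.find_parents():        # walks outward: ul, then div, then the document itself
    print(p.name)
```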
find_next_siblings(), find_next_sibling()
find_next_siblings() returns all of the siblings that follow the tag; find_next_sibling() returns only the first one.
find_previous_siblings(), find_previous_sibling()
find_previous_siblings() returns all of the siblings that precede the tag; find_previous_sibling() returns only the first one.
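A small sketch of the four sibling methods together (invented three-item list; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<ul><li>A</li><li>B</li><li>C</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

middle = soup.find_all('li')[1]  # the <li>B</li> tag
print(middle.find_next_sibling().get_text())      # C
print(middle.find_previous_sibling().get_text())  # A
print(len(middle.find_next_siblings()))           # 1 -- only C follows B
```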
find_all_next(), find_next()
find_all_next() returns every matching node after the tag in the document; find_next() returns the first one.
find_all_previous(), find_previous()
find_all_previous() returns every matching node before the tag in the document; find_previous() returns the first one.
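Unlike the sibling methods, these walk the whole document in parse order, so they cross tag boundaries. A minimal sketch (invented markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<div><h4>Hello</h4><ul><li>Foo</li><li>Bar</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

h4 = soup.find('h4')
# <li> is not a sibling of <h4>, but it does come after it in the document
print(h4.find_next('li').get_text())                   # Foo
print([t.get_text() for t in h4.find_all_next('li')])  # ['Foo', 'Bar']

bar = soup.find_all('li')[1]
print(bar.find_previous('h4').get_text())              # Hello
```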
CSS selectors
Pass a CSS selector to select() to make a selection.

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . means class; the space separates ancestor and descendant
print(soup.select('ul li'))                  # <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))       # '#' means id: class="element" elements under id="list-2"
print(type(soup.select('ul')[0]))            # the node type
```
Nested selection works the same way:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
```
Getting attributes

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])        # [ ] gets the attribute directly
    print(ul.attrs['id'])  # an equivalent spelling
```
Getting content

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
```

get_text() returns the element's text.
Summary
Use the lxml parser by default, falling back to html.parser when necessary.
Tag selectors filter weakly but run fast; use find() and find_all() to match a single result or many.
If you are comfortable with CSS selectors, select() is recommended.
Remember the common methods for reading attributes and text values.
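As a recap of that last point, the common access patterns side by side (invented one-item markup; stdlib html.parser):

```python
from bs4 import BeautifulSoup

html = '<ul class="list" id="list-1"><li class="element">Foo</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

ul = soup.ul
print(ul['id'])        # list-1 -- dict-style attribute access
print(ul.attrs['id'])  # list-1 -- equivalent, via the attrs dict
print(ul['class'])     # ['list'] -- class is multi-valued, so it comes back as a list
print(ul.li.string)    # Foo -- the direct string of a tag
print(ul.get_text())   # Foo -- all text in the subtree
```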