An Introduction to the Python Scraping Library BeautifulSoup, with Simple Usage Examples

 Updated: 2020-01-25 15:29:20   Author: BQW_
BeautifulSoup is a Python library for extracting data from HTML or XML files. This article introduces it through simple examples that cover parsing HTML, getting content, navigating nodes, and using CSS selectors.

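1. Introduction

BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers, and it lets you extract information from a page without writing regular expressions.

Parsers commonly used with BeautifulSoup

Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires an external C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only XML parser supported | Requires an external C library
html5lib | BeautifulSoup(markup, "html5lib") | Most tolerant; parses documents the way a browser does; produces valid HTML5 | Slow; requires an external Python package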
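Because lxml and html5lib are optional installs, a script can fall back to the built-in parser when they are missing. The following is a minimal sketch, not part of the original article: the helper name make_soup and its parsers argument are hypothetical, and it relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, parsers=("lxml", "html.parser")):
    """Parse markup with the first parser in `parsers` that is installed.

    `make_soup` is a hypothetical helper written for this sketch.
    """
    for name in parsers:
        try:
            return BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Deliberately unclosed tags: both parsers close them automatically.
soup = make_soup("<p>Hello<b>world")
print(soup.p.get_text())
print(soup.b.string)
```

Different parsers can build slightly different trees from the same broken markup, so it may be worth pinning one parser once a script depends on the exact structure.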

2. Quick start

Given an HTML document, create a BeautifulSoup object

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

Print the whole document

print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Navigating the structured data

print(soup.title)  # the <title> tag and its contents
print(soup.title.name)  # the name of the <title> tag
print(soup.title.string)  # the string inside <title>
print(soup.title.parent.name)  # the name of <title>'s parent tag (head)
print(soup.p)  # the first <p> tag
print(soup.p['class'])  # the class of the first <p> tag
print(soup.a)  # the first <a> tag
print(soup.find_all('a'))  # all <a> tags
print(soup.find(id="link3"))  # the first tag with id="link3"
<title>The Dormouse's story</title>
title
The Dormouse's story
head
<p class="title"><b>The Dormouse's story</b></p>
['title']
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Extract the links from all <a> tags

for link in soup.find_all('a'):
  print(link.get('href'))
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Get all the text content

print(soup.get_text())
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Completing missing tags and formatting the output

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.prettify())  # formatted output; the missing </body> and </html> are added
print(soup.title.string)  # the text inside the <title> tag

Tag selectors

Selecting elements

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.title)  # select the <title> tag
print(type(soup.title))  # check its type
print(soup.head)

Getting the tag name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.title.name)

Getting tag attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.p.attrs['name'])  # the value of the name attribute on the first <p> tag
print(soup.p['name'])  # a shorter, more direct form

Getting the tag content

print(soup.p.string)
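One detail worth knowing about .string: it only returns text when a tag has a single child (or a single chain of nested children). When a tag holds several children it returns None, and get_text() is the usual alternative. A minimal sketch, using a shortened document made up for illustration and html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

# Shortened, made-up document for illustration.
html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once <a href="#">Elsie</a> and more</p>'
)
soup = BeautifulSoup(html, "html.parser")

first, second = soup.find_all("p")
print(first.string)        # <p> -> <b> -> one string, so the text is returned
print(second.string)       # several children, so this prints None
print(second.get_text())   # joins every nested string
```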

Nested tag selection

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.head.title.string)

Child and descendant nodes

html = """
<html>
  <head>
    <title>The Dormouse's story</title>
  </head>
  <body>
    <p class="story">
      Once upon a time there were three little sisters; and their names were
      <a href="http://example.com/elsie" class="sister" id="link1">
        <span>Elsie</span>
      </a>
      <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
      and
      <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
      and they lived at the bottom of a well.
    </p>
    <p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.p.contents)  # the direct children of the tag, as a list

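Another approach, children:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.p.children)  # an iterator over the direct children of the tag
for i, child in enumerate(soup.p.children):  # i is the index, child is the node
    print(i, child)

The nodes are the same as those returned by contents, with an index added. Note that children returns an iterator rather than a list, so you have to loop over it (or pass it to list()) to see the child nodes.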
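The relationship between contents and children can be checked directly. The sketch below uses a small made-up fragment and html.parser so that it runs without lxml; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<p>one<a>two</a>three<span>four</span></p>", "html.parser")

# Materialise the iterator into a list to inspect it.
as_list = list(soup.p.children)
print(as_list == soup.p.contents)  # same nodes, same order

# Text nodes are children too; filter on Tag to keep only the elements.
tag_names = [child.name for child in soup.p.children if isinstance(child, Tag)]
print(tag_names)
```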

Getting descendant nodes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.p.descendants)  # an iterator over all descendants of the tag
for i, child in enumerate(soup.p.descendants):  # i is the index, child is the node
    print(i, child)

Parent and ancestor nodes

parent

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(soup.a.parent)  # the parent of the first <a> tag

parents

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(list(enumerate(soup.a.parents)))  # all ancestors of the first <a> tag

Sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')  # pass the parser name: lxml
print(list(enumerate(soup.a.next_siblings)))  # the siblings after the first <a> tag
print(list(enumerate(soup.a.previous_siblings)))  # the siblings before the first <a> tag

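Standard selectors

find_all( name , attrs , recursive , text , **kwargs )

Searches the document by tag name, attributes, or text content. In Beautiful Soup 4.4.0 and later, the text argument is also available under the name string.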

name

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))  # find all <ul> tags
print(type(soup.find_all('ul')[0]))  # check the type of one result

The next example finds the <li> tags inside every <ul> tag:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
  print(ul.find_all('li'))
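The recursive argument in the signature controls how deep the search goes, and find_all() also accepts a limit argument that stops after a given number of matches. The sketch below is a hedged illustration: the document is made up for the example, and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up document: one <ul> directly under #outer, one nested deeper.
html = """
<div id="outer">
  <ul id="top"><li>A</li><li>B</li><li>C</li></ul>
  <div><ul id="nested"><li>D</li></ul></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.find(id="outer")

all_uls = outer.find_all("ul")                      # searches every descendant
direct_uls = outer.find_all("ul", recursive=False)  # direct children only
first_two = soup.find_all("li", limit=2)            # stop after two matches

print([ul["id"] for ul in all_uls])
print([ul["id"] for ul in direct_uls])
print([li.get_text() for li in first_two])
```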

attrs

Find elements by their attributes

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1" name="elements">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''


from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))  # pass a dict of the attributes to match
print(soup.find_all(attrs={'name': 'elements'}))

Both calls return the same element, because the two attributes belong to the same tag.

Shortcut keyword arguments:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))  # id can be passed directly as a keyword argument
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_

text

Select by text content:

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
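print(soup.find_all(text='Foo'))  # finds strings equal to "Foo"; the results are strings, not tags

So text is convenient for checking whether some content is present, but less convenient for locating the elements that contain it, because the matching strings are returned rather than their tags.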
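When the enclosing tags are what you need, there are two common ways to get them: follow .parent from each matching string, or combine a tag name with the string filter. This is a minimal sketch on a made-up fragment; it uses string, the name the text argument has had since Beautiful Soup 4.4.0, and html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<ul><li class="element">Foo</li>'
    '<li class="element">Bar</li>'
    '<li class="element">Foo</li></ul>'
)
soup = BeautifulSoup(html, "html.parser")

strings = soup.find_all(string="Foo")            # matching text nodes
tags_via_parent = [s.parent for s in strings]    # step up to the enclosing tag
tags_direct = soup.find_all("li", string="Foo")  # <li> tags whose .string is "Foo"

print(list(strings))
print(tags_via_parent)
print(tags_direct)
```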

Methods

find

find() takes the same arguments as find_all(), but returns only the first matching element instead of a list, or None when nothing matches.
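The difference in return values matters in practice, because calling a method on a missing result raises an error. A minimal sketch on a made-up fragment, using html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

html = '<ul id="list-1"><li>Foo</li><li>Bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_li = soup.find("li")         # a single Tag: the first match
all_li = soup.find_all("li")       # a list-like ResultSet of every match
no_table = soup.find("table")      # None, because there is no <table>
no_tables = soup.find_all("table") # an empty result, never None

# Guard against None before using the result of find().
table_text = no_table.get_text() if no_table is not None else ""

print(first_li.get_text(), len(all_li), no_table, len(no_tables))
```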

find_parents(), find_parent()

find_parents() returns all matching ancestor nodes; find_parent() returns the nearest matching one, by default the direct parent.

find_next_siblings(), find_next_sibling()

find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

find_previous_siblings(), find_previous_sibling()

find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

find_all_next(), find_next()

find_all_next() returns every matching node after the current node; find_next() returns the first matching node after it.

find_all_previous(), find_previous()

find_all_previous() returns every matching node before the current node; find_previous() returns the first matching node before it.
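These methods are easiest to compare when they all start from the same node. The sketch below starts from the "Bar" item in a made-up fragment and calls each family once; html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="panel-body">'
    '<ul id="list-1"><li>Foo</li><li>Bar</li><li>Jay</li></ul>'
    '<ul id="list-2"><li>Baz</li></ul>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
bar = soup.find("li", string="Bar")  # starting point for every call below

parent_id = bar.find_parent("ul")["id"]               # nearest <ul> ancestor
div_ancestors = bar.find_parents("div")               # every <div> ancestor
next_sib = bar.find_next_sibling("li").get_text()     # first sibling after
prev_sib = bar.find_previous_sibling("li").get_text() # first sibling before
later_items = [li.get_text() for li in bar.find_all_next("li")]        # document order
earlier_items = [li.get_text() for li in bar.find_all_previous("li")]
next_list_id = bar.find_next("ul")["id"]              # first <ul> after this node

print(parent_id, len(div_ancestors), next_sib, prev_sib)
print(later_items, earlier_items, next_list_id)
```

Note that the sibling methods stay inside the same parent, while find_next() and find_all_next() keep going through the rest of the document, which is why "Baz" from the second list shows up in later_items.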

CSS selectors

Pass a CSS selector string directly to select() to make the selection.

html='''
<div class="panel">
  <div class="panel-heading">
    <h4>Hello</h4>
  </div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
      <li class="element">Jay</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
  </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))  # . denotes a class; the space means "descendant of"
print(soup.select('ul li'))  # <li> tags inside <ul> tags
print(soup.select('#list-2 .element'))  # '#' denotes an id: elements with class "element" inside the tag whose id is "list-2"
print(type(soup.select('ul')[0]))  # print the node type

Selections can also be nested, one level at a time:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
	print(ul.select('li'))
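Besides select(), Beautiful Soup also offers select_one(), which returns only the first match (or None), and the selector string can use other standard CSS forms such as the child combinator. A minimal sketch on a made-up fragment, using html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

html = """
<div class="panel-body">
  <ul class="list" id="list-1"><li class="element">Foo</li><li class="element">Bar</li></ul>
  <ul class="list list-small" id="list-2"><li class="element">Baz</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.select_one("#list-1 .element")  # first match only
missing = soup.select_one("#list-3 .element")     # None: no such id
direct_items = soup.select("#list-1 > li")        # '>' means direct child
small_lists = soup.select("ul.list.list-small")   # tag carrying both classes

print(first_item.get_text(), missing)
print([li.get_text() for li in direct_items])
print([ul["id"] for ul in small_lists])
```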

Getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
  print(ul['id'])  # use [ ] to read an attribute
  print(ul.attrs['id'])  # an equivalent form
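Square brackets raise KeyError when the attribute is absent, so for optional attributes the get() method is often safer. The sketch below also shows that class comes back as a list, since it can hold several values. The fragment is made up for illustration and html.parser is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<ul class="list list-small" id="list-2"><li>Foo</li></ul>', "html.parser"
)
ul = soup.ul

list_id = ul["id"]                          # a plain string
classes = ul["class"]                       # multi-valued attribute: a list
name_or_none = ul.get("name")               # missing attribute gives None
name_or_default = ul.get("name", "unnamed") # or a default you choose

try:
    ul["name"]                              # square brackets raise on a missing key
    raised = False
except KeyError:
    raised = True

print(list_id, classes, name_or_none, name_or_default, raised)
```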

Getting content

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
  print(li.get_text())

The get_text() method returns the text content of each element.
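get_text() also takes a separator and a strip flag, which help when the source markup is full of indentation and line breaks. A minimal sketch on a made-up fragment, using html.parser so that it runs without lxml:

```python
from bs4 import BeautifulSoup

html = "<ul>\n  <li> Foo </li>\n  <li>Bar</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")

raw = soup.li.get_text()                            # whitespace kept as written
clean = soup.li.get_text(strip=True)                # leading/trailing space removed
joined = soup.get_text(" ", strip=True)             # strip each piece, join with a space
piped = soup.get_text(separator="|", strip=True)    # any separator string works

print(repr(raw), repr(clean), repr(joined), repr(piped))
```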

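Summary

lxml is the recommended parser; fall back to html.parser when necessary.

Attribute-style tag selection (such as soup.title) is fast but offers weak filtering. Use find() and find_all() to match a single result or multiple results.

If you are comfortable with CSS selectors, use select().

Remember the common ways to get attribute values and text.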
