A Detailed Guide to find and find_all in BeautifulSoup

Updated: December 7, 2020 10:13:25  Author: OCISLU

This article explains how to use find and find_all in BeautifulSoup, working through example code step by step. It should be a useful reference for study or work.

How to use find and find_all in BeautifulSoup, a handy tool for web scraping

Without further ado, here is the sample HTML:

<html>
  <head>
    <title>
      index
    </title>
  </head>
  <body>
     <div>
        <ul>
          <li id="flask" class="item-0"><a href="link1.html">first item</a></li>
          <li class="item-1"><a href="link2.html">second item</a></li>
          <li class="item-inactie"><a href="link3.html">third item</a></li>
          <li class="item-1"><a href="link4.html">fourth item</a></li>
          <li class="item-0"><a href="link5.html">fifth item</a>
         </ul>
     </div>
    <li> hello world </li>
  </body>
</html>

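Before using BeautifulSoup, you need to create a BeautifulSoup instance:

# Import BeautifulSoup and build an instance
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# The first argument is the markup to parse (the HTML above, stored in html)
# The second argument is the parser that BeautifulSoup should use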

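Note that a third-party parser must be installed before it can be used; lxml, used here, was installed in advance. The parsers that BeautifulSoup supports are listed in its documentation.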
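Because the parser is just the second argument, switching parsers is a one-word change. The sketch below is an illustration rather than part of the original walkthrough: it tries lxml first and falls back to Python's built-in html.parser when lxml is not installed. The helper name make_soup and the short markup string are assumptions made for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml if available, else with the built-in html.parser.

    make_soup is a hypothetical helper for this sketch, not a bs4 function.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


sample = "<html><head><title>index</title></head><body><p>hi</p></body></html>"
soup = make_soup(sample)
title_text = soup.title.string
print(title_text)
```

Different parsers can build slightly different trees from malformed markup, so results on broken HTML may vary with the parser that ends up being used.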


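Next, an introduction to find and find_all.

1. find
Returns only the first matching object.
Syntax:

find(name, attrs, recursive, text, **kwargs)
# recursive: if True (the default), search all descendants; if False, search only direct children
# text: in Beautiful Soup 4.4.0 and later this argument is also available under the name string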

The find method in BeautifulSoup

Parameters:

Parameter	Purpose
name	tag name to search for
text	text content to search for
attrs	attribute values to match, passed as a dict
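The recursive and text parameters are not exercised elsewhere in this article, so here is a minimal sketch of both. It assumes a trimmed-down version of the sample HTML and uses Python's built-in html.parser so that it runs without lxml; the variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

html = """
<body>
<div><ul><li id="flask" class="item-0"><a href="link1.html">first item</a></li></ul></div>
<li> hello world </li>
</body>
"""
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# Default (recursive=True): every descendant is searched, so the nested <li> is found first
nested = body.find("li")

# recursive=False: only direct children of <body> are considered
direct = body.find("li", recursive=False)

# string= (named text= before Beautiful Soup 4.4.0) matches text nodes, not tags
text_node = soup.find(string="first item")

print(nested["id"])             # flask
print(repr(direct.get_text()))  # ' hello world '
print(text_node.parent.name)    # a
```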

Example:

# find returns a single match
li = soup.find('li')
print('find_li:', li)
print('li.text (text content of the tag):', li.text)
print('li.attrs (attributes of the tag):', li.attrs)
print('li.string (content of the tag as a string):', li.string)

Output:

find_li: <li class="item-0" id="flask"><a href="link1.html">first item</a></li>
li.text (text content of the tag): first item
li.attrs (attributes of the tag): {'id': 'flask', 'class': ['item-0']}
li.string (content of the tag as a string): first item
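In the output above, .text and .string happen to agree, but they can differ, and find itself can return None. The following sketch, which assumes a small two-item list rather than the full sample page, shows both cases.

```python
from bs4 import BeautifulSoup

html = '<ul><li class="item-0"><a href="link1.html">first item</a></li><li>second item</li></ul>'
soup = BeautifulSoup(html, "html.parser")

li = soup.find("li")
ul = soup.find("ul")

single = li.string      # <li> holds one <a>, which holds one string, so that string is returned
several = ul.string     # <ul> has two children, so .string is None
joined = ul.get_text()  # .text / get_text() concatenates every descendant string

# find returns None when nothing matches, so guard before touching attributes
missing = soup.find("table")
label = missing.get_text() if missing is not None else "(not found)"

print(single, several, joined, label)
```

Checking for None before calling .text or .attrs avoids the AttributeError that would otherwise be raised when an element is absent from the page.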

find can also match using attribute=value keyword arguments:

li = soup.find(id='flask')
print(li, '\n')

Output:

<li class="item-0" id="flask"><a href="link1.html">first item</a></li>
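The same keyword style works for other attributes, and the value does not have to be a plain string. As a hedged sketch on a shortened copy of the sample list, the example below matches an exact href, a regular expression, and the mere presence of an attribute (True).

```python
import re

from bs4 import BeautifulSoup

html = """
<ul>
  <li id="flask" class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact attribute value
exact = soup.find(href="link2.html")

# Compiled regular expression: matched against the attribute value with search()
pattern = soup.find("a", href=re.compile(r"link[34]\.html"))

# True matches any tag that has the attribute at all
has_id = soup.find(id=True)

print(exact.get_text(), "|", pattern.get_text(), "|", has_id.name)
```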

Note that because class is a reserved keyword in Python, matching a tag's class attribute requires one of the following two approaches:

Pass a dict through the attrs parameter
Use class_, the special keyword argument that BeautifulSoup provides

# Option 1: pass a dict through the attrs parameter
find_class = soup.find(attrs={'class': 'item-1'})
print('findclass:', find_class, '\n')
# Option 2: BeautifulSoup's special keyword argument class_
beautifulsoup_class_ = soup.find(class_='item-1')
print('BeautifulSoup_class_:', beautifulsoup_class_, '\n')

Output:

findclass: <li class="item-1"><a href="link2.html">second item</a></li>

BeautifulSoup_class_: <li class="item-1"><a href="link2.html">second item</a></li>
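One detail worth knowing is that class is a multi-valued attribute, which is why attrs shows it as a list. The sketch below assumes a made-up p tag with two classes (not part of the sample page) to show how class_ behaves in that case.

```python
from bs4 import BeautifulSoup

html = '<p class="body strikeout">old</p><li class="item-1">second item</li>'
soup = BeautifulSoup(html, "html.parser")

# Any single class in the list is enough for a match
by_one_class = soup.find(class_="strikeout")

# The exact attribute string also matches...
by_exact_string = soup.find(class_="body strikeout")

# ...but the same classes in a different order do not
wrong_order = soup.find(class_="strikeout body")

# class_ can be combined with a tag name
tag_and_class = soup.find("li", class_="item-1")

print(by_one_class.name, by_exact_string.name, wrong_order, tag_and_class.get_text())
```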

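2. find_all

Returns every match, unlike find, which returns only the first result.

Syntax:

find_all(name, attrs, recursive, text, limit, **kwargs)

The find_all method in BeautifulSoup

Parameter	Purpose
name	tag name to search for
text	text content to search for
attrs	attribute values to match, passed as a dict

The syntax is the same as find, with one extra parameter, limit, which caps the number of results returned.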
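As a small illustration of the limit parameter, and of passing a list of tag names as name, consider the sketch below. It assumes a three-item version of the sample list and the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# limit stops the search once that many matches have been found
first_two = soup.find_all("li", limit=2)

# A list of names matches any tag whose name is in the list
links_and_items = soup.find_all(["a", "li"])

# With no match, find_all returns an empty list, whereas find returns None
no_match = soup.find_all("table")

print(len(first_two), len(links_and_items), len(no_match))
```

Since find_all always returns a list, it is safe to loop over the result even when nothing matched.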

Here is the code:

# find_all returns every match
li_all = soup.find_all('li')
for li in li_all:
    print('---')
    print('matched li:', li)
    print('li text:', li.text)
    print('li attributes:', li.attrs)

Output:

---
matched li: <li class="item-0" id="flask"><a href="link1.html">first item</a></li>
li text: first item
li attributes: {'id': 'flask', 'class': ['item-0']}
---
matched li: <li class="item-1"><a href="link2.html">second item</a></li>
li text: second item
li attributes: {'class': ['item-1']}
---
matched li: <li class="item-inactie"><a href="link3.html">third item</a></li>
li text: third item
li attributes: {'class': ['item-inactie']}
---
matched li: <li class="item-1"><a href="link4.html">fourth item</a></li>
li text: fourth item
li attributes: {'class': ['item-1']}
---
matched li: <li class="item-0"><a href="link5.html">fifth item</a>
</li>
li text: fifth item
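The listing above stops after the fifth item, but find_all('li') also matches the standalone li that sits outside the ul. The sketch below assumes a shortened copy of the sample HTML (including the unclosed last list item) and the built-in html.parser; it counts the matches and collects the link targets.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <ul>
    <li id="flask" class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
  </ul>
</div>
<li> hello world </li>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")

# strip=True trims the whitespace around each piece of text
texts = [li.get_text(strip=True) for li in items]

# The standalone <li> has no <a> child, so li.a is None for it
hrefs = [li.a["href"] for li in items if li.a is not None]

print(len(items))
print(texts)
print(hrefs)
```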

Here is a more flexible way to query with find_all:

# The most flexible form: pass a dict through attrs
li_quick = soup.find_all(attrs={'class': 'item-1'})
for li in li_quick:
    print('most flexible lookup:', li)

Output:

most flexible lookup: <li class="item-1"><a href="link2.html">second item</a></li>
most flexible lookup: <li class="item-1"><a href="link4.html">fourth item</a></li>
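The attrs dict is most useful for attributes that cannot be written as Python keyword arguments, such as data-* attributes (a hyphen is not valid in an identifier) or an attribute literally called name (which would collide with the name parameter). The form markup below is a made-up example for this sketch and is not part of the sample page.

```python
from bs4 import BeautifulSoup

html = """
<form>
  <input name="email" data-role="contact">
  <input name="phone" data-role="contact">
  <input name="age" data-role="profile">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

# data-role cannot be a keyword argument, so it goes in the attrs dict
contacts = soup.find_all(attrs={"data-role": "contact"})

# name= is already the tag-name parameter, so the HTML name attribute also goes in attrs
email = soup.find_all(attrs={"name": "email"})

# Several attributes in one dict must all match
both = soup.find_all("input", attrs={"data-role": "contact", "name": "phone"})

print([tag["name"] for tag in contacts])
```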

Full code:

# coding=utf8
# @Author= CaiJunxuan

# beautifulsoup

# Import the BeautifulSoup class
from bs4 import BeautifulSoup

# Sample HTML
html = '''
<html>
  <head>
    <title>
      index
    </title>
  </head>
  <body>
     <div>
        <ul>
          <li id="flask" class="item-0"><a href="link1.html">first item</a></li>
          <li class="item-1"><a href="link2.html">second item</a></li>
          <li class="item-inactie"><a href="link3.html">third item</a></li>
          <li class="item-1"><a href="link4.html">fourth item</a></li>
          <li class="item-0"><a href="link5.html">fifth item</a>
         </ul>
     </div>
    <li> hello world </li>
  </body>
</html>
'''

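# Build the BeautifulSoup instance
soup = BeautifulSoup(html, 'lxml')
# The first argument is the markup to parse
# The second argument is the parser that BeautifulSoup should use
# html.parser is Python's built-in parser; it needs no extra install but is slower than lxml
# lxml uses the lxml package
# html5lib parses markup the way a web browser does and produces valid HTML5
# Third-party parsers must be installed first; for example, using lxml requires installing the lxml package

# bs4 offers many search methods, but two are used most often: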

# find returns a single match
li = soup.find('li')
print('find_li:', li)
print('li.text (text content of the tag):', li.text)
print('li.attrs (attributes of the tag):', li.attrs)
print('li.string (content of the tag as a string):', li.string)
print(50 * '*', '\n')

# find can also select by attribute=value
li = soup.find(id='flask')
print(li, '\n')
# class is a reserved keyword in Python, so it cannot be used directly as a keyword argument
# There are two ways to query the class attribute
# Option 1: pass a dict through the attrs parameter
find_class = soup.find(attrs={'class': 'item-1'})
print('findclass:', find_class, '\n')
# Option 2: BeautifulSoup's special keyword argument class_
beautifulsoup_class_ = soup.find(class_='item-1')
print('BeautifulSoup_class_:', beautifulsoup_class_, '\n')

# find_all returns every match
li_all = soup.find_all('li')
for li in li_all:
    print('---')
    print('matched li:', li)
    print('li text:', li.text)
    print('li attributes:', li.attrs)

# The most flexible form: pass a dict through attrs
li_quick = soup.find_all(attrs={'class': 'item-1'})
for li in li_quick:
    print('most flexible lookup:', li)
