Getting the Content of Specific Tags with BeautifulSoup

Updated: 2020-12-07 11:40:41  Author: qianc6350528

This article explains how to get the content of specific tags with BeautifulSoup, working through find_all() and its filters with example code.

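The following are notes from my own study of BeautifulSoup. My current approach when scraping is this: first use find_all() to locate the tags that contain the content I need; if a single find_all() call is not enough, use two or more. Then loop over the find_all() results and call get_text() or get('href') to extract the text or the link, appending each to its own list. The drawback is that the text ends up in one list and the links in another. The relationship between the values is lost, and every additional field I want to collect requires another empty list.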

Iterating over all tags:

soup.find_all('a')

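This finds every <a> tag in the page and returns the matches as a list. The matched tags can then be processed further, for example by looping over them with for and extracting just the link or the text from each one.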

# Create empty lists
links = []
txts = []
tags = soup.find_all('a')
for tag in tags:
  links.append(tag.get('href'))
  txts.append(tag.text)         # or txts.append(tag.get_text())
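One way to address the drawback described earlier (text and links living in unrelated lists) is to store one record per tag. The following is a minimal sketch, not the author's method: the HTML fragment and the `records` name are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment used only for this illustration.
html = """
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
  <li><a>No link</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# One dict per <a> tag keeps the text and the link together.
records = []
for tag in soup.find_all("a"):
    records.append({
        "text": tag.get_text(strip=True),
        "href": tag.get("href"),  # None when the attribute is missing
    })

print(records)
```

Adding another field then means adding one more key to the dict rather than creating another list.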

Getting the value of an HTML attribute:

atr = []
tags = soup.find_all('a')
for tag in tags:
  if tag.p is not None:              # skip <a> tags that contain no <p>
    atr.append(tag.p.get('class'))   # class of the first <p> inside this <a>
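The different ways of reading attributes are easy to mix up, so here is a small sketch that contrasts them. The HTML string is invented for the example; `tag.get()`, `tag[...]` and `tag.attrs` are standard Beautiful Soup accessors.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <a> wraps a <p>, the other does not.
html = '<a href="/x" id="first"><p class="intro lead">Hi</p></a><a href="/y">Bare</a>'
soup = BeautifulSoup(html, "html.parser")

classes = []
for tag in soup.find_all("a"):
    child = tag.p  # first <p> inside this <a>, or None if there is none
    classes.append(child.get("class") if child is not None else None)

first = soup.find("a")
href = first["href"]       # raises KeyError if the attribute is missing
all_attrs = first.attrs    # dict of every attribute on the tag

print(classes)
print(href)
print(all_attrs)
```

Note that class is a multi-valued attribute, so its value comes back as a list of class names rather than a single string.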

Usage examples for find_all():

The examples below are taken from the Chinese translation of the Beautiful Soup documentation.

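1. Strings

The simplest filter is a string. Pass a string to a search method and Beautiful Soup matches tag names against that exact string. The following example finds every <b> tag in the document: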

soup.find_all('b')
# [<b>The Dormouse's story</b>]
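The snippets in this section use a `soup` object that is never constructed in the text. The sketch below builds one, assuming the short "three sisters" sample document from the Beautiful Soup documentation and the standard-library `html.parser` backend; both are assumptions, since the article does not say which document or parser was used.

```python
from bs4 import BeautifulSoup

# Sample document modeled on the one used in the Beautiful Soup docs.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# A string filter matches tag names exactly.
bold_tags = soup.find_all("b")
print(bold_tags)
```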

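2. Regular expressions

If you pass in a compiled regular expression, Beautiful Soup filters tag names with its search() method. The following example finds every tag whose name starts with "b", so both <body> and <b> are found: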

import re
for tag in soup.find_all(re.compile("^b")):
  print(tag.name)
# body
# b

The following code finds every tag whose name contains the letter "t":

for tag in soup.find_all(re.compile("t")):
  print(tag.name)
# html
# title
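The two patterns above differ only in the `^` anchor. A self-contained sketch, using a made-up one-line document, shows the effect: an unanchored pattern matches anywhere in the tag name, while `^b` only matches at the start.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document chosen so that several tag names contain "t".
html = "<html><head><title>T</title></head><body><b>x</b><table></table></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Anchored: the tag name must start with "b".
starts_with_b = [tag.name for tag in soup.find_all(re.compile("^b"))]

# Unanchored: "t" may appear anywhere in the tag name.
contains_t = [tag.name for tag in soup.find_all(re.compile("t"))]

print(starts_with_b)
print(contains_t)
```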

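3. Lists

If you pass in a list, Beautiful Soup returns everything that matches any item in the list. The following code finds all the <a> tags and all the <b> tags in the document: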

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
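As a quick check of the list filter, the sketch below runs it on a one-line fragment invented for the example. Results come back in document order, not in the order of the names in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three different inline tags.
html = "<p><b>bold</b> <i>italic</i> <a href='/x'>link</a></p>"
soup = BeautifulSoup(html, "html.parser")

# Match any tag whose name is "a" or "b"; <i> and <p> are skipped.
names = [tag.name for tag in soup.find_all(["a", "b"])]
print(names)
```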

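4. Functions (a custom function passed to find_all)

If none of the other filters fit, you can define a function that takes an element as its only argument. The function should return True if the element matches, and False otherwise.
The following function returns True if the tag defines the class attribute but not the id attribute: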

def has_class_but_no_id(tag):
  return tag.has_attr('class') and not tag.has_attr('id')

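Passing this function to find_all() returns only the <p> tags. The <a> tags are not returned, because they define "id" as well as "class"; <html> and <head> are not returned either, because neither of them defines "class".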
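To see the filter in action, here is a runnable sketch. The function is the one shown above; the three-line document is a stand-in written for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical document: two <p> tags with a class, one <a> with class and id,
# and a <div> with only an id.
html = """
<p class="title">Title</p>
<p class="story">Story <a class="sister" id="link1" href="/e">Elsie</a></p>
<div id="footer">end</div>
"""
soup = BeautifulSoup(html, "html.parser")

def has_class_but_no_id(tag):
    """Match tags that define class but not id."""
    return tag.has_attr("class") and not tag.has_attr("id")

# find_all() calls the function once per tag and keeps the tags
# for which it returns True.
matches = soup.find_all(has_class_but_no_id)
names = [tag.name for tag in matches]
print(names)
```

The <a> tag is rejected because it has an id, and the <div> is rejected because it has no class.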
The following code finds every tag that is surrounded by string nodes:

from bs4 import NavigableString
def surrounded_by_strings(tag):
  return (isinstance(tag.next_element, NavigableString)
      and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
  print(tag.name)
# p
# a
# a
# a
# p
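The result depends on parse order: `previous_element` and `next_element` are whatever was parsed immediately before and after the tag's opening, which may be a string or another tag. A minimal sketch on a one-line fragment (invented for this example) makes that easier to follow.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment. Parse order: <p>, "Hello ", <b>, "big", " world".
html = "<p>Hello <b>big</b> world</p>"
soup = BeautifulSoup(html, "html.parser")

def surrounded_by_strings(tag):
    """Match tags whose neighbors in parse order are both strings."""
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

names = [tag.name for tag in soup.find_all(surrounded_by_strings)]
print(names)
```

Here <b> matches because "Hello " precedes it and "big" follows it. <p> does not match: a string follows it, but nothing that is a string comes before it.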

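5. Searching by CSS class

Searching for tags by CSS class is very useful, but class is a reserved word in Python, so using class as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class with the class_ argument: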

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Or:

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
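The two spellings are interchangeable, which the sketch below checks on a small fragment made up for this example. It also shows that a tag with several classes matches if any one of them equals the search value.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the first <a> carries two classes.
html = (
    '<a class="sister big" href="/e">Elsie</a>'
    '<a class="brother" href="/b">Bob</a>'
    '<p class="sister">Not a link</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ avoids the reserved word "class".
by_keyword = soup.find_all("a", class_="sister")

# Dictionary form: "class" is an ordinary dict key here.
by_attrs = soup.find_all("a", attrs={"class": "sister"})

print([tag.get_text() for tag in by_keyword])
print([tag.get_text() for tag in by_attrs])
```

The <p> tag is excluded by the name filter "a", not by the class filter.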

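6. Searching with the text argument

The text argument searches the strings in a document instead of the tags. Like the name argument, text accepts a string, a regular expression, a list, a function, or True. For example: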

soup.find_all(text="Elsie")
# ['Elsie']

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all(text=re.compile("Dormouse"))
# ["The Dormouse's story", "The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
  """Return True if this string is the only child of its parent tag."""
  return (s == s.parent.string)

soup.find_all(text=is_the_only_string_within_a_tag)
# ["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

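Although text is meant for finding strings, it can be combined with arguments that find tags. Beautiful Soup then returns every tag whose .string attribute matches the text value. The following code finds the <a> tags whose .string is "Elsie":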

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]
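The sketch below puts the string searches side by side on a short fragment invented for the example. It uses the `string` keyword, which is the name this argument has had since Beautiful Soup 4.4.0; `text` is the older name for the same argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with two links inside a sentence.
html = '<p>Names: <a href="/e">Elsie</a>, <a href="/l">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Without a tag name, the search returns the matching strings themselves.
exact = [str(s) for s in soup.find_all(string="Elsie")]
by_pattern = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# With a tag name, it returns the tags whose .string matches.
tags = soup.find_all("a", string="Elsie")
hrefs = [tag["href"] for tag in tags]

print(exact)
print(by_pattern)
print(hrefs)
```

The first two calls return string nodes, while the last one returns tags, so attributes such as href are only available on the last result.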

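7. Searching only direct children

When you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants: its children, its children's children, and so on. To consider only the direct children, pass recursive=False.

A simple document: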

<html>
 <head>
 <title>
  The Dormouse's story
 </title>
 </head>
...

Search results with and without the recursive argument:

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []
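The second search comes back empty because <title> sits under <head>, so it is a grandchild of <html> rather than a direct child. A self-contained sketch of the same comparison, using a one-line version of the document:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head><body></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Default: search every descendant of <html>.
deep = soup.html.find_all("title")

# recursive=False: look only at the direct children of <html>.
shallow = soup.html.find_all("title", recursive=False)

# <head> is a direct child, so it is still found with recursive=False.
direct_heads = soup.html.find_all("head", recursive=False)

print(deep)
print(shallow)
print([tag.name for tag in direct_heads])
```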
