
Python Web Scraping in Depth: Using Beautiful Soup

 Updated: 2021-09-28 08:33:39   Author: 小狐狸梦想去童话镇
In short, Beautiful Soup is a Python library whose main job is extracting data from web pages. It provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.

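I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing tool that uses a page's structure and attributes to parse it.

It provides functions for navigating, searching, and modifying the parse tree. Beautiful Soup converts incoming documents to Unicode and outgoing documents to UTF-8 automatically, so you usually do not need to think about the document's encoding. The parsing itself is delegated to a parser; lxml is a common choice.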
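The encoding handling and the parser argument can be illustrated with a small sketch. The sample markup and the GBK encoding are assumptions made for illustration, and the built-in "html.parser" is used here only so the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, encoded as GBK bytes to imitate a non-UTF-8 response.
page = '<html><head><meta charset="gbk"><title>百度一下</title></head></html>'
raw = page.encode("gbk")

# The second argument names the parser. "html.parser" ships with Python;
# pass "lxml" instead if that package is installed.
soup = BeautifulSoup(raw, "html.parser")

title_text = soup.title.string     # already a Unicode str, no manual decode needed
detected = soup.original_encoding  # the encoding Beautiful Soup settled on

print(title_text)
print(detected)
```

Passing bytes rather than a decoded string lets Beautiful Soup use the charset declared in the document when it picks an encoding.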

II. Using Beautiful Soup

The test file test03.html:

<!DOCTYPE html>
<html>
<head>
    <meta content="text/html;charset=utf-8" http-equiv="content-type" />
    <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
    <meta content="always" name="referrer" />
    <link rel="stylesheet" type="text/css" />
    <title>百度一下,你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
    <div id="head">
        <div class="head_wrapper">
          <div id="u1">
            <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
            <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
            <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
            <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
            <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
            <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
          </div>
        </div>
    </div>
  </div>
</body>
</html>

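1. Node selectors

As covered earlier, a web page is made up of element nodes, and the data shown on the page can be obtained by extracting the content of specific nodes. Node selectors simplify this: they let us pick out data precisely without resorting to regular expressions.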

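from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.head)        # the <head> node
print(soup.head.title)  # nested selection: <title> inside <head>
print(soup.a)           # only the first <a> node

[Output]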

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>
<title>百度一下,你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>

Analysis:

The first print statement outputs the page's head node.

The second outputs the title node inside head. This uses nested selection, because title is nested within head.

The third outputs an a node. The source contains several a nodes, but only the first is returned. When several nodes match, this style of selection returns only the first match and ignores the rest.
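The first-match behaviour is easy to check on a small fragment. The sketch below uses a made-up two-link snippet and the built-in "html.parser" (both assumptions for illustration), and also shows that selecting a tag that does not exist returns None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the sample page
html = """
<div id="u1">
  <a class="mnav" href="http://news.baidu.com">news</a>
  <a class="mnav" href="http://map.baidu.com">map</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # only the first <a> in document order
all_links = soup.find_all("a")  # every <a> in the fragment
missing = soup.table            # no <table> in the fragment, so this is None

print(first_link["href"])
print(len(all_links))
print(missing)
```

Because a missing tag yields None, chained selections such as soup.table.tr fail with an AttributeError, so it is worth checking for None before going deeper.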

2. Extracting information

The data we need usually lives in a node's name, its attribute values, or its text. The following code shows how to read all three:

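from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.body.name)              # node name
print(soup.body.a.attrs['class'])  # class is multi-valued, so a list is returned
print(soup.body.a.attrs['href'])   # href attribute value
print(soup.body.a.string)          # text of the node

[Output]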

body
['mnav']
http://news.baidu.com
新聞

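Analysis:

The first line gets the name of the body node.

The second gets the class attribute of the a node, returned as a list because class can hold several values.

The third gets the href attribute of the a node.

The fourth gets the text of the a node.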
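As a small sketch of the same three lookups on a single tag, the snippet below also uses get(), which returns None for an absent attribute where square-bracket access would raise KeyError. The markup, including the extra "top" class, is a made-up example.

```python
from bs4 import BeautifulSoup

# Hypothetical single-link fragment; the second class value is added for illustration
html = '<a class="mnav top" href="http://news.baidu.com" name="tj_trnews">news</a>'
link = BeautifulSoup(html, "html.parser").a

tag_name = link.name         # "a"
classes = link["class"]      # multi-valued attribute, returned as a list
href = link["href"]          # single-valued attribute, returned as a str
target = link.get("target")  # attribute is absent, so get() returns None

print(tag_name, classes, href, target)
```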

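3. Related-node selection

(1) Children and descendants

A node's direct children are available through its contents and children attributes, and all of its descendants through descendants. contents returns a list, children returns an iterator, and descendants returns a generator; all three can be traversed with a for loop.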

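from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# contents is a list of the direct children of <body>, including whitespace text nodes
for i, content in enumerate(soup.body.contents):
    print(i, content)

[Output]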

0

1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
</div>
</div>
</div>
</div>
2
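To make the difference between the three attributes concrete, here is a minimal sketch on a made-up fragment, parsed with the built-in "html.parser" as an assumption of convenience. It compares what contents, children, and descendants each produce.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links and no whitespace between tags
html = '<div id="u1"><a href="#n">news</a><a href="#m">map</a></div>'
div = BeautifulSoup(html, "html.parser").div

direct = div.contents                          # a real list of direct children
child_names = [c.name for c in div.children]   # same nodes, via an iterator
descendant_count = len(list(div.descendants))  # 2 tags plus their 2 text nodes

print(type(direct).__name__, len(direct))
print(child_names)
print(descendant_count)
```

descendants walks the whole subtree, so it includes the text nodes inside each a tag, while contents and children stop at the direct children.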

(2) Parent and ancestors

Use the parent attribute to get a node's parent. For example, to get the parent of the title node in the sample:

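from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.title.parent)  # the <head> node that contains <title>

[Output]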

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link rel="stylesheet" type="text/css"/>
<title>百度一下,你就知道 </title>
</head>

Likewise, to get a node's ancestors, use the parents attribute, which yields each ancestor in turn, starting with the nearest.
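A short sketch of parent versus parents on a made-up document follows. The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document
html = "<html><head><title>Home</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

title = soup.title
parent_name = title.parent.name                   # the direct parent only
ancestor_names = [p.name for p in title.parents]  # nearest ancestor first

print(parent_name)
print(ancestor_names)
```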

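(3) Siblings

next_sibling returns the node immediately after the current one at the same level;

previous_sibling returns the node immediately before it;

next_siblings yields all of the siblings that follow the node;

previous_siblings yields all of the siblings that precede the node. Note that an adjacent sibling is often a whitespace text node rather than a tag.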
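The sibling attributes can be sketched as follows on a made-up fragment. Because the links sit on separate lines, the immediate sibling of the first link is a newline text node, so the sketch also uses find_next_sibling() and an isinstance check to reach actual tags.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment; the line breaks between tags become text nodes
html = """<div id="u1">
<a name="tj_trnews">news</a>
<a name="tj_trmap">map</a>
<a name="tj_trtieba">tieba</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")
first = soup.a

raw_next = first.next_sibling            # the "\n" text node, not a tag
next_tag = first.find_next_sibling("a")  # skips text nodes, returns the next <a>
later_names = [s["name"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next))
print(next_tag["name"])
print(later_names)
```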

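4. Method selectors

find_all()

Finds all elements that match the given conditions. Its signature is:

find_all(name, attrs, recursive, string, limit, **kwargs)

Older releases call the string parameter text; the old name is still accepted but deprecated.

(1) name

Searches by node name. For example, to find the a tags in the sample: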

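from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a"))    # the whole result list at once
for a in soup.find_all(name="a"):
    print(a)                      # then one tag per line

[Output]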

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>

(2) attrs

A search can also filter on tag attributes. The attrs parameter takes a dictionary.

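from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# Keep only the <a> tags whose class attribute includes "bri"
print(soup.find_all(name="a", attrs={"class": "bri"}))

[Output]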

[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]

As the output shows, once the class="bri" condition is added, only one a tag remains in the result.
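Attribute filters can be written in more than one way. The sketch below, on a made-up fragment, compares the attrs dictionary with the class_ keyword argument, and shows why an HTML attribute called name has to go through attrs: the name keyword of find_all() is already taken by the tag name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the sample page
html = (
    '<div id="u1">'
    '<a class="mnav" name="tj_trnews">news</a>'
    '<a class="bri" name="tj_briicon">more</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", attrs={"class": "bri"})
by_keyword = soup.find_all("a", class_="bri")  # class is a Python keyword, hence class_
by_name_attr = soup.find_all(attrs={"name": "tj_trnews"})

print(by_dict == by_keyword)
print(by_name_attr[0].string)
```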

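(3) string (formerly text)

The string parameter, called text in older releases, matches against a node's text. It accepts either a string or a compiled regular expression.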

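import re
from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
# string= is the current name of the older text= argument
print(soup.find_all(name="a", string=re.compile('新聞')))

[Output]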

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>]

The result contains only the a tags whose text matches the pattern, that is, those whose text contains "新聞".
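The difference between a plain string and a regular expression can be sketched as follows. The fragment is a made-up example, and the string keyword assumes a Beautiful Soup release that supports it (4.4.0 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment with overlapping link texts
html = "<p><a>news</a><a>news center</a><a>map</a></p>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all("a", string="news")                # text must equal "news"
partial = soup.find_all("a", string=re.compile("news"))  # text must contain "news"

print(len(exact))
print(len(partial))
```

A plain string has to equal the node's text exactly, whereas a compiled pattern is applied as a search, so it matches anywhere inside the text.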

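find()

find() is used in the same way as find_all(). The only difference is that find() matches just the first element found and returns that single element (or None when nothing matches), whereas find_all() matches every element that meets the conditions and returns a list.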
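A minimal side-by-side sketch of the two methods, on a made-up fragment, follows. It also uses the limit argument of find_all(), which caps the number of results.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment
html = '<div><a href="#n">news</a><a href="#m">map</a></div>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                 # a single Tag
limited = soup.find_all("a", limit=1)  # a list holding at most one Tag
none_found = soup.find("table")        # None, because nothing matches
empty = soup.find_all("table")         # an empty list, because nothing matches

print(first)
print(limited)
print(none_found, empty)
```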

5. CSS selectors

To use CSS selectors, call the select() method with a CSS selector string.

For example, to select the a tags in the sample with a CSS selector:

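from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
print(soup.select('a'))    # the whole result list at once
for a in soup.select('a'):
    print(a)               # then one tag per line

[Output]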

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
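Selectors can be more specific than a bare tag name. The sketch below, on a made-up fragment, combines an id, a class, and an attribute selector, and uses select_one() to get a single element instead of a list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two links inside #u1 and one outside it
html = (
    '<div id="u1">'
    '<a class="mnav" href="#n">news</a>'
    '<a class="bri" href="#m">more</a>'
    "</div>"
    '<a class="mnav" href="#x">outside</a>'
)
soup = BeautifulSoup(html, "html.parser")

nav_in_u1 = soup.select("#u1 a.mnav")   # a.mnav descendants of the element with id u1
first_bri = soup.select_one("a.bri")    # first match only, or None
with_href = soup.select('a[href="#x"]') # attribute selector

print([a.get_text() for a in nav_in_u1])
print(first_bri.get_text())
print([a.get_text() for a in with_href])
```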

Getting attributes

To get the href attribute of the a tags above:

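from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a['href'])  # raises KeyError if a tag has no href

[Output]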

http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
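The last URL in the output is protocol-relative, so it cannot be requested as it stands. One possible way to normalise the links, sketched here under the assumption that the page was fetched from https://www.baidu.com/, is urljoin from the standard library; the a[href] selector also skips tags that have no href at all.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.baidu.com/"  # assumed address the page was fetched from

# Hypothetical fragment: absolute link, protocol-relative link, and a tag with no href
html = (
    '<a href="http://news.baidu.com">news</a>'
    '<a href="//www.baidu.com/more/">more</a>'
    "<a>no link</a>"
)
soup = BeautifulSoup(html, "html.parser")

# a[href] selects only the <a> tags that actually carry an href attribute
links = [urljoin(base_url, a["href"]) for a in soup.select("a[href]")]

for link in links:
    print(link)
```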

Getting text

To get the text of the a tags above, use the get_text() method or the string attribute:

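from bs4 import BeautifulSoup

# Read the sample page as bytes; the with block closes the file afterwards
with open("./test03.html", "rb") as file:
    html = file.read()

soup = BeautifulSoup(html, "lxml")
for a in soup.select('a'):
    print(a.get_text())  # all text inside the tag, joined together
    print(a.string)      # the tag's single text child

[Output]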

新聞
新聞
hao123
hao123
地圖
地圖
視頻
視頻
貼吧
貼吧
更多產品
更多產品
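The two approaches give the same result here because every a tag in the sample has exactly one text child. They diverge on nested markup, as this sketch on a made-up fragment suggests: string is None when a tag has more than one child, while get_text() joins all the text inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one simple link and one with a nested <b> tag
html = "<p><a> news </a><a>more <b>products</b></a></p>"
soup = BeautifulSoup(html, "html.parser")
simple, nested = soup.find_all("a")

simple_string = simple.string                 # the single text child, spaces included
simple_clean = simple.get_text(strip=True)    # surrounding whitespace removed
nested_string = nested.string                 # None: the tag has two children
nested_text = nested.get_text()               # text of all descendants, joined

print(repr(simple_string), repr(simple_clean))
print(nested_string, repr(nested_text))
```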
