
Learning Web Scraping with Python and BeautifulSoup from Scratch

Updated: 2024-01-28 10:33:03  Author: 程序員曉曉

Want to learn web scraping with Python and BeautifulSoup from scratch? This guide walks through both tools in simple, easy-to-follow steps. Whether you are a beginner or an experienced developer, it should help you get started quickly and build on what you already know.

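In the internet age, data is one of the most valuable resources, and web scraping is an important way to obtain it. Python is efficient, easy to learn, and easy to use, which makes it one of the preferred languages for scraping. BeautifulSoup is one of the most widely used Python libraries for this work: it parses HTML and XML documents quickly and simply so that we can extract the data we need. It does not download pages itself; that job falls to an HTTP library such as requests.

This article shows how to scrape web page data with BeautifulSoup, with commented code to help readers get started quickly.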

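Installing BeautifulSoup

Before we begin, we need to install BeautifulSoup. The examples below also use the requests library to download pages, so install both with pip:

pip install beautifulsoup4 requests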
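A quick way to confirm that the installation worked is to import the package and print its version. This is only a sanity check, and it assumes the package was installed into the same Python environment that runs the script:

```python
import bs4

# If this import fails, the package is not installed in this environment
print(bs4.__version__)
```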

Scraping web page data

In this article we use the Douban Movie Top250 list as an example of how to scrape web page data with BeautifulSoup.

First, import the required libraries:

import requests
from bs4 import BeautifulSoup

Next, we need the page's HTML. The get() function in the requests library fetches the page:

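url = 'https://movie.douban.com/top250'
# Some sites reject the default requests User-Agent, so send a browser-like one
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = response.text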
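The download step can also be wrapped in a small helper so that error handling lives in one place. The sketch below is one possible shape, not the only one: the name fetch_html, its session parameter, and the User-Agent value are all assumptions made for illustration.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its body as text (hypothetical helper)."""
    # Use the caller's session if one is given, else the requests module itself
    http = requests if session is None else session
    # Assumed header value: many sites only need something browser-like
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = http.get(url, headers=headers, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return response.text
```

Calling fetch_html('https://movie.douban.com/top250') would then return the HTML string or raise an exception, which keeps the parsing code free of network concerns.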

Next, we parse the HTML with BeautifulSoup by passing it to the BeautifulSoup constructor, which creates a BeautifulSoup object:

soup = BeautifulSoup(html, 'html.parser')

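Here we use 'html.parser', the parser built into Python's standard library. Other parsers such as lxml and html5lib also work, but each must be installed separately.

The page's HTML is now parsed into a BeautifulSoup object, and we can use that object's methods to extract the data we need.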
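To see what the parsed object offers without touching the network, it can help to parse a tiny HTML string first. The markup below is made up for illustration and is not taken from the Douban page:

```python
from bs4 import BeautifulSoup

# A made-up document, small enough to read at a glance
html = '<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# Tag names are available as attributes and return the first match
page_title = soup.title.string
first_paragraph = soup.p.text
print(page_title, first_paragraph)
```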

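Extracting data

On the Douban Movie Top250 page, each entry includes the film's title, director, cast, rating, and other details. BeautifulSoup's find() and find_all() methods can extract them.

First, we need to locate the HTML elements that hold the movie information. The browser's developer tools show the page's HTML and help find the right elements. On the Douban Movie Top250 page, each movie sits inside a div element whose class is 'item':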

<div class="item">
  <div class="pic">
    <em class="">1</em>
    <a >
      <img src="https://img9.doubanio.com/view/photo/s_ratio_poster/public/p480747492.jpg" class="" />
    </a>
  </div>
  <div class="info">
    <div class="hd">
      <a  class="">
        <span class="title">肖申克的救贖</span>
        <span class="title">&nbsp;/&nbsp;The Shawshank Redemption</span>
        <span class="other">&nbsp;/&nbsp;月黑高飛(港)  /  刺激1995(臺)</span>
      </a>
      <span class="playable">[可播放]</span>
    </div>
    <div class="bd">
      <p class="">
        導演: 弗蘭克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·羅賓斯 Tim Robbins /...<br />
        1994&nbsp;/&nbsp;美國&nbsp;/&nbsp;犯罪 劇情
      </p>
      <div class="star">
        <span class="rating5-t"></span>
        <span class="rating_num" property="v:average">9.7</span>
        <span property="v:best" content="10.0"></span>
        <span>1057904人評價</span>
      </div>
      <p class="quote">
        <span class="inq">希望讓人自由。</span>
      </p>
    </div>
  </div>
</div>

We can use find_all() to find every div element whose class is 'item':

items = soup.find_all('div', class_='item')

We pass class_ rather than class because class is a reserved keyword in Python.
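The class_ keyword is one of several ways to match on a class. The sketch below runs three equivalent lookups on a made-up snippet: the class_ keyword, an attrs dictionary, and a CSS selector through select(). Note that a class lookup also matches elements that carry additional classes.

```python
from bs4 import BeautifulSoup

# Made-up markup: two matching divs (one with an extra class) and one other
html = '''
<div class="item"><span class="title">A</span></div>
<div class="item featured"><span class="title">B</span></div>
<div class="other"><span class="title">C</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

by_keyword = soup.find_all('div', class_='item')        # keyword argument
by_attrs = soup.find_all('div', attrs={'class': 'item'})  # attribute dict
by_css = soup.select('div.item')                         # CSS selector

titles = [div.find('span', class_='title').text for div in by_keyword]
print(titles)
```

Which form to use is mostly a matter of taste; select() tends to read better once the lookup involves nesting.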

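We now have the HTML element for each movie on the page. Next, we call methods on those elements to extract the movie information.

For example, given one element item from items, find() locates the element that holds the movie title: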

title = item.find('span', class_='title').text

The text attribute returns the text content of the HTML element.
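Two details are worth knowing when reading text. The text attribute keeps surrounding whitespace, while get_text(strip=True) trims it, and find() returns None when nothing matches, so an optional field needs a check before its text is read. A minimal sketch on a made-up snippet:

```python
from bs4 import BeautifulSoup

# Made-up markup with padded text and no quote element
html = '<div class="item"><span class="title">  The Shawshank Redemption </span></div>'
item = BeautifulSoup(html, 'html.parser').find('div', class_='item')

raw = item.find('span', class_='title').text                    # whitespace kept
clean = item.find('span', class_='title').get_text(strip=True)  # whitespace trimmed

# find() returns None for a missing element, so guard before reading text
quote_tag = item.find('span', class_='inq')
quote = quote_tag.get_text(strip=True) if quote_tag is not None else ''
```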

The director, cast, and rating can be extracted in a similar way. The complete code is:

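import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/top250'
# Send a browser-like User-Agent; some sites reject the requests default
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
soup = BeautifulSoup(response.text, 'html.parser')

for item in soup.find_all('div', class_='item'):
    # The first span.title holds the primary title
    title = item.find('span', class_='title').text
    # The first line of the <p> reads "<director label>: names   主演: names"
    info = item.find('div', class_='bd').p.get_text().strip()
    first_line = info.splitlines()[0]
    director_part, _, actors_part = first_line.partition('主演:')
    director = director_part.split(':', 1)[-1].strip()  # drop the label
    actors = actors_part.strip()  # empty if the line lists no cast
    rating = item.find('span', class_='rating_num').text
    print('Title:', title)
    print('Director:', director)
    print('Cast:', actors)
    print('Rating:', rating)
    print('------------------------')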
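The extraction logic can be tried offline by running it against a trimmed copy of the sample markup shown earlier. The sketch below assumes the first line of the paragraph has the form "director label, names, then 主演: and the cast", as in that sample; the helper name parse_item and the dictionary keys are hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample entry, kept inline so no network access is needed
SAMPLE = '''
<div class="item">
  <div class="info">
    <div class="hd">
      <a><span class="title">肖申克的救贖</span></a>
    </div>
    <div class="bd">
      <p class="">
        導演: 弗蘭克·德拉邦特 Frank Darabont&nbsp;&nbsp;&nbsp;主演: 蒂姆·羅賓斯 Tim Robbins /...<br />
        1994&nbsp;/&nbsp;美國&nbsp;/&nbsp;犯罪 劇情
      </p>
      <div class="star">
        <span class="rating_num" property="v:average">9.7</span>
      </div>
    </div>
  </div>
</div>
'''


def parse_item(item):
    """Turn one div.item element into a dict (hypothetical helper)."""
    # Only the first line of the paragraph holds the director and the cast
    lines = item.find('div', class_='bd').p.get_text().strip().splitlines()
    director_part, _, cast_part = lines[0].partition('主演:')
    return {
        'title': item.find('span', class_='title').get_text(strip=True),
        # Drop everything up to the first colon, i.e. the director label
        'director': director_part.split(':', 1)[-1].strip(),
        'cast': cast_part.strip(),
        'rating': float(item.find('span', class_='rating_num').text),
    }


soup = BeautifulSoup(SAMPLE, 'html.parser')
movies = [parse_item(item) for item in soup.find_all('div', class_='item')]
print(movies)
```

Returning a dictionary per movie, rather than printing inside the loop, makes it easier to save or filter the results later.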

Summary

This article showed how to scrape web page data with BeautifulSoup, with commented code along the way. Readers should now be able to use BeautifulSoup to parse HTML and XML documents and extract the data they need, and can adapt the code here to scrape other pages.
