Python in Practice: Getting Started with BeautifulSoup to Scrape Column Titles and URLs
Updated: 2021-10-20 14:48:50  Author: 小旺不正經
BeautifulSoup is a must-learn tool for web scraping. Its main job is extracting data from web pages: Beautiful Soup automatically converts the input document to Unicode and serializes its output as UTF-8.
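As a quick illustration of that encoding behavior, here is a minimal sketch (the GBK-encoded snippet below is a made-up example, not from the article):

from bs4 import BeautifulSoup

# A made-up page encoded in GBK, with the encoding declared in a <meta> tag
html_gbk = '<html><head><meta charset="gbk"></head><body><p>你好</p></body></html>'.encode('gbk')
soup = BeautifulSoup(html_gbk, 'html.parser')
print(soup.original_encoding)   # typically 'gbk', detected from the declaration
print(soup.p.text)              # already decoded to a Unicode str
print(soup.p.encode('utf-8'))   # the same tag re-serialized as UTF-8 bytes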
Getting started with the BeautifulSoup library
Installation
pip install beautifulsoup4
# If the command above fails, install from a mirror instead:
pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple
Run the command from PyCharm's built-in terminal (or any command line).
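To confirm the install worked, a quick sanity check (my own addition) from the same terminal:

python -c "import bs4; print(bs4.__version__)"

This prints the installed version number, e.g. 4.x.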

Parsing tags
from bs4 import BeautifulSoup
import requests
url='https://blog.csdn.net/weixin_42403632/category_11076268.html'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}
html=requests.get(url,headers=headers).text
s=BeautifulSoup(html,'html.parser')
title =s.select('h2')
for i in title:
    print(i.text)
Line 1: import the BeautifulSoup library.
Line 2: import requests.
Lines 3-5: request the URL and grab its HTML.
Line 6: build a BeautifulSoup object from the HTML; 'html.parser' selects Python's built-in HTML parser.
Line 7: select all <h2> tags.
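For reference, the same selection can also be written with find_all() instead of a CSS selector (a small equivalent sketch reusing the s object from above):

# Equivalent to s.select('h2'): find_all returns a list of matching Tag objects
for tag in s.find_all('h2'):
    print(tag.text)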

Parsing by attribute
The BeautifulSoup library supports selecting page elements by specific attributes.
Selecting by class value
from bs4 import BeautifulSoup
import requests
url='https://blog.csdn.net/weixin_42403632/category_11076268.html'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}
html=requests.get(url,headers=headers).text
s=BeautifulSoup(html,'html.parser')
title =s.select('.column_article_title')
for i in title:
    print(i.text)
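The same filter can be expressed with find_all and the class_ keyword (a minimal sketch reusing the s object above; class_ has a trailing underscore because class is a Python keyword):

for tag in s.find_all(class_='column_article_title'):
    print(tag.text)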

Selecting by id
from bs4 import BeautifulSoup
html='''<div class="crop-img-before">
<img src="" alt="" id="cropImg">
</div>
<div id='title'>
測試成功
</div>
<div class="crop-zoom">
<a href="javascript:;" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="bt-reduce">-</a><a href="javascript:;" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="bt-add">+</a>
</div>
<div class="crop-img-after">
<div class="final-img"></div>
</div>'''
s=BeautifulSoup(html,'html.parser')
title =s.select('#title')
for i in title:
    print(i.text)
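An equivalent lookup (my own aside, using the same s object) is find() with the id keyword, which returns the first match or None:

tag = s.find(id='title')
if tag is not None:
    print(tag.text)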

Multi-level filtering
from bs4 import BeautifulSoup
html='''<div class="crop-img-before">
<img src="" alt="" id="cropImg">
</div>
<div id='title'>
456456465
<h1>測試成功</h1>
</div>
<div class="crop-zoom">
<a href="javascript:;" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="bt-reduce">-</a><a href="javascript:;" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" class="bt-add">+</a>
</div>
<div class="crop-img-after">
<div class="final-img"></div>
</div>'''
s=BeautifulSoup(html,'html.parser')
title =s.select('#title')
for i in title:
    print(i.text)
title =s.select('#title h1')
for i in title:
    print(i.text)
Extracting the URL from <a> tags
title =s.select('a')
for i in title:
    print(i['href'])
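One small refinement (not in the original snippet): i['href'] raises a KeyError when an <a> tag has no href attribute, so .get() is the safer lookup:

for i in s.select('a'):
    href = i.get('href')   # returns None instead of raising if the attribute is missing
    if href:
        print(href)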

Hands-on: getting a blog column's titles + URLs

from bs4 import BeautifulSoup
import requests
import re
url='https://blog.csdn.net/weixin_42403632/category_11298953.html'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'}
html=requests.get(url,headers=headers).text
s=BeautifulSoup(html,'html.parser')
title =s.select('.column_article_list li a')
for i in title:
    # the regex skips the "原創(chuàng)" (Original) label and captures the title on the following line
    print((re.findall('原創(chuàng).*?\n(.*?)\n',i.text))[0].lstrip())
    print(i['href'])
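As a rough extension (my own sketch, not from the article; the filename column_articles.csv is arbitrary), the same (title, URL) pairs can be collected and saved to a CSV file:

import csv

rows = []
for i in s.select('.column_article_list li a'):
    article_title = re.findall('原創(chuàng).*?\n(.*?)\n', i.text)[0].lstrip()
    rows.append((article_title, i['href']))

# utf-8-sig keeps the Chinese titles readable when the file is opened in Excel
with open('column_articles.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'url'])
    writer.writerows(rows)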

That concludes this article on getting started with the BeautifulSoup library to scrape column titles and URLs. For more on Python and BeautifulSoup, search 腳本之家's earlier articles or keep browsing the related articles below, and please keep supporting 腳本之家 in the future!

