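Browsing Liyang's Photography Circle with Python

Updated: 2022-05-17 11:56:05 Author: 夢想橡皮擦

This article shows how to use Python to browse the photography circle of Liyang, walking through the details with BeautifulSoup. It may be a useful reference for readers working on similar scrapers.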

Target site analysis

The list pages of the target site follow this pagination rule:

http://www.jsly001.com/thread-htm-fid-45-page-{page number}.html
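
As a small illustration of the rule above, list-page URLs can be generated by substituting the page number into the pattern. The helper name build_page_urls and the page range are assumptions made for this sketch; the article itself only fetches page 1.

```python
# URL pattern taken from the pagination rule; {page} is the page number.
BASE_URL = "http://www.jsly001.com/thread-htm-fid-45-page-{page}.html"


def build_page_urls(first_page, last_page):
    """Return the list-page URLs for first_page..last_page (inclusive).

    Hypothetical helper, not part of the original code.
    """
    return [BASE_URL.format(page=page) for page in range(first_page, last_page + 1)]


if __name__ == "__main__":
    for url in build_page_urls(1, 3):
        print(url)
```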

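The code is written with the threading module (multithreading), the requests module, and the BeautifulSoup module.

Crawling proceeds from the list page to the detail pages (list page → detail page).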
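
Moving from a list page to a detail page means turning each link's href into a full URL. The sketch below uses urllib.parse.urljoin from the standard library for that step, as an alternative to plain string concatenation. The function name to_detail_url and the sample href are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

SITE_ROOT = "http://www.jsly001.com/"


def to_detail_url(href):
    """Resolve a link found on a list page against the site root.

    urljoin leaves absolute URLs unchanged and resolves relative ones.
    """
    return urljoin(SITE_ROOT, href)


if __name__ == "__main__":
    # "read-htm-tid-1.html" is a made-up href used only for this demo.
    print(to_detail_url("read-htm-tid-1.html"))
```

Compared with an f-string, urljoin also copes with hrefs that start with a slash or are already absolute.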

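Image scraping code for Liyang's photography circle

This is a hands-on example: the complete code is shown first, followed by explanations based on the comments and the key functions.

The main implementation steps are as follows: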

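Set the logging output level
Declare a LiYangThread class that inherits from threading.Thread
Instantiate the thread object
Have each thread fetch from the shared (global) resource
Call the HTML parsing function
Locate the row that separates the board's topic list, mainly to avoid picking up pinned topics
Parse with lxml
Extract the titles and links
Parse the image URLs
Save the images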

import random
import threading
import logging
from bs4 import BeautifulSoup
import requests
import lxml  # imported only to fail fast if the lxml parser is not installed

logging.basicConfig(level=logging.NOTSET)  # set the logging output level


# Declare a LiYangThread class that inherits from threading.Thread
class LiYangThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)  # initialize the thread object
        self._headers = self._get_headers()  # pick a random User-Agent
        self._timeout = 5  # request timeout in seconds

    # Each thread fetches from the shared (global) resource
    def run(self):
        # while True:  # this is where the multithreaded loop would start
        res = None  # stays None if the request fails
        try:
            res = requests.get(url="http://www.jsly001.com/thread-htm-fid-45-page-1.html", headers=self._headers,
                               timeout=self._timeout)  # fetch page 1 as a test
        except Exception as e:
            logging.error(e)
        if res is not None:
            html_text = res.text
            self._format_html(html_text)  # call the HTML parsing function

    def _format_html(self, html):
        # Parse with lxml
        soup = BeautifulSoup(html, 'lxml')

        # Locate the row that separates the board's topic list, mainly to avoid pinned topics
        part_tr = soup.find(attrs={'class': 'bbs_tr4'})

        if part_tr is not None:
            items = part_tr.find_all_next(attrs={"name": "readlink"})  # detail-page links after the separator
        else:
            items = soup.find_all(attrs={"name": "readlink"})
        # Extract (title, detail-page URL) pairs
        data = [(item.text, f'http://www.jsly001.com/{item["href"]}') for item in items]
        # Visit each topic's detail page
        for name, url in data:
            self._get_imgs(name, url)

    def _get_imgs(self, name, url):
        """Parse the image URLs"""
        res = None  # stays None if the request fails
        try:
            res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
        except Exception as e:
            logging.error(e)
        # Image extraction logic
        if res is not None:
            soup = BeautifulSoup(res.text, 'lxml')
            origin_div1 = soup.find(attrs={'class': 'tpc_content'})
            origin_div2 = soup.find(attrs={'class': 'imgList'})
            content = origin_div2 if origin_div2 else origin_div1

            if content is not None:
                imgs = content.find_all('img')

                # print([img.get("src") for img in imgs])
                self._save_img(name, imgs)  # save the images

    def _save_img(self, name, imgs):
        """Save the images"""
        for img in imgs:
            url = img.get("src")
            if not url or url.find('http') < 0:
                continue
            # Look up the id attribute on the parent <span>, if there is one
            span = img.find_parent('span')
            id_ = span.get("id") if span is not None else None

            res = None  # stays None if the request fails
            try:
                res = requests.get(url=url, headers=self._headers, timeout=self._timeout)
            except Exception as e:
                logging.error(e)

            if res is not None:
                name = name.replace("/", "_")
                # Note: create the imgs folder in the working directory beforehand
                with open(f'./imgs/{name}_{id_}.jpg', "wb+") as f:
                    f.write(res.content)
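
    def _get_headers(self):
        # Candidate User-Agent strings; one is chosen at random
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua
        }
        return headers


if __name__ == '__main__':
    my_thread = LiYangThread()
    my_thread.start()  # start() runs run() in a new thread; calling run() directly would not
    my_thread.join()  # wait for the thread to finish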
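
The commented-out loop in run() marks where multithreading would be enabled, but the article does not show that part. The following is a minimal sketch of one possible arrangement, in which several threads pull page numbers from a shared queue.Queue. The names PageWorker, crawl_pages, and fetch_page are hypothetical, and fetch_page is a stand-in that returns a label rather than downloading anything.

```python
import queue
import threading


def fetch_page(page):
    """Stand-in for the real download step; returns a label, not HTML."""
    return f"page-{page}"


class PageWorker(threading.Thread):
    """Worker thread that takes page numbers from a shared queue."""

    def __init__(self, tasks, results, lock):
        super().__init__()
        self._tasks = tasks
        self._results = results
        self._lock = lock

    def run(self):
        while True:
            try:
                page = self._tasks.get_nowait()
            except queue.Empty:
                return  # no pages left, so the thread ends
            result = fetch_page(page)
            with self._lock:  # protect the shared results list
                self._results.append(result)


def crawl_pages(pages, worker_count=3):
    """Process every page with worker_count threads and return the results."""
    tasks = queue.Queue()
    for page in pages:
        tasks.put(page)
    results = []
    lock = threading.Lock()
    workers = [PageWorker(tasks, results, lock) for _ in range(worker_count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results


if __name__ == "__main__":
    print(sorted(crawl_pages(range(1, 6))))
```

The queue plays the role of the shared resource, so no page is handled twice. In a real crawler, fetch_page would be replaced by the request and parsing logic.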

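In this example, BeautifulSoup parses the HTML with the lxml parser, and later examples mostly use the same parser. Make sure the lxml package is installed before use; BeautifulSoup loads it by name, so an explicit import is not strictly required.

Data extraction relies on the soup.find() and soup.find_all() functions. The code also uses find_parent() to read the id attribute from a parent tag.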

# Look up the id attribute on the parent tag
id_ = img.find_parent('span').get("id")
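
The behavior of find_parent() can be tried on a small inline document. The HTML below, including the id value att_101 and the example.com URLs, is made up for illustration, and the sketch uses the built-in html.parser so that it runs without lxml. Because find_parent() returns None when no matching ancestor exists, the hypothetical helper parent_span_id guards against that case.

```python
from bs4 import BeautifulSoup

# Made-up markup: one image wrapped in a <span>, one without a <span>.
SAMPLE_HTML = """
<div class="tpc_content">
  <span id="att_101"><img src="http://example.com/a.jpg"></span>
  <img src="http://example.com/b.jpg">
</div>
"""


def parent_span_id(img):
    """Return the id of the nearest enclosing <span>, or None if there is none."""
    span = img.find_parent("span")
    return span.get("id") if span is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
ids = [parent_span_id(img) for img in soup.find_all("img")]
print(ids)
```

The second image has no enclosing span, which is the case where calling .get() directly on the result of find_parent() would raise an AttributeError.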

If DEBUG messages appear while the code runs, adjust the logging output level to suppress them.
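
One way to do this is sketched below, assuming the DEBUG lines come from urllib3, the HTTP library that requests is built on. The function name configure_logging is hypothetical.

```python
import logging


def configure_logging(level=logging.INFO):
    """Raise the root log level and quiet urllib3's connection messages."""
    logging.basicConfig(level=level)
    # basicConfig does nothing if handlers already exist, so set the level explicitly.
    logging.getLogger().setLevel(level)
    # Assumption: the noisy DEBUG output originates from the urllib3 logger.
    logging.getLogger("urllib3").setLevel(logging.WARNING)


configure_logging()
logging.debug("this message is filtered out")
logging.info("this message is shown")
```

Passing logging.ERROR instead keeps only the error messages that the scraper logs when a request fails.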
