
A Solution for Scraping Specified Content with a Python Crawler

Updated: 2022-06-14 09:03:16 Author: 皓_月

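This article shows how to scrape specified content from a website with a Python crawler. In most cases the content can be pulled straight from the page with XPath. When the XPath expression matches more than one item, however, XPath alone cannot pick out the specific item we want.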

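Solution:

Use a for loop together with an in test to decide which items to keep.
If the specified content (the keyword) appears in an item, scrape that item; otherwise discard it.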
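
The check described above can be sketched on its own, without any network access. This is only a minimal illustration, not the full crawler: the function name `filter_by_keyword` and the sample titles and links are hypothetical, made up to stand in for what an XPath query might return.

```python
def filter_by_keyword(names, urls, key):
    """Return (name, url) pairs whose name contains the keyword."""
    matches = []
    # Walk titles and links in step; keep a pair only if the keyword is in the title.
    for name, url in zip(names, urls):
        if key in name:
            matches.append((name, url))
    return matches


# Hypothetical sample data standing in for XPath results.
names = ['Library opening hours', 'Party member study session', 'Sports day results']
urls = ['/info/1.html', '/info/2.html', '/info/3.html']

print(filter_by_keyword(names, urls, 'Party member'))
# prints [('Party member study session', '/info/2.html')]
```

Items whose title does not contain the keyword are simply skipped, which is the "discard" branch of the rule.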

Example code follows (using our school's website as the example):

import urllib.request
from lxml import etree

def creat_url(page):
    # Page 1 has no numeric suffix; later pages are 9260_<page>.html.
    if page == 1:
        url = 'https://www.qjnu.edu.cn/channels/9260.html'
    else:
        url = 'https://www.qjnu.edu.cn/channels/9260_' + str(page) + '.html'
    # Send a browser-like User-Agent with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def creat_respons(request):
    # Fetch the listing page and decode it as UTF-8 text.
    respons = urllib.request.urlopen(request)
    content = respons.read().decode('utf-8')
    return content

def down_2(url):
    # Fetch one article page and return its parsed HTML tree.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36 Edg/100.0.1185.29'
    }
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    content2 = response.read().decode('utf-8')
    tree2 = etree.HTML(content2)
    return tree2

def down_loads(content, key):
    tree = etree.HTML(content)
    # Both lists are read from the same <a> elements, so entry i of one
    # is assumed to match entry i of the other.
    name_list = tree.xpath('//div[@class="media"]/h4/a/text()')
    url_list = tree.xpath('//div[@class="media"]/h4/a/@href')
    for i in range(len(name_list)):
        # Keep an article only if the keyword appears in its title.
        if key in name_list[i]:
            # Append the matching link to a running list of URLs.
            with open('school_party_member_topic_urls.txt', 'a', encoding='UTF-8') as fp:
                fp.write(url_list[i] + '\n')
            url = url_list[i]
            article_tree = down_2(url)
            tex_list = article_tree.xpath('//div[@class="field-item even"]//p/span/text()')
            name = name_list[i]
            # Save the article text, one text node per line, in a file named after the title.
            with open(name + '.txt', 'w', encoding='UTF-8') as fp:
                fp.write('\n'.join(tex_list))

if __name__ == '__main__':
    all_page = int(input('Number of pages to scrape: '))
    key = input('Keyword: ')
    s_page = 1
    for page in range(s_page, all_page + 1):
        request = creat_url(page)
        content = creat_respons(request)
        down_loads(content, key)

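This code runs without problems, and its logic hangs together.
However, it contains a fair amount of redundancy and looks somewhat complicated; a simplified version is in the works.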
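
One possible way to reduce the redundancy noted above: the header and request setup is repeated for the listing pages and for the article pages, so both could share a single helper. The sketch below is an assumption about how that might look, not the author's simplified version. The names `HEADERS`, `BASE`, `page_url`, `build_request`, and `fetch_html` are hypothetical, and it assumes a single User-Agent string is acceptable for every request.

```python
import urllib.request

# Assumption: one User-Agent string is enough for every request.
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.53'
}
BASE = 'https://www.qjnu.edu.cn/channels/9260'


def page_url(page):
    """Build the listing URL: page 1 has no suffix, later pages use _<page>."""
    suffix = '' if page == 1 else '_' + str(page)
    return BASE + suffix + '.html'


def build_request(url):
    """Wrap any URL in a Request carrying the shared headers."""
    return urllib.request.Request(url=url, headers=HEADERS)


def fetch_html(url):
    """Download a page and decode it as UTF-8 (performs a network request)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read().decode('utf-8')


print(page_url(1))  # prints https://www.qjnu.edu.cn/channels/9260.html
print(page_url(3))  # prints https://www.qjnu.edu.cn/channels/9260_3.html
```

With helpers like these, both the listing pages and the individual article pages would go through `fetch_html`, and the returned text can be handed to `etree.HTML` exactly as in the code above.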
