
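Python Crawler: Garbled Text from XPath Parsing and How to Fix It

Updated: 2024-05-24 15:32:27 Author: 平人的進步日常

This article covers the garbled-text problem that appears when parsing with XPath in a Python crawler, and how to solve it. I hope it is a useful reference; if you spot mistakes or anything I have not fully considered, corrections are welcome.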

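Garbled text from XPath parsing in a Python crawler

Set the encoding right after making the request:

import requests
resp = requests.get(url, headers=headers)
resp.encoding = 'GBK'  # charset used to decode resp.text; match the page's real encoding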
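
To see why setting the encoding matters, here is a minimal sketch that uses only the standard library and no network access. The sample string and variable names are made up for illustration. It assumes the server sends GBK bytes while the client decodes them as ISO-8859-1, which is roughly what happens to resp.text before resp.encoding is corrected.

```python
# Bytes as a GBK-encoded page would send them (sample text is hypothetical)
raw = "中文网页".encode("gbk")

# Decoding as ISO-8859-1 never raises, it just yields garbage characters
wrong = raw.decode("iso-8859-1")

# Decoding with the charset the page really uses gives the original text back
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

If the page's charset is not known in advance, requests also exposes resp.apparent_encoding, a guess based on the body bytes, which can be assigned to resp.encoding.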

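Fixing â?? garbage when parsing HTML with XPath in Python (the HTML contains &#123;)

While scraping a web page I ran into another pitfall: garbled characters such as â kept appearing in the output, and the HTML source turned out to contain entities of the form &#number;.

The existing posts on "how to fix &# in Python strings" did not really solve it, so I kept searching and eventually worked it out after some effort.

It also deepened my understanding of encodings such as ISO-8859-1 and Unicode, so I am sharing it here.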
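
As an illustration of the &#number; form described above, the following sketch shows what such numeric entities turn into once they are unescaped. The entity values are an assumption chosen for the example (they are the three UTF-8 bytes of 中 written as separate character references), not values taken from the page in question.

```python
import html

# Hypothetical snippet: the UTF-8 bytes E4 B8 AD of "中", each written as its own entity
escaped = "&#228;&#184;&#173;"

# Unescaping alone yields three Latin-1 characters, which still look garbled
unescaped = html.unescape(escaped)

# Treating those characters as bytes and decoding them as UTF-8 recovers the text
repaired = unescaped.encode("iso-8859-1").decode("utf-8")

print(repr(unescaped))
print(repaired)
```

This suggests one reason why unescaping entities by itself may not be enough: the entities can carry the bytes of a mis-decoded string rather than real code points.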

Problem

When parsing HTML with Python's lxml, the results of the text() XPath call contain garbled characters such as â.

The original page as displayed in the browser (screenshot):

Scraping code:

import requests
from lxml import etree

url = "xxx"

response = requests.request("GET", url)

html = etree.HTML(response.text)

# Call text() directly in the XPath expression
description = html.xpath('//div[@class="xxx"]/div/div//text()')
# Print each text node as-is
for desc in description:
    print(desc)

Cause

Unsurprisingly, this is an encoding problem. The steps below show how to diagnose and fix it.
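
The mismatch can be reproduced without any network access. The sketch below assumes the page is UTF-8 and gets decoded as ISO-8859-1; the sample character is chosen for illustration. It also shows where a leading â tends to come from: many punctuation marks start with the byte E2 in UTF-8, and E2 is â in ISO-8859-1.

```python
# A right single quotation mark (U+2019), common in web text
original = "\u2019"

utf8_bytes = original.encode("utf-8")      # three bytes: E2 80 99
garbled = utf8_bytes.decode("iso-8859-1")  # each byte becomes its own character

print(utf8_bytes.hex())
print(repr(garbled))  # "â" followed by two control characters
```

The two trailing control characters have no visible glyph, so terminals often show them as question marks or boxes, which is plausibly how a pattern like â?? arises.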

Diagnosis and fix

First, check which encoding was used for the response:

response = requests.request("GET", url)
# Once you have the response, check which encoding requests chose for it
print(response.encoding)
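
Why would response.encoding report ISO-8859-1 for a page that is really UTF-8? When the Content-Type header is a text type without a charset parameter, requests falls back to ISO-8859-1. The function below is a simplified, hypothetical model of that rule written with the standard library only; it is not the code requests actually runs, and guess_header_encoding is a made-up name.

```python
from email.message import Message

def guess_header_encoding(content_type):
    """Hypothetical model: charset from the header, else ISO-8859-1 for text types."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased charset parameter, or None
    if charset:
        return charset
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None

print(guess_header_encoding("text/html"))
print(guess_header_encoding("text/html; charset=utf-8"))
```

Under this model, a server that sends only text/html leaves the client with ISO-8859-1 regardless of what the body bytes actually are.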

In my case it printed ISO-8859-1.

Then use that encoding to undo the wrong decoding:

html = etree.HTML(response.text)

description = html.xpath('//div[@class="xxx"]/div/div//text()')

for desc in description:
    # print(desc)
    # Per the result above, encode back to bytes with ISO-8859-1, then decode as UTF-8
    print(desc.encode("ISO-8859-1").decode("utf-8"))
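
The encode/decode round trip raises an exception when a string is not actually garbled, for example when it contains characters outside ISO-8859-1 or when its bytes are not valid UTF-8. A defensive variant might look like the sketch below. fix_mojibake is a hypothetical helper name, and falling back to the original string is an assumption about what you want in that case.

```python
def fix_mojibake(text, wrong="iso-8859-1", right="utf-8"):
    """Undo a wrong decode; return the text unchanged if the round trip fails."""
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in `wrong`, or the bytes are not valid `right`
        return text

garbled = "中文".encode("utf-8").decode("iso-8859-1")
print(fix_mojibake(garbled))  # repaired
print(fix_mojibake("中文"))    # already fine, returned as-is
print(fix_mojibake("café"))   # genuine Latin-1 text, returned as-is
```

In the loop above, print(fix_mojibake(desc)) could then replace the direct encode/decode call.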

With this change the text prints correctly.

Complete code:

import requests
from lxml import etree

url = "xxx"

response = requests.request("GET", url)
print(response.encoding)

html = etree.HTML(response.text)

description = html.xpath('//div[@class="xxx"]/div/div//text()')

for desc in description:
    print(desc.encode("ISO-8859-1").decode("utf-8"))
    # print(desc)

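Summary

Some posts online use HTMLParser, handed down from Python 2, and others use Python 3's html package; neither worked well for me.

Another approach is to change how the response is handed to the parser, like this: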

import requests
from lxml import etree

url = "xxx"

response = requests.request("GET", url)

# html = etree.HTML(response.text)
html = etree.HTML(response.content)  # pass the raw bytes instead of the decoded text

# Call text() directly in the XPath expression
description = html.xpath('//div[@class="xxx"]/div/div//text()')
# Print each text node as-is
for desc in description:
    print(desc)

This also parses correctly.
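
When given bytes, lxml can determine the encoding from the document itself, for example from a meta charset declaration. If the page declares nothing, the encoding can be stated explicitly through etree.HTMLParser(encoding=...). The sketch below assumes lxml is installed and uses an inline, made-up document instead of a real response; raw stands in for response.content.

```python
from lxml import etree

# Stand-in for response.content: UTF-8 bytes with no charset declaration (made up)
raw = '<html><body><div class="desc">中文</div></body></html>'.encode("utf-8")

# State the encoding explicitly instead of leaving it to be guessed
parser = etree.HTMLParser(encoding="utf-8")
html = etree.HTML(raw, parser=parser)

texts = html.xpath('//div[@class="desc"]/text()')
print(texts)
```

Compared with re-encoding each string after the fact, this fixes the decoding once at parse time, so every text node comes out correct.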

The above is my personal experience; I hope it gives you a useful reference.
