XPath Cannot Locate the tbody Tag: Solutions and Examples

Updated: 2023-09-13 09:28:48 Author: ponponon

This article explains why an XPath expression may fail to locate the tbody tag and shows, with examples, how to fix it.

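Introduction

If you scrape with selenium, a body is always present. If you scrape with requests, a body is not necessarily present.

The browser automatically adds a body when the source has none.

So when you scrape with requests, analyze the HTML tree; when you scrape with selenium, analyze the render tree.

The HTML tree is the HTML content shown in the Network tab of the browser's developer tools; the render tree is the content shown in the Elements tab.

My earlier explanation had some problems, so I am updating this post to correct it.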

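tbody cannot be located because of how the standard treats it: tbody is not required to appear in the source.

In Chrome's Elements tab, tbody is always present.

But the page as its developer wrote it does not necessarily contain tbody.

Whether or not the returned HTML contains tbody, Chrome's Elements tab will show one (if it is already there, nothing is added; if it is missing, Chrome adds it automatically).

So when you fetch page data with selenium, include tbody in the XPath, because what selenium returns always contains tbody (it returns the content of Chrome's Elements tab).

When you fetch with requests, check for yourself whether the source HTML really contains a tbody tag (you can inspect it in Chrome's Network tab).

Summary: the HTML returned by the server does not necessarily contain a tbody tag (that depends on whether the site's front-end developer wrote one), but the rendered HTML produced by Chrome always contains it (if the server's response lacks it, the browser adds it for you).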
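The check described above (does the raw source really contain tbody?) can also be done in code instead of by eye. The following is a minimal sketch using only the standard library; the names `TagFinder` and `source_has_tag` and the two sample strings are illustrative assumptions, not part of the original article.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Records whether a given start tag literally appears in the markup."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lowercase and does not insert
        # implied elements, so this reflects the raw source only.
        if tag == self.tag_name:
            self.found = True


def source_has_tag(html_text, tag_name="tbody"):
    """Return True if the raw HTML text contains the given start tag."""
    finder = TagFinder(tag_name)
    finder.feed(html_text)
    finder.close()
    return finder.found


# Hypothetical sample markup standing in for a server response.
without_tbody = "<table><tr><td>cell</td></tr></table>"
with_tbody = "<table><tbody><tr><td>cell</td></tr></tbody></table>"

print(source_has_tag(without_tbody))  # False
print(source_has_tag(with_tbody))     # True
```

In practice the text passed to `source_has_tag` would be the body of the response fetched with requests, which corresponds to what the Network tab shows.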

The original article follows:
Written on 2019-10-29

Test library: lxml; test link: http://www.sxchxx.com/index-13-1075-1.html

Discovering the problem

I like to parse web pages with XPath, but the result is often an empty list.

1.1 etree.HTML

from lxml import etree
import requests

# Page used for the test
url = 'http://www.sxchxx.com/index-13-1075-1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}
response = requests.get(url, headers=headers)
parser = etree.HTMLParser(encoding="utf-8")
# Parse the raw HTML returned by the server
html = etree.HTML(response.text, parser=parser)
# The rule contains no tbody step
result = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
print(result)

Parsing the page with the code above returns the body text.

Note, however, that the rule contains no tbody tag. In fact, adding tbody turns the result into an empty list:

html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')  # this returns an empty list
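The behavior above can be reproduced without any network access. The sketch below assumes lxml is installed (the article already uses it) and feeds `etree.HTML` a small hypothetical snippet that mimics the page layout; the markup and its text are made up for illustration. Since lxml's HTML parser does not insert a tbody that the source lacks, only the rule without tbody matches.

```python
from lxml import etree

# Hypothetical stand-in for a server response whose tables have no <tbody>.
raw_html = """
<div id="large_mid">
  <table><tr><td>first table</td></tr></table>
  <table>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
    <tr><td><p>body text</p></td></tr>
  </table>
</div>
"""

tree = etree.HTML(raw_html)

# Same two rules as in the article, with and without the tbody step.
without_tbody = tree.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
with_tbody = tree.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')

print(without_tbody)  # ['body text']
print(with_tbody)     # []
```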

1.2 etree.parse

With etree.parse, the behavior is exactly the opposite of etree.HTML.

from lxml import etree

parser = etree.HTMLParser(encoding="utf-8")
# Load the page from the locally saved file
html = etree.parse('test.html', parser=parser)

# This time the rule includes a tbody step
content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
print(content)

After saving the page as test.html and loading it with etree.parse, the rule must include tbody to get the expected result; without tbody the result is an empty list.
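When a rule was written against a tree that contains tbody (for example, the saved file here) and has to run against raw HTML that lacks it, the tbody step can be removed mechanically. This is a rough sketch using a regular expression; `strip_tbody` is a hypothetical helper name, and it only handles plain `/tbody` steps without a predicate such as `tbody[2]`.

```python
import re


def strip_tbody(xpath):
    """Remove every plain /tbody step from an XPath string."""
    # Match "/tbody" only when it is a whole step, i.e. followed by
    # another "/" or by the end of the expression.
    return re.sub(r"/tbody(?=/|$)", "", xpath)


rule = '//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()'
print(strip_tbody(rule))
# //*[@id="large_mid"]/table[2]/tr[3]/td/p//text()
```

A string rewrite like this is fragile for complex expressions, so it is best treated as a convenience for simple rules of the kind used in this article.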

1.3 Code comparison

from lxml import etree
import requests

# Case 1: locally saved file, rule includes tbody
parser = etree.HTMLParser(encoding="utf-8")
html = etree.parse('test.html', parser=parser)
content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
print(content)
print('----------------divider-------------------')

# Case 2: raw HTML fetched with requests, rule omits tbody
url = 'http://www.sxchxx.com/index-13-1075-1.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}
response = requests.get(url, headers=headers)
parser = etree.HTMLParser(encoding="utf-8")
html = etree.HTML(response.text, parser=parser)
content = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
print(content)
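Rather than keeping two rules, one option is a single XPath union that matches the row whether it sits directly under table or under tbody. The sketch below assumes lxml is installed; `RAW`, `RENDERED`, `RULE`, and `extract` are hypothetical names, and the two snippets are simplified stand-ins for the raw response and the browser-rendered page.

```python
from lxml import etree

# Hypothetical snippets: the same table as served raw and as a browser
# would serialize it after inserting <tbody>.
RAW = ('<div id="large_mid"><table>'
       '<tr><td><p>body text</p></td></tr>'
       '</table></div>')
RENDERED = ('<div id="large_mid"><table><tbody>'
            '<tr><td><p>body text</p></td></tr>'
            '</tbody></table></div>')

# The union (|) covers both tree shapes with one rule.
RULE = ('//*[@id="large_mid"]/table/tr/td/p//text()'
        ' | //*[@id="large_mid"]/table/tbody/tr/td/p//text()')


def extract(html_text):
    """Apply the combined rule to an HTML string."""
    return etree.HTML(html_text).xpath(RULE)


print(extract(RAW))       # ['body text']
print(extract(RENDERED))  # ['body text']
```

The trade-off is a longer rule in exchange for not having to know in advance which kind of tree is being parsed.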