
A Step-by-Step Guide to etree in Python's lxml Library

Updated: 2025-03-13 11:02:57  Author: 閑人陳二狗

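This article explains how to use etree from Python's lxml library. The etree module in lxml provides a simple, flexible API for parsing and manipulating XML/HTML documents. Each step is illustrated with code, so it can serve as a working reference.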

I. Introduction to etree

lxml is a powerful XML processing library for Python. In short, its etree module provides a simple, flexible API for parsing and manipulating XML/HTML documents.

Official documentation: The lxml.etree Tutorial
Install: pip install lxml

II. Parsing HTML/XML with XPath

1. The first step is to load the HTML/XML source (a string or a file) into etree.

Syntax:

root = etree.XML(xml_source) # load XML
root = etree.HTML(html_source) # load HTML
Import the module first with from lxml import etree.

from lxml import etree

root = etree.XML("<root>data</root>")
print(root.tag)
#root
print(etree.tostring(root))
#b'<root>data</root>'
 
root = etree.HTML("<p>data</p>")
print(root.tag)
#html
print(etree.tostring(root))
#b'<html><body><p>data</p></body></html>'
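
The step above mentions files as well as strings, but the example only parses strings. The following is a minimal sketch of loading a file with etree.parse() and an HTMLParser. The temporary file and its content are made up for illustration.

```python
import os
import tempfile

from lxml import etree

# Write a small HTML document to a temporary file (hypothetical content).
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write("<html><body><p>data</p></body></html>")
    path = f.name

try:
    # etree.parse() returns an ElementTree; HTMLParser tolerates real-world HTML.
    tree = etree.parse(path, etree.HTMLParser())
    root = tree.getroot()
    first_p_text = tree.xpath("//p/text()")
finally:
    os.remove(path)

print(root.tag)       # html
print(first_p_text)   # ['data']
```

Unlike etree.HTML(), which returns the root element directly, etree.parse() returns a tree object, so getroot() is needed to reach the root element.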

2. Locating nodes with XPath expressions

XPath uses path expressions to select nodes in an HTML/XML document. Nodes are selected by following a path, that is, a sequence of steps. The most useful path expressions are listed below:

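Expression	Description
/	Selects from the root node (steps to child nodes)
//	Selects matching nodes anywhere in the document, regardless of position (steps to descendants)
.	Selects the current node
..	Selects the parent of the current node
@	Selects an attribute
contains(@attr, 'substring')	Fuzzy (substring) match
text()	Text content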
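
As a quick illustration of the table, the sketch below runs each expression against a tiny document. The shop/book markup is invented for this example and is not part of the original article.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
doc = etree.XML(
    "<shop>"
    "<book id='b1'><title>Python Basics</title></book>"
    "<book id='b2'><title>XPath Notes</title></book>"
    "</shop>"
)

children = doc.xpath("/shop/book")       # /  : step from the root to children
titles = doc.xpath("//title")            # // : match at any depth
first_title = titles[0]
same_node = first_title.xpath(".")[0]    # .  : the current node
parent = first_title.xpath("..")[0]      # .. : the parent node
ids = doc.xpath("//book/@id")            # @  : attribute values
matches = doc.xpath("//title[contains(text(), 'XPath')]/text()")

print(len(children))                 # 2
print(same_node.tag)                 # title
print(parent.tag, parent.get("id"))  # book b1
print(ids)                           # ['b1', 'b2']
print(matches)                       # ['XPath Notes']
```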

① Locating by attribute

html.xpath(".//tag[@attr='value']") # note: this returns a list!
[] : a predicate; the element is filtered by the condition inside it
@ : followed by the attribute name, i.e. which attribute to locate by

from lxml import etree
 
ht = """<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p>This is another paragraph, with a
      <a rel="external nofollow">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>"""
 
html = etree.HTML(ht)
 
title = html.xpath(".//h1[@class='title']")[0] # take the first element of the list
print(etree.tostring(title))
#b'<h1 class="title">Hello!</h1>\n    '
print(title.get('class'))
# title
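
Attribute predicates can also be combined, and an attribute value can be selected directly with /@attr. The sketch below shows both. The link markup is a made-up example, not taken from the article.

```python
from lxml import etree

# Hypothetical navigation markup, used only for illustration.
html = etree.HTML(
    "<body>"
    "<a class='nav' href='/home'>Home</a>"
    "<a class='nav' href='/docs' target='_blank'>Docs</a>"
    "<a class='footer' href='/about'>About</a>"
    "</body>"
)

# Select attribute values directly instead of calling .get() on each element.
nav_links = html.xpath("//a[@class='nav']/@href")
# Combine two attribute conditions with "and".
new_tab = html.xpath("//a[@class='nav' and @target='_blank']")
# A bare @attr predicate tests only that the attribute exists.
with_target = html.xpath("//a[@target]")

print(nav_links)           # ['/home', '/docs']
print(new_tab[0].text)     # Docs
print(len(with_target))    # 1
```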

② Locating by text and retrieving text

ele = html.xpath(".//tag[text()='text value']")[0]
text1 = ele.text # method 1: ele is the element that was located
text2 = html.xpath("string(.//tag[@attr='value'])") # method 2: returns a string
text3 = html.xpath(".//tag[@attr='value']/text()") # method 3: returns a list of strings

title1 = html.xpath(".//h1[text()='Hello!']")[0] # take the first element of the list
text1 = title1.text
print(text1)
#Hello!
text2 = html.xpath("string(.//h1[@class='title'])")
print(text2)
#Hello!
text3 = html.xpath(".//h1[@class='title']/text()") # returns a list
print(text3)
#['Hello!']
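
The three methods give the same result for h1 because it contains only text. They differ once an element has child tags mixed into its text. A small sketch using the paragraph with the bold tag from the sample document:

```python
from lxml import etree

html = etree.HTML("<p>This is a paragraph with <b>bold</b> text in it!</p>")
p = html.xpath("//p")[0]

# .text holds only the text before the first child element.
print(repr(p.text))          # 'This is a paragraph with '
# text() returns the direct text nodes, skipping the text inside <b>.
print(p.xpath("text()"))     # ['This is a paragraph with ', ' text in it!']
# string(.) concatenates all descendant text into one string.
print(p.xpath("string(.)"))  # This is a paragraph with bold text in it!
# itertext() yields the same pieces in document order.
print("".join(p.itertext())) # This is a paragraph with bold text in it!
```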

③ Locating by hierarchy

In practice, the target element often has no basic attributes such as id, name, or class. In that case, locate a nearby element first and then reach the target through their hierarchical relationship.

html.xpath(".//parent_tag[@parent_attr='value']/child_tag") # top-down hierarchy; the target is the child
html.xpath(".//child_tag[@child_attr='value']/parent::parent_tag") # parent-child lookup; the target is the parent
html.xpath(".//tag[@attr='value']/preceding-sibling::sibling_tag") # targets the siblings that come before the element
html.xpath(".//tag[@attr='value']/following-sibling::sibling_tag") # targets the siblings that come after the element

from lxml import etree
 
ht = """<html>
  <head>
    <title>This is a sample document</title>
  </head>
  <body>
    <h1 class="title">Hello!</h1>
    <p>This is a paragraph with <b>bold</b> text in it!</p>
    <p class="para">This is another paragraph, with a
      <a rel="external nofollow">link</a>.</p>
    <p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>
    <p>And finally an embedded XHTML fragment.</p>
  </body>
</html>"""
 
html = etree.HTML(ht)
 
ele1 = html.xpath(".//p[@class='para']/a")[0] # top-down: from parent to child
print(etree.tostring(ele1))
#b'<a rel="external nofollow">link</a>.'
 
ele2 = html.xpath(".//a[@rel='external nofollow']/parent::p")[0] # from child to parent
print(etree.tostring(ele2))
#b'<p class="para">This is another paragraph, with a\n      <a rel="external nofollow">link</a>.</p>\n    '
 
ele3 = html.xpath(".//p[@class='para']/preceding-sibling::p")[0] # preceding sibling
print(etree.tostring(ele3))
#b'<p>This is a paragraph with <b>bold</b> text in it!</p>\n    '
 
ele4 = html.xpath(".//p[@class='para']/following-sibling::p") # following siblings
for ele in ele4:
    print(etree.tostring(ele))
    #b'<p>Here are some reserved characters: &lt;spam&amp;egg&gt;.</p>\n    '
    #b'<p>And finally an embedded XHTML fragment.</p>\n  '
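
Once an element has been located, the same relationships can also be followed from Python with lxml's element methods instead of XPath axes. The sketch below compares the two approaches. The markup is a shortened, made-up variant of the sample document.

```python
from lxml import etree

# Shortened, hypothetical variant of the sample document.
html = etree.HTML(
    "<body><h1>Hello!</h1><p>first</p>"
    "<p class='para'>second <a>link</a></p><p>third</p></body>"
)
para = html.xpath("//p[@class='para']")[0]

parent = para.getparent()             # same target as parent::*
older = para.getprevious()            # nearest preceding sibling element
younger = para.getnext()              # nearest following sibling element
child_tags = [c.tag for c in para]    # iterating an element yields its children

# The XPath equivalents; [1] on a sibling axis picks the nearest sibling.
older_xpath = para.xpath("preceding-sibling::*[1]")[0]
younger_xpath = para.xpath("following-sibling::*[1]")[0]

print(parent.tag)                       # body
print(older.text, younger.text)         # first third
print(child_tags)                       # ['a']
print(older_xpath.text, younger_xpath.text)  # first third
```

The element methods return a single element or None, which avoids indexing into a list when only the nearest neighbour is needed.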

④ Locating by index

There are two main ways to locate by index when using etree with XPath. The first works because html.xpath() returns a list, which can be indexed like any Python list:

html.xpath("xpath expression")[0] # first element of the list
html.xpath("xpath expression")[-1] # last element of the list
html.xpath("xpath expression")[-2] # second-to-last element of the list

ele1 = html.xpath(".//body/p")[0]
print(etree.tostring(ele1))
#b'<p>This is a paragraph with <b>bold</b> text in it!</p>\n    '
 
ele1 = html.xpath(".//body/p")[-1]
print(etree.tostring(ele1))
#b'<p>And finally an embedded XHTML fragment.</p>\n  '

Syntax 2:

html.xpath("xpath expression[1]")[0] # first element
html.xpath("xpath expression[last()]")[0] # last element
html.xpath("xpath expression[last()-1]")[0] # second-to-last element

 Note: unlike Python list indexes, which start at 0, XPath positional indexes start at 1.
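
One detail worth checking is that an XPath positional index applies per parent, so //p[1] can return several elements, whereas (//p)[1] indexes the whole result set. A small sketch with an invented two-div document:

```python
from lxml import etree

# Hypothetical document with paragraphs under two different parents.
html = etree.HTML(
    "<body>"
    "<div><p>a1</p><p>a2</p></div>"
    "<div><p>b1</p><p>b2</p><p>b3</p></div>"
    "</body>"
)

per_parent_first = html.xpath("//p[1]/text()")      # first <p> of each parent
overall_first = html.xpath("(//p)[1]/text()")       # first <p> in the document
per_parent_last = html.xpath("//p[last()]/text()")  # last <p> of each parent
overall_last = html.xpath("(//p)[last()]/text()")   # last <p> in the document
python_last = html.xpath("//p/text()")[-1]          # Python-side indexing

print(per_parent_first)   # ['a1', 'b1']
print(overall_first)      # ['a1']
print(per_parent_last)    # ['a2', 'b3']
print(overall_last)       # ['b3']
print(python_last)        # b3
```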

⑤ Fuzzy matching

Sometimes an attribute value is very long. In that case fuzzy matching helps, because only part of the attribute value is needed.

html.xpath(".//tag[starts-with(@attr, 'value prefix')]") # match the start
html.xpath(".//tag[ends-with(@attr, 'value suffix')]") # match the end
html.xpath(".//tag[contains(text(), 'partial text')]") # contains the partial text

 Note: ends-with() is an XPath 2.0 function, while etree supports only XPath 1.0, so calling it raises an XPathEvalError instead of returning a match.

ele1 = html.xpath(".//p[starts-with(@class,'par')]")[0] # match the start
print(etree.tostring(ele1))
#b'<p class="para">This is another paragraph, with a\n      <a rel="external nofollow">link</a>.</p>\n    '
 
try:
    ele2 = html.xpath(".//p[ends-with(@class, 'ara')]")[0] # match the end (XPath 2.0 only)
    print(etree.tostring(ele2))
except etree.XPathEvalError as e:
    print(e) # lxml rejects ends-with() because it is not an XPath 1.0 function
 
ele3 = html.xpath(".//p[contains(text(),'is a paragraph with')]")[0] # contains "is a paragraph with"
print(etree.tostring(ele3))
#b'<p>This is a paragraph with <b>bold</b> text in it!</p>\n    '
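
Since ends-with() is unavailable, a suffix match can be emulated in XPath 1.0 with substring() and string-length(). This is a workaround sketch rather than anything from the original article. The class names and the $suffix variable name are made up for illustration.

```python
from lxml import etree

# Hypothetical markup: two class values end in "ara", one does not.
html = etree.HTML(
    "<body>"
    "<p class='para'>one</p>"
    "<p class='parable'>two</p>"
    "<p class='sahara'>three</p>"
    "</body>"
)

# ends-with() is not an XPath 1.0 function, so lxml raises an error.
raised = False
try:
    html.xpath("//p[ends-with(@class, 'ara')]")
except etree.XPathEvalError:
    raised = True

# Emulation: take the last len(suffix) characters and compare them.
# $suffix is an XPath variable, filled in from the keyword argument.
expr = (
    "//p[substring(@class, string-length(@class) - string-length($suffix) + 1)"
    " = $suffix]/text()"
)
matches = html.xpath(expr, suffix="ara")

print(raised)    # True
print(matches)   # ['one', 'three']
```

Passing the suffix as an XPath variable keeps quoting problems out of the expression when the value comes from user input.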