
How to Process Data with XPath Expressions in Python

Updated: 2020-06-13 15:53:24  Author: _夕顏

This article shows how to process data with XPath expressions in Python, with detailed example code throughout. It may be a useful reference for study or work.

XPath expressions

1. XPath syntax

<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>

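1.1 Selecting nodes

XPath uses path expressions to select nodes or node sets in an XML document. These path expressions look very much like the paths used in an ordinary computer file system.

When you select tags with a Chrome XPath plugin, the plugin adds the attribute class="xh-highlight" to each selected tag. That attribute is added by the plugin and is not part of the page's own source.

The most useful expressions are listed below: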

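Expression	Description
nodename	Selects elements with this name.
/	Selects from the root node, or separates one element step from the next.
//	Selects matching nodes at any depth: anywhere in the document when the path starts with //, otherwise anywhere below the preceding step.
.	Selects the current node.
..	Selects the parent of the current node.
@	Selects attributes.
text()	Selects text.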

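Examples

Path expression	Result
bookstore	Selects bookstore elements that are children of the current node.
/bookstore	Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) always represents an absolute path to an element!
bookstore/book	Selects all book elements that are children of bookstore.
//book	Selects all book elements, wherever they are in the document.
bookstore//book	Selects all book elements that are descendants of bookstore, however deep below bookstore they sit.
//book/title/@lang	Selects the lang attribute values of every title under a book.
//book/title/text()	Selects the text of every title under a book.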

Select the text under every h1
//h1/text()
Get the href of every a tag
//a/@href
Get the text of the title under head under html
/html/head/title/text()
Get the href of the link tags under head under html
/html/head/link/@href
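
The expressions above can be tried against the bookstore document from section 1 with lxml. This is a minimal sketch, not the only way to run them: it assumes lxml is installed, and the variable names (xml, root, and so on) are chosen here purely for illustration.

```python
from lxml import etree

# The bookstore sample from section 1
xml = """
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
"""

root = etree.fromstring(xml)  # root is the <bookstore> element

# Absolute path: starts at the document root, whatever the context node is
books = root.xpath("/bookstore/book")

# Relative path: evaluated from the context node (here, <bookstore>)
child_books = root.xpath("book")

# // searches at any depth
all_books = root.xpath("//book")

# @ selects attribute values, text() selects text; both come back as strings
langs = root.xpath("//book/title/@lang")
titles = root.xpath("//book/title/text()")

# .. steps up to the parent of the current node
parent_tag = books[0].xpath("..")[0].tag

print(len(books), len(child_books), len(all_books))  # 2 2 2
print(langs)       # ['eng', 'eng']
print(titles)      # ['Harry Potter', 'Learning XML']
print(parent_tag)  # bookstore
```

Note that xpath() always returns a list here, even when only one node matches.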

1.2 Finding specific nodes

Path expression	Result
//title[@lang="eng"]	Selects all title elements whose lang attribute is eng.
/bookstore/book[1]	Selects the first book element that is a child of bookstore.
/bookstore/book[last()]	Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()>1]	Selects the book elements under bookstore, starting from the second one.
//book/title[text()='Harry Potter']	Selects only those title elements under a book whose text is Harry Potter.
/bookstore/book[price>35.00]/title	Selects the title elements of those book elements under bookstore whose price element has a value greater than 35.00.

Note: in XPath, positions start at 1. The first element is at position 1, the last is at last(), and the second to last is at last()-1.
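
As a rough illustration of these predicates, the sketch below runs them against the same bookstore sample using lxml. The price threshold of 900 is chosen here only so that the filter separates the two books (with 35.00, as in the table, both books match); the variable names are likewise illustrative.

```python
from lxml import etree

root = etree.fromstring("""
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
""")

# Positions are 1-based, not 0-based as in Python lists
first = root.xpath("/bookstore/book[1]/title/text()")
last = root.xpath("/bookstore/book[last()]/title/text()")
# With only two books, the second to last is also the first
second_to_last = root.xpath("/bookstore/book[last()-1]/title/text()")
from_second = root.xpath("/bookstore/book[position()>1]/title/text()")

# Filter by attribute value, by text, and by the numeric value of a child element
by_attr = root.xpath("//title[@lang='eng']")
by_text = root.xpath("//book/title[text()='Harry Potter']")
expensive = root.xpath("/bookstore/book[price>900]/title/text()")

print(first)           # ['Harry Potter']
print(last)            # ['Learning XML']
print(second_to_last)  # ['Harry Potter']
print(from_second)     # ['Learning XML']
print(len(by_attr), len(by_text))  # 2 1
print(expensive)       # ['Harry Potter']
```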

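1.3 Selecting unknown nodes

XPath wildcards can be used to select XML elements whose names are not known in advance.

Wildcard	Description
*	Matches any element node.
@*	Matches any attribute node.
node()	Matches a node of any kind.

Examples

The table below lists some path expressions and their results:

Path expression	Result
/bookstore/*	Selects all child elements of the bookstore element.
//*	Selects all elements in the document.
//title[@*]	Selects all title elements that have at least one attribute.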
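
The wildcards can be checked the same way. The following is a small sketch against the bookstore sample, assuming lxml; it also shows that node() matches the whitespace text nodes between elements, which * does not.

```python
from lxml import etree

root = etree.fromstring("""
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
""")

# * matches element nodes only
children = [el.tag for el in root.xpath("/bookstore/*")]
all_tags = [el.tag for el in root.xpath("//*")]

# @* matches any attribute; as a predicate it keeps elements that have one
titles_with_attrs = root.xpath("//title[@*]")
attr_values = root.xpath("//title/@*")

# node() also matches text nodes, including whitespace between elements
element_children = root.xpath("/bookstore/book[1]/*")
node_children = root.xpath("/bookstore/book[1]/node()")

print(children)                # ['book', 'book']
print(len(all_tags))           # 7
print(len(titles_with_attrs))  # 2
print(attr_values)             # ['eng', 'eng']
print(len(node_children) > len(element_children))  # True
```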

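1.4 Selecting several paths

By using the "|" operator in a path expression, you can select several paths at once.

Examples

The table below lists some path expressions and their results:

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements.
//title | //price	Selects all title and price elements in the document.
/bookstore/book/title | //price	Selects all title elements of book elements under bookstore, plus all price elements in the document.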
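
A short sketch of the "|" operator on the bookstore sample, again assuming lxml. The result is a single list containing the nodes matched by either path; a node matched by both paths appears only once.

```python
from lxml import etree

root = etree.fromstring("""
<bookstore>
<book>
 <title lang="eng">Harry Potter</title>
 <price>999</price>
</book>
<book>
 <title lang="eng">Learning XML</title>
 <price>888</price>
</book>
</bookstore>
""")

# Union of two paths: every title and every price under a book
both = root.xpath("//book/title | //book/price")
pairs = [(el.tag, el.text) for el in both]

# The same node matched by both sides is not duplicated
no_duplicates = root.xpath("//price | /bookstore/book/price")

for tag, value in pairs:
    print(tag, value)
print(len(both), len(no_duplicates))  # 4 2
```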

Example:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''

html = etree.HTML(text)

# Get the list of hrefs and the list of link texts
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")


# Assemble each pair into a dict.
# zip pairs the two lists by position, so this assumes they line up one to one.
for href, title in zip(href_list, title_list):
    item = {"href": href, "title": title}
    print(item)

# When an XPath expression selects element nodes, the result is a list of Element objects,
# and each of them supports xpath() in turn. During extraction we can therefore
# group by some tag first, then extract the data within each group.
li_list = html.xpath("//li[@class='item-1']")

# Extract the data within each group
for li in li_list:
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/text()")
    item = {
        "href": hrefs[0] if hrefs else None,
        "title": titles[0] if titles else None,
    }
    print(item)
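
To see why grouping first is the safer of the two approaches, consider a case where one link has no text. The sketch below uses a hypothetical HTML fragment (not the one above) in which the second link is empty; the names paired and grouped are illustrative only.

```python
from lxml import etree

# Hypothetical fragment: the second <a> has an href but no text
text = """
<ul>
  <li class="item-1"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html"></a></li>
  <li class="item-1"><a href="link3.html">third item</a></li>
</ul>
"""
html = etree.HTML(text)

# Two independent lists: 3 hrefs but only 2 texts, so positions no longer match
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
paired = list(zip(href_list, title_list))

# Grouping by <li> first keeps each href with its own text (or None)
grouped = []
for li in html.xpath("//li[@class='item-1']"):
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/text()")
    grouped.append({
        "href": hrefs[0] if hrefs else None,
        "title": titles[0] if titles else None,
    })

print(paired)   # link2.html is wrongly paired with 'third item'
print(grouped)  # link2.html gets title None
```

With the two-list approach, a single missing value shifts every later pairing; with grouping, the gap stays local to the item that has it.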

That is all for this article. I hope it helps with your studies.
