Basic Usage of the Python lxml Module

Updated: December 21, 2019 11:21:51  Author: Dylan HU

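This article introduces the basic usage of the Python lxml module. Through worked examples it covers installing lxml, common operations, and related pitfalls. Readers who need this may find it a useful reference.

The examples below walk through the basic usage of the Python lxml module, as follows: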

1 Installing lxml

Install with: pip install lxml
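A quick way to confirm the installation worked is to import the package and print its version. This is a minimal sketch; it assumes lxml was installed into the same interpreter that runs the snippet.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers describing the installed lxml release.
installed_version = etree.LXML_VERSION
print(installed_version)
```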

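2 Using lxml

2.1 Getting started with the lxml module

Import the etree library from lxml (if your editor shows no autocompletion hint for this import, that does not mean it cannot be used)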

from lxml import etree

Use etree.HTML to convert a string into an Element object. It accepts both bytes and str input. An Element object has an xpath method, which returns a list of results.

html = etree.HTML(text)
ret_list = html.xpath("xpath string")

To convert an Element object back into a string, use etree.tostring(element), which returns bytes.
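The points above can be checked with a small sketch. The snippet string is a made-up example, not taken from the article; it shows that etree.HTML accepts both str and bytes, and that etree.tostring returns bytes which can then be decoded.

```python
from lxml import etree

snippet = "<p>hello</p>"  # hypothetical input, for illustration only

# etree.HTML accepts either str or bytes
from_str = etree.HTML(snippet)
from_bytes = etree.HTML(snippet.encode("utf-8"))

# tostring returns bytes; decode() turns it into str
raw = etree.tostring(from_str)
print(type(raw))
print(raw.decode())
```

Note that the parser wraps the fragment in html and body elements, so the serialized result is a complete document rather than the bare fragment.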

Suppose we have the following HTML string and want to operate on it

<div> <ul> 
<li class="item-1"><a href="link1.html">first item</a></li> 
<li class="item-1"><a href="link2.html">second item</a></li> 
<li class="item-inactive"><a href="link3.html">third item</a></li> 
<li class="item-1"><a href="link4.html">fourth item</a></li> 
<li class="item-0"><a href="link5.html">fifth item</a> # Note: the closing </li> tag is missing here 
</ul> </div>

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html">first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
# Parse the string into an Element tree
html = etree.HTML(text)
print(type(html))
# Serialize the tree back to bytes, then decode to str for printing
handled_html_str = etree.tostring(html).decode()
print(handled_html_str)

Output:

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
        </li></ul> </div> </body></html>

As the output shows, lxml does fill in the missing tag. Keep in mind, however, that lxml is written by people. A page may be badly formed, or lxml itself may have a bug, so that data cannot be extracted even when your XPath is written against the response returned for the URL. In that case, use etree.tostring to see what etree actually turned the HTML into, and write your extraction against that converted HTML string.
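The debugging step described above can be sketched as follows. The malformed fragment is a made-up example; the keyword arguments passed to etree.tostring (method, encoding, pretty_print) are standard lxml options, chosen here only to make the repaired markup easier to read.

```python
from lxml import etree

# Hypothetical malformed fragment: neither <li> is closed.
broken = "<ul><li>a<li>b</ul>"
tree = etree.HTML(broken)

# encoding="unicode" makes tostring return str instead of bytes;
# method="html" serializes using HTML rules.
repaired = etree.tostring(tree, method="html", encoding="unicode", pretty_print=True)
print(repaired)

# XPath expressions should be written against the repaired structure.
items = tree.xpath("//li/text()")
print(items)
```

Printing the repaired markup first, and only then writing XPath expressions, avoids guessing at how the parser resolved the broken nesting.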

2.2 Further practice with lxml

Continuing with the same HTML, suppose each li tag whose class is item-1 is one news item. How do we assemble each news item into a dict?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html">first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
# Get the list of hrefs and the list of titles
href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")
# Pair the two lists by position and assemble each pair into a dict
for href, title in zip(href_list, title_list):
    item = {}
    item["href"] = href
    item["title"] = title
    print(item)

Output:

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

Now suppose one of the news items has no href. What happens then?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
# The extraction code is unchanged from the previous example

Result:

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

Every pairing is now wrong, which is not what we want. Section 2.3 shows how to solve this problem.
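The cause of the mismatch is that the two XPath queries return lists of different lengths, so pairing them by position shifts every entry. A minimal sketch of detecting this condition, using a shortened made-up fragment rather than the article's full HTML:

```python
from lxml import etree

# Shortened hypothetical fragment: the first <a> has no href attribute.
text = '''<ul>
<li class="item-1"><a>first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
</ul>'''
html = etree.HTML(text)

href_list = html.xpath("//li[@class='item-1']/a/@href")
title_list = html.xpath("//li[@class='item-1']/a/text()")

# If the lengths differ, pairing by position cannot be trusted.
lists_aligned = len(href_list) == len(title_list)
print(href_list, title_list, lists_aligned)
```

A length check like this only detects the problem; it does not tell you which item is missing its href.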

2.3 Advanced use of the lxml module

So far, selecting an attribute or text has returned strings. But what is returned when we select a node?

Element objects are returned, and each of them can in turn call the xpath method. This lets us structure later extraction as: first group by some tag, then extract data within each group.

For example:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
# Selecting nodes (not attributes or text) returns Element objects
li_list = html.xpath("//li[@class='item-1']")
print(li_list)

Result:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

The result is a list of Element objects, and each of these objects can itself call the xpath method
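As a small illustration of what a single Element offers, the sketch below uses a one-item made-up fragment. Besides a relative xpath call, it uses the standard Element members tag, text, and get(); get() returns None when the attribute is absent.

```python
from lxml import etree

# Hypothetical one-item fragment, for illustration only.
html = etree.HTML('<ul><li class="item-1"><a href="link2.html">second item</a></li></ul>')
li = html.xpath("//li[@class='item-1']")[0]

print(li.tag)            # element name
print(li.get("class"))   # attribute lookup

# A path starting with "./" is evaluated relative to this element
link = li.xpath("./a")[0]
link_href = link.get("href")
link_text = link.text
missing_attr = link.get("title")  # attribute not present, so None
print(link_href, link_text, missing_attr)
```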

First group by the li tag, then extract data within each group

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html">second item</a></li> 
    <li class="item-inactive"><a href="link3.html">third item</a></li> 
    <li class="item-1"><a href="link4.html">fourth item</a></li> 
    <li class="item-0"><a href="link5.html">fifth item</a> 
    </ul> </div> '''
# Group by li tag
html = etree.HTML(text)
li_list = html.xpath("//li[@class='item-1']")
# Extract data within each group
for li in li_list:
    item = {}
    hrefs = li.xpath("./a/@href")
    titles = li.xpath("./a/text()")
    # Fall back to None when the attribute or text is missing
    item["href"] = hrefs[0] if hrefs else None
    item["title"] = titles[0] if titles else None
    print(item)

Result: