欧美bbbwbbbw肥妇,免费乱码人妻系列日韩,一级黄片

Python lxml模塊的基本使用方法分析

 更新時間:2019年12月21日 11:21:51   作者:Dylan HU  
這篇文章主要介紹了Python lxml模塊的基本使用方法,結(jié)合實(shí)例形式分析了Python安裝與使用lxml模塊常見操作技巧與相關(guān)注意事項(xiàng),需要的朋友可以參考下

本文實(shí)例講述了Python lxml模塊的基本使用方法。分享給大家供大家參考,具體如下:

1 lxml的安裝

安裝方式:pip install lxml

2 lxml的使用

2.1 lxml模塊的入門使用

導(dǎo)入lxml 的 etree 庫 (導(dǎo)入沒有提示不代表不能用)

from lxml import etree

利用etree.HTML,將字符串轉(zhuǎn)化為Element對象,Element對象具有xpath的方法,返回結(jié)果的列表,能夠接受bytes類型的數(shù)據(jù)和str類型的數(shù)據(jù)

html = etree.HTML(text) 
ret_list = html.xpath("xpath字符串")

把轉(zhuǎn)化后的element對象轉(zhuǎn)化為字符串,返回bytes類型結(jié)果 etree.tostring(element)

假設(shè)我們現(xiàn)有如下的html字符換,嘗試對他進(jìn)行操作

<div> <ul> 
<li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
<li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
<li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
<li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
<li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> # 注意,此處缺少一個 </li> 閉合標(biāo)簽 
</ul> </div>

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
print(type(html)) 
handeled_html_str = etree.tostring(html).decode()
print(handeled_html_str)

輸出為

<class 'lxml.etree._Element'>
<html><body><div> <ul>
        <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li>
        <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li>
        <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li>
        <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li>
        <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a>
        </li></ul> </div> </body></html>

可以發(fā)現(xiàn),lxml確實(shí)能夠把確實(shí)的標(biāo)簽補(bǔ)充完成,但是請注意lxml是人寫的,很多時候由于網(wǎng)頁不夠規(guī)范,或者是lxml的bug,即使參考url地址對應(yīng)的響應(yīng)去提取數(shù)據(jù),任然獲取不到,這個時候我們需要使用etree.tostring的方法,觀察etree到底把html轉(zhuǎn)化成了什么樣子,即根據(jù)轉(zhuǎn)化后的html字符串去進(jìn)行數(shù)據(jù)的提取。

2.2 lxml的深入練習(xí)

接下來我們繼續(xù)操作,假設(shè)每個class為item-1的li標(biāo)簽是1條新聞數(shù)據(jù),如何把這條新聞數(shù)據(jù)組成一個字典

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a href="link1.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
#獲取href的列表和title的列表
href_list = html.xpath("http://li[@class='item-1']/a/@href")
title_list = html.xpath("http://li[@class='item-1']/a/text()")
#組裝成字典
for href in href_list:
  item = {}
  item["href"] = href
  item["title"] = title_list[href_list.index(href)]
  print(item)

輸出為

{'href': 'link1.html', 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

假設(shè)在某種情況下,某個新聞的href沒有,那么會怎樣呢?

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''

結(jié)果是

{'href': 'link2.html', 'title': 'first item'}
{'href': 'link4.html', 'title': 'second item'}

數(shù)據(jù)的對應(yīng)全部錯了,這不是我們想要的,接下來通過2.3小節(jié)的學(xué)習(xí)來解決這個問題

2.3 lxml模塊的進(jìn)階使用

前面我們?nèi)〉綄傩?,或者是文本的時候,返回字符串 但是如果我們?nèi)〉降氖且粋€節(jié)點(diǎn),返回什么呢?

返回的是element對象,可以繼續(xù)使用xpath方法,對此我們可以在后面的數(shù)據(jù)提取過程中:先根據(jù)某個標(biāo)簽進(jìn)行分組,分組之后再進(jìn)行數(shù)據(jù)的提取

示例如下:

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
html = etree.HTML(text)
li_list = html.xpath("http://li[@class='item-1']")
print(li_list)

結(jié)果為:

[<Element li at 0x11106cb48>, <Element li at 0x11106cb88>, <Element li at 0x11106cbc8>]

可以發(fā)現(xiàn)結(jié)果是一個element對象,這個對象能夠繼續(xù)使用xpath方法

先根據(jù)li標(biāo)簽進(jìn)行分組,之后再進(jìn)行數(shù)據(jù)的提取

from lxml import etree
text = ''' <div> <ul> 
    <li class="item-1"><a>first item</a></li> 
    <li class="item-1"><a href="link2.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >second item</a></li> 
    <li class="item-inactive"><a href="link3.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >third item</a></li> 
    <li class="item-1"><a href="link4.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fourth item</a></li> 
    <li class="item-0"><a href="link5.html" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" rel="external nofollow" >fifth item</a> 
    </ul> </div> '''
#根據(jù)li標(biāo)簽進(jìn)行分組
html = etree.HTML(text)
li_list = html.xpath("http://li[@class='item-1']")
#在每一組中繼續(xù)進(jìn)行數(shù)據(jù)的提取
for li in li_list:
  item = {}
  item["href"] = li.xpath("./a/@href")[0] if len(li.xpath("./a/@href"))>0 else None
  item["title"] = li.xpath("./a/text()")[0] if len(li.xpath("./a/text()"))>0 else None
  print(item)

結(jié)果是:

{'href': None, 'title': 'first item'}
{'href': 'link2.html', 'title': 'second item'}
{'href': 'link4.html', 'title': 'fourth item'}

前面的代碼中,進(jìn)行數(shù)據(jù)提取需要判斷,可能某些一面不存在數(shù)據(jù)的情況,對應(yīng)的可以使用三元運(yùn)算符來解決

PS:這里再為大家提供幾款關(guān)于xml操作的在線工具供大家參考使用:

在線XML/JSON互相轉(zhuǎn)換工具:
http://tools.jb51.net/code/xmljson

在線格式化XML/在線壓縮XML
http://tools.jb51.net/code/xmlformat

XML在線壓縮/格式化工具:
http://tools.jb51.net/code/xml_format_compress

XML代碼在線格式化美化工具:
http://tools.jb51.net/code/xmlcodeformat

更多關(guān)于Python相關(guān)內(nèi)容感興趣的讀者可查看本站專題:《Python操作xml數(shù)據(jù)技巧總結(jié)》、《Python數(shù)據(jù)結(jié)構(gòu)與算法教程》、《Python Socket編程技巧總結(jié)》、《Python函數(shù)使用技巧總結(jié)》、《Python字符串操作技巧匯總》、《Python入門與進(jìn)階經(jīng)典教程》及《Python文件與目錄操作技巧匯總

希望本文所述對大家Python程序設(shè)計(jì)有所幫助。

相關(guān)文章

  • Django日志和調(diào)試工具欄實(shí)現(xiàn)高效的應(yīng)用程序調(diào)試和性能優(yōu)化

    Django日志和調(diào)試工具欄實(shí)現(xiàn)高效的應(yīng)用程序調(diào)試和性能優(yōu)化

    這篇文章主要介紹了Django日志和調(diào)試工具欄實(shí)現(xiàn)高效的應(yīng)用程序調(diào)試和性能優(yōu)化,Django日志和調(diào)試工具欄為開發(fā)者提供了快速定位應(yīng)用程序問題的工具,可提高調(diào)試和性能優(yōu)化效率,提高應(yīng)用程序的可靠性和可維護(hù)性
    2023-05-05
  • pytorch中交叉熵?fù)p失函數(shù)的使用小細(xì)節(jié)

    pytorch中交叉熵?fù)p失函數(shù)的使用小細(xì)節(jié)

    這篇文章主要介紹了pytorch中交叉熵?fù)p失函數(shù)的使用細(xì)節(jié),具有很好的參考價值,希望對大家有所幫助。如有錯誤或未考慮完全的地方,望不吝賜教
    2023-02-02
  • python中的編碼和解碼及\x和\u問題

    python中的編碼和解碼及\x和\u問題

    這篇文章主要介紹了python中的編碼和解碼及\x和\u問題,具有很好的參考價值,希望對大家有所幫助。如有錯誤或未考慮完全的地方,望不吝賜教
    2022-05-05
  • Python?抖音評論數(shù)據(jù)抓取分析

    Python?抖音評論數(shù)據(jù)抓取分析

    大家好,最近抖音張同學(xué)突然火了,兩個月漲粉一千多萬。今天這篇文章,我抓取了張同學(xué)的視頻的評論數(shù)據(jù),想從文本分析的角度,挖掘一下大家對張同學(xué)感興趣的點(diǎn)
    2022-01-01
  • python如何使用replace做多字符替換

    python如何使用replace做多字符替換

    這篇文章主要介紹了python如何使用replace做多字符替換,具有很好的參考價值,希望對大家有所幫助。如有錯誤或未考慮完全的地方,望不吝賜教
    2022-05-05
  • pytorch單元測試的實(shí)現(xiàn)示例

    pytorch單元測試的實(shí)現(xiàn)示例

    單元測試是一種軟件測試方法,本文主要介紹了pytorch單元測試的實(shí)現(xiàn)示例,文中通過示例代碼介紹的非常詳細(xì),對大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價值,需要的朋友們下面隨著小編來一起學(xué)習(xí)學(xué)習(xí)吧
    2024-04-04
  • python寫入csv時writerow()和writerows()函數(shù)簡單示例

    python寫入csv時writerow()和writerows()函數(shù)簡單示例

    這篇文章主要給大家介紹了關(guān)于python寫入csv時writerow()和writerows()函數(shù)的相關(guān)資料,writerows和writerow是Python中csv模塊中的兩個函數(shù),用于將數(shù)據(jù)寫入CSV文件,需要的朋友可以參考下
    2023-07-07
  • 還不知道Anaconda是什么?讀這一篇文章就夠了

    還不知道Anaconda是什么?讀這一篇文章就夠了

    Anaconda指的是一個開源的Python發(fā)行版本,其包含了Conda、Python等180多個科學(xué)包及其依賴項(xiàng),下面這篇文章主要給大家介紹了關(guān)于Anaconda是什么的相關(guān)資料,需要的朋友可以參考下
    2023-02-02
  • Windows下Python的Django框架環(huán)境部署及應(yīng)用編寫入門

    Windows下Python的Django框架環(huán)境部署及應(yīng)用編寫入門

    這篇文章主要介紹了Windows下Python的Django框架環(huán)境部署及程序編寫入門,Django在Python的框架中算是一個重量級的MVC框架,本文將從程序部署開始講到hellow world web應(yīng)用的編寫,需要的朋友可以參考下
    2016-03-03
  • Python+PyQt5+MySQL實(shí)現(xiàn)天氣管理系統(tǒng)

    Python+PyQt5+MySQL實(shí)現(xiàn)天氣管理系統(tǒng)

    這篇文章主要為大家詳細(xì)介紹了Python+PyQt5+MySQL實(shí)現(xiàn)天氣管理系統(tǒng),文中示例代碼介紹的非常詳細(xì),具有一定的參考價值,感興趣的小伙伴們可以參考一下
    2020-06-06

最新評論