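Converting HTML to Markdown in Python with the html2text library: a detailed guide
Searching PyPI for html2text turns up not the original aaronsw/html2text but a fork of it, Alir3z4/html2text, which extends the original's functionality. Since this is the version that pip installs, it is the one this article covers.
First, install it: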
pip install html2text
Using html2text from the command line
Once installed, the html2text command is available on the command line.
Its usage is: html2text [(filename|url) [encoding]]. Running html2text -h lists the options the command supports:
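Option | Description |
---|---|
--version | Show the program's version number and exit |
-h, --help | Show the help message and exit |
--no-wrap-links | Do not wrap links during conversion |
--ignore-emphasis | Do not include any formatting for emphasis |
--reference-links | Use reference-style links instead of inline links |
--ignore-links | Do not include any formatting for links |
--protect-links | Protect links from line breaks by surrounding them with angle brackets |
--ignore-images | Do not include any formatting for images |
--images-to-alt | Discard image data and keep only the alt text |
--images-with-size | Write image tags as raw HTML with height and width attributes to preserve dimensions |
-g, --google-doc | Convert a Google Doc that was exported as HTML |
-d, --dash-unordered-list | Use a dash rather than an asterisk for unordered list items |
-e, --asterisk-emphasis | Use an asterisk rather than an underscore for emphasized text |
-b BODY_WIDTH, --body-width=BODY_WIDTH | Number of characters per output line; 0 disables wrapping |
-i LIST_INDENT, --google-list-indent=LIST_INDENT | Number of pixels Google indents nested lists |
-s, --hide-strikethrough | Hide strike-through text. Only relevant when -g is also specified |
--escape-all | Escape all special characters. The output is less readable but avoids corner-case formatting issues |
--bypass-tables | Format tables in HTML rather than Markdown syntax |
--single-line-break | Use a single line break after a block element rather than two. Note: requires --body-width=0 |
--unicode-snob | Use Unicode throughout the document |
--no-automatic-links | Do not use automatic links wherever applicable |
--no-skip-internal-links | Do not skip internal links |
--links-after-para | Put links after each paragraph instead of at the end of the document |
--mark-code | Mark code blocks with [code]...[/code] |
--decode-errors=DECODE_ERRORS | How to handle decode errors. Accepted values are 'ignore', 'strict' and 'replace' |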
Typical usage:
# Pass a URL
html2text http://eepurl.com/cK06Gn
# Pass a file name, with the encoding set to utf-8
html2text test.html utf-8
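The same invocations can also be driven from Python. The following is only a minimal sketch, assuming the html2text command is on PATH; build_command and run_html2text are hypothetical helper names that are not part of the library, and only the --body-width option from the table above is wired in.

```python
import shutil
import subprocess


def build_command(source, encoding=None, body_width=None):
    """Assemble an html2text command line: options first, then source and encoding."""
    cmd = ["html2text"]
    if body_width is not None:
        cmd.append(f"--body-width={body_width}")
    cmd.append(source)
    if encoding is not None:
        cmd.append(encoding)
    return cmd


def run_html2text(source, encoding=None, body_width=None):
    """Run the command and return whatever it prints (the Markdown text)."""
    cmd = build_command(source, encoding, body_width)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError("html2text is not installed or not on PATH")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Only the command is built here; nothing is executed.
print(build_command("test.html", "utf-8", body_width=0))
```

Keeping command construction separate from execution makes the argument order easy to check without having the tool installed.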
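Using html2text in a script
Besides running html2text from the command line, we can also import it as a library in a script.
Take the following HTML text as an example: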
html_content = """
<span style="font-size:14px"><a href="http://blog.yhat.com/posts/visualize-nba-pipelines.html" rel="external nofollow" target="_blank" style="color: #1173C7;text-decoration: underline;font-weight: bold;">Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data</a></span><br>
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.<br>
"""
Converting the HTML text to Markdown takes a single call:
import html2text
print(html2text.html2text(html_content))
The output is:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
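The options listed above can also be used, by setting them on an HTML2Text instance:
import html2text
h = html2text.HTML2Text()
print(h.handle(html_content))  # same output as above
Note: the sections below show only the output with a given option set. Without the option, the default value is used and the output is the same as above unless stated otherwise.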
--ignore-emphasis
With --ignore-emphasis set:
h.ignore_emphasis = True
print(h.handle("<p>hello, this is <em>Ele</em></p>"))
Output:
hello, this is Ele
Without --ignore-emphasis:
h.ignore_emphasis = False  # the default
print(h.handle("<p>hello, this is <em>Ele</em></p>"))
Output:
hello, this is _Ele_
--reference-links
h.inline_links = False
print(h.handle(html_content))
Output:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data][16]
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
[16]: http://blog.yhat.com/posts/visualize-nba-pipelines.html
--ignore-links
h.ignore_links = True
print(h.handle(html_content))
Output:
Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
--protect-links
h.protect_links = True
print(h.handle(html_content))
Output:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](<http://blog.yhat.com/posts/visualize-nba-pipelines.html>)
A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.
--ignore-images
h.ignore_images = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))
Output:
This is a img: ending ...
--images-to-alt
h.images_to_alt = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))
Output:
This is a img: hot3 ending ...
--images-with-size
h.images_with_size = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" height=32px width=32px alt="hot3"> ending ...</p>'))
Output:
This is a img: <img src='https://my.oschina.net/img/hot3.png' width='32px'
height='32px' alt='hot3' /> ending ...
--body-width
h.body_width = 0
print(h.handle(html_content))
Output:
[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.
--mark-code
h.mark_code = True
print(h.handle('<pre class="hljs css"><code class="hljs css"> <span class="hljs-selector-tag"><span class="hljs-selector-tag">rpm</span></span> <span class="hljs-selector-tag"><span class="hljs-selector-tag">-Uvh</span></span> <span class="hljs-selector-tag"><span class="hljs-selector-tag">erlang-solutions-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.0-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.noarch</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.rpm</span></span></code></pre>'))
Output:
[code]
    rpm -Uvh erlang-solutions-1.0-1.noarch.rpm
[/code]
In this way, the HTML-to-Markdown conversion can be customized and automated in a script. The following script is an example:
# -*- coding: utf-8 -*-
import re

import requests
from lxml import etree
import html2text


# Fetch the first (latest) issue from the archive list
def get_first_issue(url):
    resp = requests.get(url)
    page = etree.HTML(resp.text)
    issue_list = page.xpath("//ul[@id='archive-list']/div[@class='display_archive']/li/a")
    fst_issue = issue_list[0].attrib
    fst_issue["text"] = issue_list[0].text
    return fst_issue


# Fetch the content of an issue and convert it to Markdown
def get_issue_md(url):
    resp = requests.get(url)
    page = etree.HTML(resp.text)
    content = page.xpath("//table[@id='templateBody']")[0]  # '//table[@class="bodyTable"]')[0]
    h = html2text.HTML2Text()
    h.body_width = 0  # no automatic wrapping
    # encoding="unicode" makes lxml return str, which handle() expects
    return h.handle(etree.tostring(content, encoding="unicode"))


# English section titles mapped to the Chinese headings used in the output
subtitle_mapping = {
    '**From Our Sponsor**': '# 來自贊助商',
    '**News**': '# 新聞',
    '**Articles**,** Tutorials and Talks**': '# 文章,教程和講座',
    '**Books**': '# 書籍',
    '**Interesting Projects, Tools and Libraries**': '# 好玩的項目,工具和庫',
    '**Python Jobs of the Week**': '# 本周的Python工作',
    '**New Releases**': '# 最新發布',
    '**Upcoming Events and Webinars**': '# 近期活動和網絡研討會',
}


def clean_issue(content):
    # Remove 'Share Python Weekly' and everything after it
    # (re.DOTALL lets '.*' run past the end of the line)
    content = re.sub(r'\*\*Share Python Weekly.*', '', content,
                     flags=re.IGNORECASE | re.DOTALL)
    # Replace the section titles
    for k, v in subtitle_mapping.items():
        content = content.replace(k, v)
    return content


# '原文' means 'Original'
tpl_str = """原文:[{title}]({url})

---

{content}

"""


def run():
    issue_list_url = "https://us2.campaign-archive.com/home/?u=e2e180baf855ac797ef407fc7&id=9e26887fc5"
    print("Fetching the latest issue...")
    fst = get_first_issue(issue_list_url)
    # fst = {'href': 'http://eepurl.com/dqpDyL', 'title': 'Python Weekly - Issue 341'}
    print("Done. Extracting the issue content and converting it to Markdown")
    content = get_issue_md(fst['href'])
    print("Cleaning the issue content")
    content = clean_issue(content)
    print("Cleaning done, about to write", fst['title'], "to a file")
    title = fst['title'].replace('- ', '').replace(' ', '_')
    with open(title.strip() + '.md', "w", encoding="utf-8") as f:
        f.write(tpl_str.format(title=fst['title'], url=fst['href'], content=content))
    print("Finished. File saved as %s.md" % title)


if __name__ == '__main__':
    run()
This script runs once a week and converts the latest Python Weekly issue to Markdown.
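The cleanup and file-naming steps of the script can be tried without any network access. The sketch below uses only the standard library; SUBTITLES, SAMPLE, clean and file_stem are hypothetical stand-ins with made-up sample text, and it assumes the footer should be removed through the end of the document, hence re.DOTALL.

```python
import re

# Hypothetical two-entry mapping and sample text, for illustration only.
SUBTITLES = {"**News**": "# News", "**Books**": "# Books"}
SAMPLE = (
    "**News**\n\nitem one\n\n"
    "**Books**\n\nitem two\n\n"
    "**Share Python Weekly**\nfooter line"
)


def clean(content):
    """Drop the share footer, then turn bold section titles into headings."""
    content = re.sub(r"\*\*Share Python Weekly.*", "", content,
                     flags=re.IGNORECASE | re.DOTALL)
    for marker, heading in SUBTITLES.items():
        content = content.replace(marker, heading)
    return content.rstrip()


def file_stem(title):
    """Derive a file name stem from an issue title."""
    return title.replace("- ", "").replace(" ", "_").strip()


print(clean(SAMPLE))
print(file_stem("Python Weekly - Issue 341"))
```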
That's all for html2text. If it doesn't meet your needs, or you want to add more features, you can fork it and modify it yourself.