Converting HTML to Markdown in Python with the html2text library: a detailed guide

Updated: 2020-02-21 17:01:55  Submitted by: WDC

This article explains how to convert HTML to Markdown in Python with the html2text library. Readers who need this may find it a useful reference.

If you search PyPI for html2text, what you find is a different library from the original: Alir3z4/html2text. It was forked from aaronsw/html2text and extends the original's functionality. Because this fork is the one that pip installs, it is the one this article covers.

First, install it:

pip install html2text

Using html2text from the command line

Once installed, the html2text command is available for a range of operations.

Its usage is: html2text [(filename|url) [encoding]]. Running html2text -h lists the options the command supports:

Option	Description
`--version`	Show the program's version number and exit
`-h, --help`	Show the help message and exit
`--no-wrap-links`	Do not wrap links during conversion
`--ignore-emphasis`	Do not include any formatting for emphasis
`--reference-links`	Use reference-style links instead of inline links
`--ignore-links`	Do not include any formatting for links
`--protect-links`	Protect links from line breaks by surrounding them with angle brackets
`--ignore-images`	Do not include any formatting for images
`--images-to-alt`	Discard image data and keep only the alt text
`--images-with-size`	Write image tags as raw HTML with height and width attributes, to preserve dimensions
`-g, --google-doc`	Convert a Google Doc that was exported as HTML
`-d, --dash-unordered-list`	Use a dash rather than an asterisk for unordered list items
`-e, --asterisk-emphasis`	Use an asterisk rather than an underscore for emphasized text
`-b BODY_WIDTH, --body-width=BODY_WIDTH`	Number of characters per output line; 0 disables wrapping
`-i LIST_INDENT, --google-list-indent=LIST_INDENT`	Number of pixels Google indents nested lists
`-s, --hide-strikethrough`	Hide strike-through text. Only relevant when -g is also specified
`--escape-all`	Escape all special characters. The output is less readable, but avoids formatting problems in corner cases
`--bypass-tables`	Format tables in HTML rather than Markdown syntax
`--single-line-break`	Use a single line break after a block element rather than two. Note: requires --body-width=0
`--unicode-snob`	Use unicode throughout the document
`--no-automatic-links`	Do not use automatic links wherever applicable
`--no-skip-internal-links`	Do not skip internal links
`--links-after-para`	Put links after each paragraph instead of at the end of the document
`--mark-code`	Mark code blocks with [code]...[/code]
`--decode-errors=DECODE_ERRORS`	How to handle decode errors. Accepted values are 'ignore', 'strict' and 'replace'
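The three values accepted by `--decode-errors` have the same names as the error handlers of Python's own `bytes.decode`. The sketch below uses only the standard library to show what each handler does with an invalid byte. That the command-line option forwards its value to the decoder in this way is an assumption, not something the help text states.

```python
# Standard-library illustration of the 'ignore', 'replace' and 'strict'
# error handlers. It does not call html2text.

raw = "café".encode("latin-1")  # b'caf\xe9', which is not valid UTF-8


def decode_with(handler):
    """Decode `raw` as UTF-8 with the given error handler."""
    try:
        return raw.decode("utf-8", errors=handler)
    except UnicodeDecodeError as exc:
        # 'strict' raises instead of returning text
        return "UnicodeDecodeError: " + exc.reason


for handler in ("ignore", "replace", "strict"):
    print(handler, "->", repr(decode_with(handler)))
```

'ignore' silently drops the bad byte, 'replace' substitutes U+FFFD, and 'strict' stops with an exception, so the choice depends on whether losing characters quietly is acceptable for the input at hand.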

Usage examples:

# Pass a URL
html2text http://eepurl.com/cK06Gn

# Pass a file name, with the encoding set to utf-8
html2text test.html utf-8
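The same command can be driven from Python when the conversion is one step in a larger job. The following is a minimal sketch: the helper names `build_command` and `convert` and their keyword arguments are hypothetical, the flags come from the option table, and decoding the command's output as UTF-8 is an assumption.

```python
import subprocess


def build_command(source, encoding=None, body_width=None, ignore_links=False):
    """Build an argument list for the html2text command.

    `source` is a file name or URL. Only flags listed in the option
    table are used; the keyword names are made up for this sketch.
    """
    cmd = ["html2text"]
    if body_width is not None:
        cmd.append("--body-width={}".format(body_width))
    if ignore_links:
        cmd.append("--ignore-links")
    cmd.append(source)
    if encoding:
        cmd.append(encoding)
    return cmd


def convert(source, **options):
    """Run the command and return its standard output as text."""
    result = subprocess.run(
        build_command(source, **options),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout.decode("utf-8")  # assumption: UTF-8 output


print(build_command("test.html", encoding="utf-8", body_width=0))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with URLs that contain `&` or `?`.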

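Using html2text in a script

Besides using html2text directly from the command line, we can also import it as a library in a script.

We will use the following HTML text as the example: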

html_content = """
<span style="font-size:14px"><a href="http://blog.yhat.com/posts/visualize-nba-pipelines.html" rel="external nofollow" target="_blank" style="color: #1173C7;text-decoration: underline;font-weight: bold;">Data Wrangling 101: Using Python to Fetch, Manipulate &amp; Visualize NBA Data</a></span><br>
A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.<br>
"""

Converting the HTML text to Markdown takes a single line:

import html2text
print(html2text.html2text(html_content))

The output is:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)

A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.

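The options listed above can also be set through an HTML2Text instance:

import html2text
h = html2text.HTML2Text()
print(h.handle(html_content)) # same output as above

Note: the sections below show only the output with a given option set. Unless stated otherwise, the output without the option (that is, with its default value) is the same as above.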
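Each section that follows sets one attribute on the instance in place of one command-line flag. The sketch below collects the flag and attribute pairs used in this article into a table and applies them with `setattr`. The names `FLAG_TO_ATTRIBUTE`, `apply_flags` and `FakeConverter` are hypothetical, and the stand-in class is there only so the sketch runs without the library installed.

```python
# Flag-to-attribute pairs as they appear in this article. Each value is
# (attribute name, value to assign when the flag is given).
FLAG_TO_ATTRIBUTE = {
    "--ignore-emphasis": ("ignore_emphasis", True),
    "--reference-links": ("inline_links", False),  # note the inverted sense
    "--ignore-links": ("ignore_links", True),
    "--protect-links": ("protect_links", True),
    "--ignore-images": ("ignore_images", True),
    "--images-to-alt": ("images_to_alt", True),
    "--images-with-size": ("images_with_size", True),
    "--mark-code": ("mark_code", True),
}


def apply_flags(converter, flags):
    """Set the attribute for each flag on `converter` and return it."""
    for flag in flags:
        name, value = FLAG_TO_ATTRIBUTE[flag]
        setattr(converter, name, value)
    return converter


class FakeConverter:
    """Stand-in object so the sketch runs without html2text installed."""


h = apply_flags(FakeConverter(), ["--ignore-links", "--reference-links"])
print(h.ignore_links, h.inline_links)  # True False
```

With the library installed, an `html2text.HTML2Text()` instance would be passed in place of `FakeConverter()`. `--body-width` is left out of the table because it takes a number rather than acting as a switch.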

--ignore-emphasis

With --ignore-emphasis set:

h.ignore_emphasis = True
print(h.handle("<p>hello, this is <em>Ele</em></p>"))

Output:

hello, this is Ele

Without --ignore-emphasis:

h.ignore_emphasis = False # default value
print(h.handle("<p>hello, this is <em>Ele</em></p>"))

Output:

hello, this is _Ele_

--reference-links

h.inline_links = False
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data][16]

A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.

[16]: http://blog.yhat.com/posts/visualize-nba-pipelines.html
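To make the difference between the two link styles concrete, here is a small standard-library sketch that rewrites inline Markdown links into reference style. It is not how html2text produces its output. The function name `to_reference_links` is hypothetical, numbering starts at 1 rather than at whatever counter the library uses, and the simple pattern would also match the bracketed part of an image such as `![alt](src)`.

```python
import re

# [text](url) with no whitespace inside the URL
INLINE_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def to_reference_links(markdown):
    """Rewrite [text](url) as [text][n] and append the '[n]: url' list."""
    urls = []

    def replace(match):
        text, url = match.group(1), match.group(2)
        if url not in urls:
            urls.append(url)  # reuse one number per distinct URL
        return "[{}][{}]".format(text, urls.index(url) + 1)

    body = INLINE_LINK.sub(replace, markdown)
    if not urls:
        return body
    refs = "\n".join("[{}]: {}".format(i, u) for i, u in enumerate(urls, 1))
    return body.rstrip("\n") + "\n\n" + refs + "\n"


print(to_reference_links("See [yhat](http://blog.yhat.com/) for details."))
```

Reference style keeps long URLs out of the running text, which tends to matter most when the Markdown is read as plain text.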

--ignore-links

h.ignore_links = True
print(h.handle(html_content))

Output:

Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data

A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.

--protect-links

h.protect_links = True
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA
Data](<http://blog.yhat.com/posts/visualize-nba-pipelines.html>)

A tutorial using pandas and a few other packages to build a simple datapipe
for getting NBA data. Even though this tutorial is done using NBA data, you
don't need to be an NBA fan to follow along. The same concepts and techniques
can be applied to any project of your choosing.

--ignore-images

h.ignore_images = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))

Output:

This is a img: ending ...

--images-to-alt

h.images_to_alt = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" style="max-height: 32px; max-width: 32px;" alt="hot3"> ending ...</p>'))

Output:

This is a img: hot3 ending ...
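The difference between dropping an image and keeping its alt text can be reproduced with the standard library's `HTMLParser`. This is an illustration of the idea only, not html2text's implementation. The class `ImageTextExtractor` and the helper `extract` are hypothetical names, and the sample uses a shortened `src`.

```python
from html.parser import HTMLParser


class ImageTextExtractor(HTMLParser):
    """Collect text, handling <img> either by dropping it or keeping alt."""

    def __init__(self, keep_alt):
        super().__init__()
        self.keep_alt = keep_alt
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img" and self.keep_alt:
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by a dropped image.
        return " ".join("".join(self.parts).split())


def extract(html, keep_alt):
    parser = ImageTextExtractor(keep_alt)
    parser.feed(html)
    parser.close()
    return parser.text()


sample = '<p>This is a img: <img src="hot3.png" alt="hot3"> ending ...</p>'
print(extract(sample, keep_alt=False))  # This is a img: ending ...
print(extract(sample, keep_alt=True))   # This is a img: hot3 ending ...
```

Keeping the alt text is usually the better choice when the Markdown is meant to be read without the images, since the sentence still mentions what was there.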

--images-with-size

h.images_with_size = True
print(h.handle('<p>This is a img: <img src="https://my.oschina.net/img/hot3.png" height=32px width=32px alt="hot3"> ending ...</p>'))

Output:

This is a img: <img src='https://my.oschina.net/img/hot3.png' width='32px'
height='32px' alt='hot3' /> ending ...

--body-width

h.body_width = 0
print(h.handle(html_content))

Output:

[Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data](http://blog.yhat.com/posts/visualize-nba-pipelines.html)

A tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.
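What the width setting controls can be seen with the standard library's `textwrap` as a stand-in for html2text's own wrapping. The width of 78 is an assumption chosen because the wrapped outputs earlier in the article break at about that column, and the helper name `wrap` is hypothetical.

```python
import textwrap

paragraph = (
    "A tutorial using pandas and a few other packages to build a simple "
    "datapipe for getting NBA data. Even though this tutorial is done using "
    "NBA data, you don't need to be an NBA fan to follow along."
)


def wrap(text, body_width):
    """Wrap to `body_width` columns; 0 means leave the text on one line."""
    if body_width == 0:
        return text
    return textwrap.fill(text, width=body_width)


wrapped = wrap(paragraph, 78)
print(wrapped)
print(len(wrap(paragraph, 0).splitlines()))  # 1
```

Turning wrapping off is convenient when the Markdown will be post-processed with string replacement or regular expressions, because a phrase can no longer be split across two lines.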

--mark-code

h.mark_code = True
print(h.handle('<pre class="hljs css"><code class="hljs css">&nbsp;&nbsp;&nbsp;&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">rpm</span></span>&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">-Uvh</span></span>&nbsp;<span class="hljs-selector-tag"><span class="hljs-selector-tag">erlang-solutions-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.0-1</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.noarch</span></span><span class="hljs-selector-class"><span class="hljs-selector-class">.rpm</span></span></code></pre>'))

Output:

[code]
rpm -Uvh erlang-solutions-1.0-1.noarch.rpm
[/code]

In this way, the HTML -> Markdown conversion can be customized and automated in a script. The following script is an example:

# -*- coding: utf-8 -*-
import re

import requests
from lxml import etree
import html2text


# Fetch the first (latest) issue listed on the archive page
def get_first_issue(url):
  resp = requests.get(url)
  page = etree.HTML(resp.text)
  issue_list = page.xpath("//ul[@id='archive-list']/div[@class='display_archive']/li/a")
  # Copy the link's attributes (href, title) into a plain dict
  fst_issue = dict(issue_list[0].attrib)
  fst_issue["text"] = issue_list[0].text
  return fst_issue


# Fetch the issue body and convert it to Markdown
def get_issue_md(url):
  resp = requests.get(url)
  page = etree.HTML(resp.text)
  content = page.xpath("//table[@id='templateBody']")[0]  # alternative: '//table[@class="bodyTable"]'
  h = html2text.HTML2Text()
  h.body_width = 0  # do not wrap lines
  # encoding="unicode" makes lxml return str rather than bytes
  return h.handle(etree.tostring(content, encoding="unicode"))

# Maps the newsletter's English section titles to Chinese Markdown headings
subtitle_mapping = {
  '**From Our Sponsor**': '# 來自贊助商',
  '**News**': '# 新聞',
  '**Articles**,** Tutorials and Talks**': '# 文章，教程和講座',
  '**Books**': '# 書籍',
  '**Interesting Projects, Tools and Libraries**': '# 好玩的項目，工具和庫',
  '**Python Jobs of the Week**': '# 本周的Python工作',
  '**New Releases**': '# 最新發布',
  '**Upcoming Events and Webinars**': '# 近期活動和網絡研討會',
}
def clean_issue(content):
  # Remove 'Share Python Weekly' and what follows it
  # (without re.DOTALL, '.*' stops at the end of that line)
  content = re.sub(r'\*\*Share Python Weekly.*', '', content, flags=re.IGNORECASE)
  # Rewrite the section titles
  for k, v in subtitle_mapping.items():
    content = content.replace(k, v)
  return content

# Output template; "原文" means "original article"
tpl_str = """原文：[{title}]({url})
---
{content}
"""
def run():
  issue_list_url = "https://us2.campaign-archive.com/home/?u=e2e180baf855ac797ef407fc7&id=9e26887fc5"
  print("Fetching the latest issue...")
  fst = get_first_issue(issue_list_url)
  #fst = {'href': 'http://eepurl.com/dqpDyL', 'title': 'Python Weekly - Issue 341'}
  print("Done. Extracting the issue body and converting it to Markdown")
  content = get_issue_md(fst['href'])
  print("Cleaning up the issue content")
  content = clean_issue(content)

  print("Cleanup done; about to write", fst['title'], "to a file")
  title = fst['title'].replace('- ', '').replace(' ', '_')
  # Text mode with an explicit encoding, since the content is str
  with open(title.strip() + '.md', "w", encoding="utf-8") as f:
    f.write(tpl_str.format(title=fst['title'], url=fst['href'], content=content))
  print("All done. File saved to %s.md" % title)

if __name__ == '__main__':
  run()

This script is run once a week to convert the latest Python Weekly issue to Markdown.
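The cleanup step of the script can be tried on its own, without any network access. The sketch below reproduces its two operations on a made-up sample. The two-entry `mapping`, the `dotall` switch and the sample text are hypothetical and exist only to show how far the footer pattern reaches with and without `re.DOTALL`.

```python
import re

# Hypothetical two-entry mapping in the style of subtitle_mapping.
mapping = {
    "**News**": "# News",
    "**Books**": "# Books",
}


def clean(content, dotall=False):
    """Drop the 'Share Python Weekly' footer, then rewrite the subtitles."""
    flags = re.IGNORECASE | (re.DOTALL if dotall else 0)
    content = re.sub(r"\*\*Share Python Weekly.*", "", content, flags=flags)
    for old, new in mapping.items():
        content = content.replace(old, new)
    return content


sample = "**News**\nitem one\n**Share Python Weekly**\nfooter line\n"
print(repr(clean(sample)))               # footer line survives
print(repr(clean(sample, dotall=True)))  # everything after the marker is gone
```

Without `re.DOTALL` the `.` does not match a newline, so only the rest of the marker's own line is removed. Whether the lines after it should also go depends on what the newsletter puts there, so this is a choice to check against a real issue.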

That concludes this introduction to html2text. If it does not meet your needs, or you want to add features, you can fork it and modify it yourself.
