How to process HTML files with regular expressions in Python
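The finditer method finds every match in a string. You may already have used findall, which returns a list of the matched strings. finditer instead returns an iterator that yields a match object for each match, in order. In the script below, these match objects are visited in a for loop so that group 1 of each one can be printed.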
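To make the difference concrete, here is a minimal sketch that contrasts the two methods. The sample line is made up for illustration; the pattern is the same test RE used in the script.

```python
import re

# Hypothetical sample line; the pattern mirrors the script's test RE.
pattern = re.compile(r'(logic|sicstus)', re.I)
line = 'Logic programming with SICStus Prolog and more logic.'

# findall returns plain strings (here the contents of group 1).
strings = pattern.findall(line)

# finditer yields one match object per match, in order of appearance,
# so positions and individual groups remain available.
matches = list(pattern.finditer(line))
groups = [m.group(1) for m in matches]
spans = [m.span() for m in matches]

for m in matches:
    print('** TEST-RE:', m.group(1), 'at', m.span())
```

A match object carries more than the text, which is why finditer is the better fit when the position or several groups of each match are needed.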
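Your task is to write Python REs that recognise certain patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningfully named variables) and applies them to each line of the file, printing out any matches found.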
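1. Write a pattern that recognises HTML tags and prints each one as "TAG: TAG string" (e.g. "TAG: b" for the tag <b>). For simplicity, assume that the opening and closing angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class symbol that matches any character. Try this out and work out why it is not a good solution, then write a better one that fixes the problem.
2. Modify the code so that it distinguishes opening and closing tags (e.g. p vs. /p), printing OPENTAG and CLOSETAG.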
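The problem with "<.*>" is easiest to see on a line that holds several tags. The following sketch uses a made-up sample line; the open/close pattern at the end is one possible way to tell the two kinds apart, not the only one.

```python
import re

line = '<p><b>bold</b> text</p>'  # hypothetical sample line

# Greedy: '.*' runs to the LAST '>' on the line, swallowing every tag.
greedy = re.findall(r'<.*>', line)

# Non-greedy: '.*?' stops at the first '>' it can.
lazy = re.findall(r'<.*?>', line)

# Negated character class: the tag body may not contain '>'.
negated = re.findall(r'<[^>]+>', line)

# An optional '/' in its own group distinguishes close tags from open tags.
kinds = []
for m in re.finditer(r'<(/?)(\w+)[^>]*>', line):
    kind = 'CLOSETAG' if m.group(1) else 'OPENTAG'
    kinds.append((kind, m.group(2)))
    print('%s: %s' % (kind, m.group(2)))
```

The greedy version returns the whole line as a single "tag", while the other two return the four tags separately.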
import sys, re
#------------------------------
# Raw strings (r'..') pass escapes such as \S to the regex engine unchanged.
testRE = re.compile(r'(logic|sicstus)', re.I)
# Any tag, open or close, with or without parameters:
testI = re.compile(r'</?[a-z][^>]*>', re.I)
# Open tags: the first char after '<' must not be '/':
testO = re.compile(r'<[^/](\S*?)[^>]*>')
# Close tags: '<' must be followed directly by '/':
testC = re.compile(r'</(\S*?)[^>]*>')
with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print(' ', '-' * 100, '[%d]' % linenum, '\n TEXT:', line, end='')
        # finditer visits every match on the line, not just the first
        mm = testRE.finditer(line)
        for m in mm:
            print('** TEST-RE:', m.group(1))
        index = testI.finditer(line)
        for i in index:
            print('TAG:', i.group().replace('<', '').replace('>', ''))
        open1 = testO.finditer(line)
        for m in open1:
            print('OPENTAG:', m.group().replace('<', '').replace('>', ''))
        close1 = testC.finditer(line)
        for n in close1:
            print('CLOSETAG:', n.group().replace('<', '').replace('>', ''))

Note that some HTML tags have parameters, for example:
<table border=1 cellspacing=0 cellpadding=8>
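Make sure that your open-tag pattern works for tags both with and without parameters, i.e. that it finds and prints the tag label in either case. Now extend your code so that it prints both the open-tag label and the parameters, for example: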
OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8
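# Replacement for the open-tag loop (goes inside the per-line loop):
open1 = testO.finditer(line)
for m in open1:
    # Drop the angle brackets, then split on whitespace: the first
    # item is the tag label, the remaining items are parameters.
    firstm = m.group().replace('<', '').replace('>', '').split()
    for num, otherm in enumerate(firstm):
        if num == 0:
            print('OPENTAG:', otherm)
        else:
            print('PARAM:', otherm)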
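In a regular expression, a backreference indicates that a substring matched by an earlier part of the expression must appear again. It is written \N (where N is a positive integer) and refers back to the text matched by the Nth group of the regex. For example, a regex such as r" (\w+) \1 " matches only if the exact string matched by the group (\w+) appears again at the position of the backreference \1. It could match a string in which "the" appears twice in a row, for example. Using a backreference, write a pattern that matches when a line contains a paired open and close tag, e.g. <b>in bold</b>.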
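A short sketch of both uses of a backreference follows. The sample strings and the variable names doubled and pair are made up for illustration.

```python
import re

# A repeated word: \1 must equal whatever group 1 matched.
doubled = re.compile(r'\b(\w+) \1\b')
m = doubled.search('kick the the ball')  # hypothetical sample string
repeated_word = m.group(1) if m else None

# A paired open/close tag on one line: the close tag must carry the same
# label as the open tag. The '\b' after '\w+' stops the label from being
# cut short (e.g. treating '<body>' as a 'b' tag with extra characters).
pair = re.compile(r'<(\w+)\b[^>]*>(.*?)</\1\s*>')
line = 'This is <b>in bold</b> and <i>in italics</i>, but <b>unclosed</i>.'
pairs = [(p.group(1), p.group(2)) for p in pair.finditer(line)]

for label, text in pairs:
    print('** PAIR [%s]: "%s"' % (label, text))
```

The mismatched "<b>unclosed</i>" at the end is skipped because no "</b>" follows it on the line.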
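Suppose we wanted to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML markup has been removed. We are not going to do that here; instead, consider the simpler case of removing any HTML tags found in each line of the input data file.
You should be able to use the RE you have already defined for recognising HTML tags to do this. Print the resulting text to the screen as STRIPPED: ...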
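As a minimal sketch of the idea, the sub method of a compiled pattern replaces every match with a given string, so substituting the empty string deletes the tags. The sample line below is hypothetical.

```python
import re

# Tag pattern: optional '/', a label, then anything up to the closing '>'.
tag = re.compile(r'</?\w+\b[^>]*>')

line = '<p>Some <b>bold</b> text with a <a href="x.html">link</a>.</p>'

# With no count given, sub replaces ALL matches on the line.
stripped = tag.sub('', line)

# The count keyword limits the number of substitutions.
first_only = tag.sub('', line, count=1)

print('** STRIPPED:', stripped)
```

Here stripped contains only the text, while first_only has lost just the leading "<p>".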
import sys, re
#------------------------------
# PART 1:
# Key thing is to avoid matching strings that include
# multiple tags, e.g. treating '<p><b>' as a single
# tag. Can do this in several ways. Firstly, use
# non-greedy matching, so get shortest possible match
# including the two angle brackets:
tag = re.compile('</?(.*?)>')
# The above treats the '/' of a close tag as a separate
# optional component - so that this doesn't turn up as
# part of the match '.group(1)', which is meant to return
# the tag label.
# Following alternative solution uses a negated character
# class to explicitly prevent this including '>':
tag = re.compile('</?([^>]+)>')
# Finally, following version separates finding the tag
# label string from any (optional) parameters that might
# also appear before the close angle bracket:
tag = re.compile(r'</?(\w+\b)([^>]+)?>')
# Note that use of '\b' (as word boundary anchor) here means
# we must mark the regex string as a 'raw' string (r'..').
#------------------------------
# PART 2:
# Following closeTag definition requires the first char
# after the open angle bracket to be '/', while openTag
# definition excludes this by requiring first char to be
# a 'word char' (\w):
openTag = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</([^>]*)>')
# Following revised definitions are more carefully stated
# for correct extraction of tag label (separately from
# any parameters):
openTag = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')
#------------------------------
# PART 3:
# Above openTag definition will already get the string
# encompassing any parameters, and return it as
# m.group(2), i.e. defn:
openTag = re.compile(r'<(\w+\b)([^>]+)?>')
# If assume that parameters are continuous non-whitespace
# chars separated by whitespace chars, then we can divide
# them up using split - and that's how we handle them
# here. (In reality, parameter strings can be a lot more
# messy than this, but we won't try to deal with that.)
#------------------------------
# PART 4:
openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')
# Note use of non-greedy matching for the text falling
# *between* the open/close tag pair - to avoid false
# results where have two similar tag pairs on same line.
#------------------------------
# PART 5: URLS
# This is quite tricky. The URL expressions in the file
# are of two kinds, of which the first is a string
# between double quotes ("..") which may include
# whitespace. For this case we might have a regex:
url = re.compile('href=("[^">]+")', re.I)
# The second case does not have quotes, and does not
# allow whitespace, consisting of a continuous sequence
# of non-whitespace material (that ends when you reach a
# space or close bracket '>'). This might be:
url = re.compile(r'href=([^">\s]+)', re.I)
# We can combine these two cases as follows, and still
# get the expression back as group(1):
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)
# Note that I've done nothing here to exclude 'mailto:'
# links as being accepted as URLS.
#------------------------------
with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print(' ', '-' * 100, '[%d]' % linenum, '\n TEXT:', line, end='')
        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)
        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))
        # PART 2,3: find open/close tags (+ params of open tags)
        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print(' PARAM:', param)
        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))
        # PART 4: find open/close tag pairs appearing on same line
        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))
        # PART 5: find URLs:
        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))
        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)
        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end = '')
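The combined URL pattern from Part 5 can also be tried on its own. The sketch below uses a made-up line with one quoted link, one unquoted link and one mailto: link. The quote stripping and the mailto: filter are additions for illustration, not part of the script above.

```python
import re

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Hypothetical sample line with the two URL styles plus a mail link.
line = ('<a href="my page.html">one</a> '
        '<A HREF=http://example.org/x>two</A> '
        '<a href="mailto:a@example.org">mail</a>')

# group(1) keeps the double quotes when the URL was quoted.
found = [m.group(1) for m in url.finditer(line)]

# Optional clean-up: drop the quotes, then filter out mail links.
cleaned = [u.strip('"') for u in found]
web_only = [u for u in cleaned if not u.lower().startswith('mailto:')]

for u in web_only:
    print('** URL:', u)
```

Filtering after matching keeps the regex itself simple; excluding mailto: inside the pattern would also be possible, at the cost of readability.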