
How to process HTML files with regular expressions in Python

Updated: 2023-04-28 12:40:40  Author: the only KIrsTEN

HTML text data consists of tags written as front-end markup plus the text itself. It can be opened directly in the Chrome browser, which displays the formatting clearly. This article shows how to process HTML files with regular expressions in Python.

Processing HTML files with regular expressions in Python

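The finditer method finds all matches in a string. You may already have used the findall method, which returns a list of the matched strings. finditer instead returns an iterator that yields a match object for each match in turn. In the code below, these match objects are accessed with a for loop so that group 1 of each match can be printed.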
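The difference between the two methods can be seen in a minimal sketch. The pattern mirrors the starter script's testRE; the sample line and the variable names are assumptions made for illustration.

```python
import re

# Same pattern as testRE in the starter script: one group, case-insensitive.
pattern = re.compile(r'(logic|sicstus)', re.I)

# Hypothetical sample line (not taken from RGX_DATA.html).
line = 'Logic programming in SICStus Prolog uses logic.'

# findall returns plain strings: the text of group 1 for every match.
found = pattern.findall(line)

# finditer yields match objects, which also expose the match position.
matches = [(m.group(1), m.start()) for m in pattern.finditer(line)]

print(found)
for text, position in matches:
    print('** TEST-RE:', text, 'at index', position)
```

A match object is the better choice when more than the matched text is needed, for example the position of the match or several groups at once.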

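Your task is to write Python REs that recognise certain patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningful variable names), applies them to each line of the file, and prints the matches found.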

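1. Write a pattern that recognises HTML tags, and print each one as "TAG: TAG string" (for example "TAG: b" for the tag <b>). For simplicity, assume that the opening and closing angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class that matches any character. Try it out and work out why it is not a good solution, then write a better pattern that fixes the problem.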

2. Modify the code so that it distinguishes opening tags from closing tags (for example p versus /p), printing OPENTAG and CLOSETAG respectively.
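Why "<.*>" fails, and how the two tasks above can be approached, is easiest to see on one short line. This is a sketch only: the sample line and all variable names are assumptions, and the patterns are one possible answer rather than the required one.

```python
import re

# Hypothetical sample line containing nested tags.
line = '<p><b>bold</b> text</p>'

# Greedy: '.*' grabs as much as it can, so the match runs from the
# first '<' to the LAST '>' on the line and swallows every tag.
greedy_hits = re.findall(r'<.*>', line)

# Fix 1: non-greedy '.*?' stops at the first '>' it can reach.
lazy_hits = re.findall(r'<(.*?)>', line)

# Fix 2: a negated character class can never run past a '>'.
negated_hits = re.findall(r'<([^>]+)>', line)

# Task 2: an opening tag starts with a word character,
# a closing tag starts with '/'.
open_hits = re.findall(r'<(\w[^>]*)>', line)
close_hits = re.findall(r'</([^>]*)>', line)

print('greedy :', greedy_hits)
print('lazy   :', lazy_hits)
print('negated:', negated_hits)
for label in open_hits:
    print('OPENTAG:', label)
for label in close_hits:
    print('CLOSETAG:', label)
```

The greedy pattern reports the whole line as a single "tag", while both fixes report the four tags separately.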

import sys, re

#------------------------------

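# Example from the starter script: 'logic' or 'sicstus', ignoring case.
testRE = re.compile(r'(logic|sicstus)', re.I)
# Any tag: '<', one or more characters that are not '>', then '>'.
testI = re.compile(r'<[^>]+>')
# Opening tag: the first character after '<' must not be '/'.
testO = re.compile(r'<[^/>][^>]*>')
# Closing tag: '<' is followed immediately by '/'.
testC = re.compile(r'</[^>]*>')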

with open('RGX_DATA.html') as infs: 
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')
    
        m = testRE.search(line)
        if m:
            print('** TEST-RE:', m.group(1))

        mm = testRE.finditer(line)
        for m in mm:
            print('** TEST-RE:', m.group(1))
        
        # Task 1: print every tag on the line, without the angle brackets.
        for m in testI.finditer(line):
            print('TAG:', m.group()[1:-1])

        # Task 2: print opening and closing tags separately.
        for m in testO.finditer(line):
            print('OPENTAG:', m.group()[1:-1])

        # Slice from index 2 to drop the leading '</' as well.
        for m in testC.finditer(line):
            print('CLOSETAG:', m.group()[2:-1])

Note that some HTML tags have parameters, for example:

<table border=1 cellspacing=0 cellpadding=8>

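Make sure that your pattern for opening tags works for tags both with and without parameters, i.e. that it successfully finds and prints the tag label. Now extend your code so that it prints both the label of each opening tag and its parameters, for example: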

OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8

        # Task 3: print each opening tag's label, then its parameters.
        for m in testO.finditer(line):
            # Drop the angle brackets and split on whitespace: the first
            # piece is the tag label, the remaining pieces are parameters.
            parts = m.group()[1:-1].split()
            for num, part in enumerate(parts):
                print('OPENTAG:' if num == 0 else 'PARAM:', part)
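The same idea can be written with two capture groups, so that the label and the parameter string come back separately. The sketch below is self-contained; the function name describe_open_tags is hypothetical, and it assumes that parameters contain no whitespace.

```python
import re

# Group 1: the tag label. Group 2: everything up to '>' (may be empty).
open_tag = re.compile(r'<(\w+)([^>]*)>')


def describe_open_tags(text):
    """Return OPENTAG/PARAM report lines for every opening tag in text."""
    report = []
    for m in open_tag.finditer(text):
        report.append('OPENTAG: ' + m.group(1))
        # Assumption: parameters are whitespace-separated and contain
        # no spaces themselves (quoted values with spaces would break).
        for param in m.group(2).split():
            report.append('PARAM: ' + param)
    return report


line = '<table border=1 cellspacing=0 cellpadding=8>'
report = describe_open_tags(line)
print('\n'.join(report))
```

Because the label must start with a word character, closing tags such as </table> are skipped without any extra check.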

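In a regular expression, a backreference indicates that the substring matched by an earlier part of the regex must appear again. It is written \N (where N is a positive integer) and refers back to the text matched by the Nth group of the regex. For example, a regex such as r"(\w+) \1" matches only if the exact string matched by the group (\w+) appears again at the position of the backreference \1. It would match, for example, a string in which "the" appears twice in a row. Using a backreference, write a pattern that matches when a line contains a paired opening and closing tag, for example the <b> and </b> around bold text.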
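Both uses of a backreference can be tried out as follows. This is a sketch under stated assumptions: the sample strings and the names repeated_word and tag_pair are made up for illustration.

```python
import re

# A word, a space, then the SAME word again (\1 refers back to group 1).
repeated_word = re.compile(r'\b(\w+) \1\b')
doubled = repeated_word.search('hit the the ball')  # hypothetical string

# An opening tag, the enclosed text, then a closing tag with the same
# label. '.*?' is non-greedy so two pairs on one line stay separate.
tag_pair = re.compile(r'<(\w+)[^>]*>(.*?)</\1\s*>')

line = '<b>bold</b> and <i>italic</i>'  # hypothetical sample line
pairs = [(m.group(1), m.group(2)) for m in tag_pair.finditer(line)]

# A mismatched pair is rejected, because \1 must repeat the label 'b'.
mismatch = tag_pair.search('<b>bold</i>')

print(doubled.group(1))
for label, text in pairs:
    print('PAIR [%s]: "%s"' % (label, text))
print(mismatch)
```

Note that finditer returns non-overlapping matches, so a pair nested inside another pair on the same line is reported only as part of the outer pair's text.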

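Suppose we wanted to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML tags have been removed. We will not do that here. Instead, consider a simpler case: removing any HTML tags found in each line of the input data file.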

You should be able to do this with the RE you have already defined for recognising HTML tags. Print the resulting text to the screen as STRIPPED: ...
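Stripping amounts to replacing every tag match with the empty string. A minimal sketch, assuming a tag pattern similar to the one used for the earlier tasks; the helper name strip_tags and the sample line are hypothetical.

```python
import re

# Optional '/', a tag label, then anything up to the closing '>'.
tag = re.compile(r'</?\w+[^>]*>')


def strip_tags(text):
    """Remove every HTML tag matched by `tag` from text."""
    # Pattern.sub replaces ALL matches unless a count is given.
    return tag.sub('', text)


line = '<p>Some <b>bold</b> text with a <a href="x.html">link</a>.</p>'
stripped = strip_tags(line)
print('** STRIPPED:', stripped)
```

This only handles tags that open and close on the same line, which is the simplifying assumption made throughout the exercise.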

import sys, re

#------------------------------
# PART 1: 

   # Key thing is to avoid matching strings that include
   # multiple tags, e.g. treating '<p><b>' as a single
   # tag. Can do this in several ways. Firstly, use
   # non-greedy matching, so get shortest possible match
   # including the two angle brackets:

tag = re.compile('</?(.*?)>') 

   # The above treats the '/' of a close tag as a separate
   # optional component - so that this doesn't turn up as
   # part of the match '.group(1)', which is meant to return
   # the tag label. 
   # Following alternative solution uses a negated character
   # class to explicitly prevent this including '>': 

tag = re.compile('</?([^>]+)>') 

   # Finally, following version separates finding the tag
   # label string from any (optional) parameters that might
   # also appear before the close angle bracket:

tag = re.compile(r'</?(\w+\b)([^>]+)?>') 

   # Note that use of '\b' (as word boundary anchor) here means
   # we must mark the regex string as a 'raw' string (r'..'). 

#------------------------------
# PART 2: 

   # Following closeTag definition requires the first char
   # after the open angle bracket to be '/', while openTag
   # definition excludes this by requiring first char to be
   # a 'word char' (\w):

openTag  = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</([^>]*)>')

   # Following revised definitions are more carefully stated
   # for correct extraction of tag label (separately from
   # any parameters):

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')

#------------------------------
# PART 3: 

   # Above openTag definition will already get the string
   # encompassing any parameters, and return it as
   # m.group(2), i.e. defn: 

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')

   # If assume that parameters are continuous non-whitespace
   # chars separated by whitespace chars, then we can divide
   # them up using split - and that's how we handle them
   # here. (In reality, parameter strings can be a lot more
   # messy than this, but we won't try to deal with that.)

#------------------------------
# PART 4: 

openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')

   # Note use of non-greedy matching for the text falling
   # *between* the open/close tag pair - to avoid false
   # results where have two similar tag pairs on same line.

#------------------------------
# PART 5: URLS

   # This is quite tricky. The URL expressions in the file
   # are of two kinds, of which the first is a string
   # between double quotes ("..") which may include
   # whitespace. For this case we might have a regex: 

url = re.compile('href=("[^">]+")', re.I)

   # The second case does not have quotes, and does not
   # allow whitespace, consisting of a continuous sequence
   # of non-whitespace material (that ends when you reach a
   # space or close bracket '>'). This might be: 

url = re.compile(r'href=([^">\s]+)', re.I)

   # We can combine these two cases as follows, and still
   # get the expression back as group(1):

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

   # Note that I've done nothing here to exclude 'mailto:'
   # links as being accepted as URLS. 

#------------------------------

with open('RGX_DATA.html') as infs: 
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')
    
        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)
    
        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))
    
        # PART 2,3: find open/close tags (+ params of open tags)
    
        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print('    PARAM:', param)
    
        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))
    
        # PART 4: find open/close tag pairs appearing on same line
    
        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))
    
        # PART 5: find URLs:
    
        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))

        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)

        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end = '')
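The combined URL pattern from PART 5 can also be checked in isolation. The sketch below reuses that regex unchanged; the sample lines, the example.com addresses and the mailto filter at the end are assumptions added for illustration.

```python
import re

# Combined pattern from PART 5: a quoted value (may contain spaces)
# or an unquoted run of non-whitespace characters.
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

# Hypothetical sample lines covering both cases plus a mailto link.
lines = [
    '<a href="http://example.com/my page.html">quoted</a>',
    '<A HREF=http://example.com/index.html>unquoted</A>',
    '<a href="mailto:someone@example.com">mail</a>',
]

found = [m.group(1) for text in lines for m in url.finditer(text)]

# One possible way to drop mailto links after matching (an addition,
# not part of the solution above): remove quotes, then test the prefix.
web_only = [u for u in found
            if not u.strip('"').lower().startswith('mailto:')]

for u in found:
    print('** URL:', u)
```

Quoted values are returned with their surrounding double quotes, because the quotes sit inside group 1 of the pattern.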
