How to build complex regular expressions in Python by composition
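Complex regular expressions are painful: hard to write and hard to debug. With just two functions, you can build a complex regex by composing simple ones.
For example, take a string of rules in which {name} refers to a rule defined earlier: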
# rules definition
rules = r'''
    protocol = http|https
    login_name = [^:@\r\n\t ]+
    login_pass = [^@\r\n\t ]+
    login = {login_name}(:{login_pass})?
    host = [^:/@\r\n\t ]+
    port = \d+
    optional_port = (?:[:]{port})?
    path = /[^\r\n\t ]*
    url = {protocol}://({login}[@])?{host}{optional_port}{path}?
'''
Then call the regex_build function to convert the rules above into a dictionary and print it:
# expand patterns in a dictionary
m = regex_build(rules, capture = True)

# list generated patterns
for k, v in m.items():
    print(k, '=', v)
Result:
protocol = (?P<protocol>http|https)
login_name = (?P<login_name>[^:@\r\n\t ]+)
login_pass = (?P<login_pass>[^@\r\n\t ]+)
login = (?P<login>(?P<login_name>[^:@\r\n\t ]+)(:(?P<login_pass>[^@\r\n\t ]+))?)
host = (?P<host>[^:/@\r\n\t ]+)
port = (?P<port>\d+)
optional_port = (?P<optional_port>(?:[:](?P<port>\d+))?)
path = (?P<path>/[^\r\n\t ]*)
url = (?P<url>(?P<protocol>http|https)://((?P<login>(?P<login_name>[^:@\r\n\t ]+)(:(?P<login_pass>[^@\r\n\t ]+))?)[@])?(?P<host>[^:/@\r\n\t ]+)(?P<optional_port>(?:[:](?P<port>\d+))?)(?P<path>/[^\r\n\t ]*)?)
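A regex this complex is very hard to write directly by hand, and just as hard to debug once written. With the compositional approach, the small, simple regexes can be tested ahead of time and assembled when needed, which makes mistakes far less likely. The output above is the result after assembly and substitution.
Now let's use the url rule to match some text: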
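The "test the small pieces first" idea can be sketched as follows. This is only an illustration under stated assumptions: it checks two sub-patterns copied from the rules above with re.fullmatch, then joins them with plain string concatenation rather than regex_build. The name host_port is hypothetical and not part of the original rules.

```python
import re

# Sub-patterns copied from the rules above
host = r'[^:/@\r\n\t ]+'
port = r'\d+'

# Test each small piece in isolation first
assert re.fullmatch(host, 'www.baidu.com')
assert re.fullmatch(port, '8080')
assert re.fullmatch(port, '80a') is None

# Assemble only after the pieces are known to work
# (host_port is a hypothetical name used for this sketch)
host_port = '(?:' + host + ')(?::(?:' + port + '))?'
m = re.fullmatch(host_port, 'www.baidu.com:8080')
print(m.group(0))
```

Because each piece is wrapped in a non-capturing group before being joined, a piece that contains an alternation such as http|https would still combine safely.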
# match using the rule "url"
pattern = m['url']
s = re.match(pattern, 'https://name:pass@www.baidu.com:8080/haha')

# print the full match
print('matched: "%s"'%s.group(0))
print()

# print the named subgroup matches
for name in ('url', 'login_name', 'login_pass', 'host', 'port', 'path'):
    print('subgroup:', name, '=', s.group(name))
Output:
match text with pattern "url"
matched: "https://name:pass@www.baidu.com:8080/haha"
subgroup: url = https://name:pass@www.baidu.com:8080/haha
subgroup: login_name = name
subgroup: login_pass = pass
subgroup: host = www.baidu.com
subgroup: port = 8080
subgroup: path = /haha
You can take the full match, or use a rule name to retrieve the match for one specific component.
This makes writing complex regular expressions much more convenient.
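As a small aside on reading results by name, a match object can also return all named groups at once through groupdict(). The sketch below uses a short hand-written pattern in the same (?P<name>...) form that regex_build generates. The pattern and the sample text are made up for illustration.

```python
import re

# A hand-written pattern using the same named-group form as the generated rules
pattern = r'(?P<host>[^:/@\r\n\t ]+)(?::(?P<port>\d+))?'
m = re.match(pattern, 'www.baidu.com:8080')

# groupdict() returns every named group in one dictionary
groups = m.groupdict()
for name, value in groups.items():
    print(name, '=', value)
```

A named group that did not take part in the match (for example port when the text has no port) comes back as None, so it is worth checking before use.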
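In Python regular expressions, {xxx} denotes a repetition count such as {3} or {2,5}, and its content is numeric. A rule name inside the braces therefore does not conflict with the existing syntax, so this notation is safe.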
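A quick check of this behaviour in Python's re module, as a minimal sketch: braces containing a valid count act as a quantifier, while braces containing a name are treated as literal text.

```python
import re

# {2} is a repetition count: "a" exactly twice
assert re.fullmatch(r'a{2}', 'aa')

# {name} is not a valid repetition count, so re treats it as literal text
literal = re.fullmatch(r'a{name}', 'a{name}')
assert literal is not None
assert re.fullmatch(r'a{name}', 'a') is None

# Caveat: {,3} is a valid count (0 to 3) that does not start with a digit
upto3 = re.fullmatch(r'a{,3}', 'aaa')
assert upto3 is not None
```

One caveat follows from the last case: regex_expand passes braces through unchanged only when their content is empty or starts with a digit, so a count written as {,3} would be looked up as a rule name. Writing it as {0,3} avoids that.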
Implementation:
import re

# Replace text of the form {name} in pattern with the predefined rules in macros
def regex_expand(macros, pattern, guarded = True):
    output = []
    pos = 0
    size = len(pattern)
    while pos < size:
        ch = pattern[pos]
        if ch == '\\':
            # copy an escape sequence such as \{ through unchanged
            output.append(pattern[pos:pos + 2])
            pos += 2
            continue
        elif ch != '{':
            output.append(ch)
            pos += 1
            continue
        p2 = pattern.find('}', pos)
        if p2 < 0:
            output.append(ch)
            pos += 1
            continue
        p3 = p2 + 1
        name = pattern[pos + 1:p2].strip('\r\n\t ')
        if name == '':
            output.append(pattern[pos:p3])
            pos = p3
            continue
        elif name[0].isdigit():
            # a repetition count such as {3} or {2,5}: keep as-is
            output.append(pattern[pos:p3])
            pos = p3
            continue
        elif ('<' in name) or ('>' in name):
            raise ValueError('invalid pattern name "%s"'%name)
        if name not in macros:
            raise ValueError('{%s} is undefined'%name)
        if guarded:
            output.append('(?:' + macros[name] + ')')
        else:
            output.append(macros[name])
        pos = p3
    return ''.join(output)

# Build a dictionary of rules from the rule text
def regex_build(code, macros = None, capture = True):
    defined = {}
    if macros is not None:
        for k, v in macros.items():
            defined[k] = v
    line_num = 0
    for line in code.split('\n'):
        line_num += 1
        line = line.strip('\r\n\t ')
        if (not line) or line.startswith('#'):
            continue
        pos = line.find('=')
        if pos < 0:
            raise ValueError('%d: not a valid rule'%line_num)
        head = line[:pos].strip('\r\n\t ')
        body = line[pos + 1:].strip('\r\n\t ')
        if (not head):
            raise ValueError('%d: empty rule name'%line_num)
        elif head[0].isdigit():
            raise ValueError('%d: invalid rule name "%s"'%(line_num, head))
        elif ('<' in head) or ('>' in head):
            raise ValueError('%d: invalid rule name "%s"'%(line_num, head))
        try:
            pattern = regex_expand(defined, body, guarded = not capture)
        except ValueError as e:
            raise ValueError('%d: %s'%(line_num, str(e)))
        try:
            re.compile(pattern)
        except re.error:
            raise ValueError('%d: invalid pattern "%s"'%(line_num, pattern))
        if not capture:
            defined[head] = pattern
        else:
            defined[head] = '(?P<%s>%s)'%(head, pattern)
    return defined

# Define a set of composable rules
rules = r'''
    protocol = http|https
    login_name = [^:@\r\n\t ]+
    login_pass = [^@\r\n\t ]+
    login = {login_name}(:{login_pass})?
    host = [^:/@\r\n\t ]+
    port = \d+
    optional_port = (?:[:]{port})?
    path = /[^\r\n\t ]*
    url = {protocol}://({login}[@])?{host}{optional_port}{path}?
'''

# Expand the rules above into a dictionary
m = regex_build(rules, capture = True)

# Print the dictionary contents
for k, v in m.items():
    print(k, '=', v)

print()

# Match text with the final rule "url"
pattern = m['url']
s = re.match(pattern, 'https://name:pass@www.baidu.com:8080/haha')

# Print the full match
print('matched: "%s"'%s.group(0))
print()

# Print the subgroup matches by name
for name in ('url', 'login_name', 'login_pass', 'host', 'port', 'path'):
    print('subgroup:', name, '=', s.group(name))
That's it: the main logic is 84 lines of code.
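One consequence of the implementation is worth keeping in mind. With capture = True, every rule is wrapped in a named group, so a rule that references the same rule twice expands to two groups with the same name, which Python's re refuses to compile. The sketch below reproduces this with plain re. The rule "pair = {port}-{port}" is hypothetical and does not appear in the rules above.

```python
import re

# What a hypothetical rule "pair = {port}-{port}" would expand to with capture = True
expanded = r'(?P<port>\d+)-(?P<port>\d+)'

try:
    re.compile(expanded)
    compiled_ok = True
except re.error as e:
    # re rejects a pattern that defines the same group name twice
    compiled_ok = False
    print('rejected:', e)

# With capture = False the same references expand to non-capturing groups
guarded = r'(?:\d+)-(?:\d+)'
assert re.fullmatch(guarded, '80-8080')
```

In regex_build this failure would surface as a ValueError reporting an invalid pattern. Passing capture = False sidesteps it, at the cost of losing access to the sub-matches by name.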