
Regular Expressions for Matching Multiline Text Blocks in Python, Explained

Updated: 2023-06-14 10:01:20  Author: 跡憶客

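This article covers regular expressions for matching multiline text blocks in Python. The solution combines several approaches for known and unknown patterns and explains how each matching pattern works.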

Why write a regular expression that matches multiline strings

Suppose we have the following text block:

Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.\n
\n
IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.

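From the text block above, we need to find the starting text and the text that appears a few lines below it. Note that \n denotes a newline character, not literal text.

In short, we want to find and match text across multiple lines, ignoring any blank lines that may appear in between. For the text above, a single regular expression query should return both the line Any compiled body… and the line IBM first used the term….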

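Possible solutions for matching multiline strings

Before discussing solutions to this particular problem, we need to understand several parts of the regex (regular expression) API, especially those used repeatedly throughout the solutions.

So, let's start with re.compile().

The Python re.compile() method

re.compile() compiles a regular expression pattern into a regular expression object, which we can then use for matching through its match(), search(), and other methods.

One advantage of re.compile() over uncompiled patterns is reusability. We can use the compiled expression many times instead of passing the pattern string again for every call.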

import re as regex
pattern = regex.compile(".+World")
print(pattern.match("Hello World!"))
print(pattern.search("Hello World!"))

Output:

<re.Match object; span=(0, 11), match='Hello World'>
<re.Match object; span=(0, 11), match='Hello World'>
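
In the example above, match() and search() return the same result because the pattern happens to match at the very start of the string. The following is a minimal sketch of where they differ and of reusing one compiled pattern for several inputs; the sample strings and variable names are made up for illustration.

```python
import re as regex

# Compile once, then reuse the same object for several inputs.
pattern = regex.compile(r"World")

texts = ["World peace", "Hello World!", "no hit here"]

# match() only succeeds when the pattern matches at position 0.
match_hits = [bool(pattern.match(t)) for t in texts]

# search() scans the whole string for the first occurrence.
search_hits = [bool(pattern.search(t)) for t in texts]

print(match_hits)   # [True, False, False]
print(search_hits)  # [True, True, False]
```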

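The Python re.search() method

re.search() searches a string for a match and returns a Match object if one is found.

If there is more than one match, only the first one is returned.

We can also call it directly without re.compile(), which suits cases where the query only needs to run once.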

import re as regex
print(regex.search(".+World", "Hello World!"))

Output:

<re.Match object; span=(0, 11), match='Hello World'>
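
As a small sketch of the "first match only" behavior, the snippet below searches a made-up string that contains several candidates, and also shows that re.search() returns None when nothing matches. The text and names are illustrative assumptions.

```python
import re as regex

text = "cat, cot, cut"

# Three substrings fit the pattern, but only the leftmost one is returned.
first = regex.search(r"c.t", text)
print(first.group(), first.span())  # cat (0, 3)

# When nothing matches, the result is None rather than a Match object.
missing = regex.search(r"dog", text)
print(missing)  # None
```

Because the result can be None, it is usually worth checking it before calling group() or span().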

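The Python re.finditer() method

re.finditer() matches a pattern in a string and returns an iterator that yields a Match object for every non-overlapping match.

We can then iterate over the matches and do whatever is needed with each one. Matches are yielded in the order they are found, scanning the string from left to right.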

import re as regex
matches = regex.finditer(r'[aeoui]', 'vowel letters')
for match in matches:
    print(match)

Output:

<re.Match object; span=(1, 2), match='o'>
<re.Match object; span=(3, 4), match='e'>
<re.Match object; span=(7, 8), match='e'>
<re.Match object; span=(10, 11), match='e'>
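
Each Match object carries its own position, which is the main reason to prefer finditer() when offsets matter. A minimal sketch, using a made-up string:

```python
import re as regex

text = "one two  three"

# Collect each word together with its start and end offsets.
positions = [(m.group(), m.start(), m.end())
             for m in regex.finditer(r"\w+", text)]

print(positions)  # [('one', 0, 3), ('two', 4, 7), ('three', 9, 14)]
```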

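The Python re.findall() method

re.findall() returns a list of all non-overlapping matches of the pattern in the string: a list of strings, or a list of tuples if the pattern contains more than one capture group. The string is scanned from left to right, and matches are returned in the order they are found.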

import re as regex
# Find all capital words
string= ',,21312414.ABCDEFGw#########'
print(regex.findall(r'[A-Z]+', string))

Output:

['ABCDEFG']
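
The shape of the list returned by findall() depends on how many capture groups the pattern has. The sketch below illustrates the three cases on a made-up string:

```python
import re as regex

text = "x=1, y=22, z=333"

# No groups: each item is the whole match.
no_groups = regex.findall(r"\w=\d+", text)

# One group: each item is the text captured by that group.
one_group = regex.findall(r"\w=(\d+)", text)

# Two groups: each item is a tuple with one entry per group.
two_groups = regex.findall(r"(\w)=(\d+)", text)

print(no_groups)   # ['x=1', 'y=22', 'z=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('x', '1'), ('y', '22'), ('z', '333')]
```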

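The Python re.MULTILINE flag

re.MULTILINE is a flag rather than a method. Its notable advantage is that it lets ^ match at the beginning of every line instead of only at the beginning of the string. Likewise, $ matches at the end of every line.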
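
A minimal sketch of the difference, using a made-up three-line string:

```python
import re as regex

text = "first line\nsecond line\nthird line"

# Without the flag, ^ anchors only at the start of the whole string.
without_flag = regex.findall(r"^\w+", text)

# With the flag, ^ also anchors right after every newline.
with_flag = regex.findall(r"^\w+", text, regex.MULTILINE)

# With the flag, $ anchors at the end of every line as well.
line_ends = regex.findall(r"\w+$", text, regex.MULTILINE)

print(without_flag)  # ['first']
print(with_flag)     # ['first', 'second', 'third']
print(line_ends)     # ['line', 'line', 'line']
```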

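Python regular expression symbols

Regular expression symbols quickly become confusing when combined in complex ways. Below are some of the symbols used in our solutions, to help with the basic idea behind each one.

^ asserts the position at the start of a line
A literal string such as Expression matches exactly those characters (case-sensitive)
. matches any character (except line terminators)
+ matches the previous token one or more times, as many times as possible
\n matches a line feed (newline)
\r matches a carriage return (CR)
? matches the previous token zero or one time
+? matches the previous token one or more times, as few times as possible
a-z, inside a character class, matches a single character in the range a to z (case-sensitive)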
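
The difference between the greedy +, the lazy +?, and the optional ? is easiest to see side by side. The snippet below is a small sketch on made-up strings:

```python
import re as regex

text = "<a><b><c>"

# Greedy: .+ takes as much as possible, so the match runs to the last '>'.
greedy = regex.search(r"<.+>", text).group()

# Lazy: .+? takes as little as possible, so the match ends at the first '>'.
lazy = regex.search(r"<.+?>", text).group()

# ? makes the previous token optional (zero or one occurrence).
optional = regex.findall(r"colou?r", "color colour")

# . does not cross a line break, so .+ stops at the end of the first line.
dot_stops = regex.search(r".+", "line one\nline two").group()

print(greedy)     # <a><b><c>
print(lazy)       # <a>
print(optional)   # ['color', 'colour']
print(dot_stops)  # line one
```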

Matching a multiline text block in Python with re.compile()

Let's look at how different patterns are used.

Example code:

import re as regex
multiline_string = "Regular\nExpression"
print(regex.search(r'^Expression', multiline_string, regex.MULTILINE))

Output:

<re.Match object; span=(8, 18), match='Expression'>

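The expression above first asserts a position at the start of a line (because of ^) and then searches for the exact occurrence of "Expression".

Using the MULTILINE flag ensures that every line is checked for "Expression", not just the first one.

Example code: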

import re as regex
data = """Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.\n
\n
IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.
"""
result = regex.compile(r"^(.+)(?:\n|\r\n)+((?:(?:\n|\r\n?).+)+)", regex.MULTILINE)
print(result.search(data)[0].replace("\n", ""))

Output:

Any compiled body of information is known as a data set. Depending on the situation's specifics, this may be a database or a simple array.IBM first used the term "data set," which meant essentially the same thing as "file," to describe a collection of related records.

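The regular expression can be broken into smaller pieces to make it easier to read:

The first capture group, (.+), matches every character on the line (except line terminators), as many times as possible.

Next, the non-capturing group (?:\n|\r\n) matches a line feed, or a carriage return followed by a line feed. The + after it repeats this as many times as possible.

The second capture group, ((?:(?:\n|\r\n?).+)+), wraps a non-capturing group (?:(?:\n|\r\n?).+) that first matches one line break: a line feed, or a carriage return optionally followed by a line feed.

After that line break, .+ matches every character on the line, excluding line terminators. The whole group is repeated as many times as possible.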
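
To see what each group actually captures, here is a small sketch that runs the same pattern on a shorter text. The sample strings and variable names are illustrative assumptions. It also suggests that the pattern only matches when at least one blank line separates the two lines, since one line break is consumed before the second group and another inside it.

```python
import re as regex

pattern = regex.compile(
    r"^(.+)(?:\n|\r\n)+((?:(?:\n|\r\n?).+)+)", regex.MULTILINE
)

# Two lines separated by blank lines.
data = "First paragraph line.\n\n\nSecond paragraph line.\n"
first, rest = pattern.search(data).groups()

print(repr(first))  # 'First paragraph line.'
print(repr(rest))   # '\nSecond paragraph line.'

# With no blank line in between, this pattern finds nothing.
no_blank = pattern.search("a\nb\n")
print(no_blank)  # None
```

The second group keeps its leading line break, which is why the article's example strips "\n" from the result before printing.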

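Example code:

import re as regex
data = """Regex In Python

Regex is a feature available in all programming languages used to find patterns in text or data.
"""
query = regex.compile(r"^(.+?)\n([\a-z]+)", regex.MULTILINE)
for match in query.finditer(data):
    topic, content = match.groups()
    print("Topic:", topic)
    print("Content:", content)

Output:

Topic: Regex In Python
Content:
Regex is a feature available in all programming languages used to find patterns in text or data.

The expression above can be explained as follows:

The first capture group, (.+?), matches any characters except line terminators, as few times as possible. It is followed by a single line feed, \n, so the group ends up holding the whole first line.

The second capture group, ([\a-z]+), then matches as many characters as possible from the class [\a-z]. Inside a character class, \a is the bell character (0x07), so this class is the range from 0x07 to z. That range includes the line feed, the space, digits, uppercase and lowercase letters, and most ASCII punctuation, which is why the group picks up the blank line and the full sentence, including its capital R and final period.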
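
As an alternative, the following sketch spells out the same intent (a heading line, any number of line breaks, then the body lines) without relying on a wide character range. The pattern, the two-line body, and the names used here are assumptions for illustration, not the article's own solution.

```python
import re as regex

data = "Regex In Python\n\nRegex is a feature.\nSecond content line.\n"

# Group 1: the heading line. \n+ skips the line break and any blank lines.
# Group 2: one or more non-empty lines, each with an optional line break.
query = regex.compile(r"^(.+)\n+((?:.+\n?)+)", regex.MULTILINE)

match = query.search(data)
topic, content = match.groups()

print("Topic:", topic)              # Topic: Regex In Python
print("Content:", repr(content))    # Content: 'Regex is a feature.\nSecond content line.\n'
```

Calling content.strip() afterwards removes the trailing line break if it is not wanted.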

Matching a multiline text block in Python with re.findall()

Example code:

import re as regex
data = """When working with regular expressions, the sub() function of the re library is an invaluable tool.
the subroutine looks over the string for the given pattern and applies the given replacement to all instances where it is found.
"""
query = regex.findall(r'([^\n\r]+)[\n\r]([a-z \n\r]+)', data)
for results in query:
    for result in results:
        print(result.replace("\n",""))

Output:

When working with regular expressions, the sub() function of the re library is an invaluable tool.
the subroutine looks over the string for the given pattern and applies the given replacement to all instances where it is found

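To understand the regular expression better, let's break it down group by group and see what each part does.

The first capture group, ([^\n\r]+), matches as many characters as possible, excluding line feeds and carriage returns.

Next, [\n\r] matches a single character that is either a carriage return or a line feed.

The second capture group, ([a-z \n\r]+), matches as many characters as possible that are lowercase letters a-z, spaces, line feeds, or carriage returns. It stops at the first character outside that set, which is why the final period is missing from the output.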
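
Because the second group's character class includes line breaks, it can span several lines, and it ends at the first character the class does not allow. A small sketch with a made-up string shows both effects:

```python
import re as regex

pattern = r'([^\n\r]+)[\n\r]([a-z \n\r]+)'
data = "Title Line\nlower case body\nmore body\nStops Here"

results = regex.findall(pattern, data)

# One tuple per match: (first line, following lowercase lines).
# The second group runs across two lines and stops before the capital 'S'.
print(results)  # [('Title Line', 'lower case body\nmore body\n')]
```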
