
Regular Expressions: Concepts and Practical Applications in Python

Updated: 2025-10-24 09:53:29   Author: mftang
In Python, a regular expression is a powerful text-processing tool that lets you match substrings in a string against a specific pattern. This article introduces the concept of regular expressions and their practical application in Python, with detailed code examples throughout.

Overview

This article covers the definition of regular expressions and their basic usage. A regular expression is a powerful tool; once mastered, it can greatly improve the efficiency of text processing.

1 The Concept of Regular Expressions

A regular expression is a pattern used to match combinations of characters in a string. In programming, regular expressions are used to search, replace, and extract text.

1.1 Basic Syntax

1) Ordinary characters

Most characters (letters, digits, Chinese characters, and so on) match themselves literally. For example, the regex hello matches the substring "hello".

2) Metacharacters

Metacharacters are characters with special meanings in a regular expression:

  • .: matches any character except a newline.

  • ^: matches the start of the string.

  • $: matches the end of the string.

  • *: matches the preceding subexpression zero or more times.

  • +: matches the preceding subexpression one or more times.

  • ?: matches the preceding subexpression zero or one time.

  • {n}: matches the preceding subexpression exactly n times.

  • {n,}: matches the preceding subexpression at least n times.

  • {n,m}: matches the preceding subexpression at least n and at most m times.

  • []: a character class; matches any single character it contains.

  • |: alternation; matches the expression on either side.

  • (): grouping; combines characters into a single unit that can be referenced later.
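To make alternation and grouping concrete, here is a minimal sketch (the sample strings are invented for illustration):

```python
import re

text = "gray grey griy cats and dogs"

# | matches whichever side succeeds.
print(re.findall(r'cats|dogs', text))  # ['cats', 'dogs']

# () limits the scope of |: gr(a|e)y matches "gray" or "grey", but not "griy".
print([m.group() for m in re.finditer(r'gr(a|e)y', text)])  # ['gray', 'grey']
```

Note that re.findall() returns the captured group's contents when the pattern contains a capturing group, which is why re.finditer() with .group() is used for the second pattern.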

3) Escaping

To match a metacharacter literally, escape it with a backslash \. For example, to match the character ., write \.
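A quick illustration of why escaping matters (the sample strings are invented here):

```python
import re

# Unescaped, . matches any character, so "3x14" also matches.
print(re.findall(r'3.14', '3.14 3x14'))   # ['3.14', '3x14']

# Escaped, \. matches only a literal dot.
print(re.findall(r'3\.14', '3.14 3x14'))  # ['3.14']
```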

4) Predefined character classes

  • \d: any digit; equivalent to [0-9].

  • \D: any non-digit; equivalent to [^0-9].

  • \w: a letter, digit, or underscore; equivalent to [a-zA-Z0-9_].

  • \W: any character that is not a letter, digit, or underscore; equivalent to [^a-zA-Z0-9_].

  • \s: any whitespace character, including spaces, tabs, and newlines.

  • \S: any non-whitespace character.

1.2 Using Regular Expressions in Python

Python provides regular expression support through the re module. Commonly used functions include:

1)  re.match()

Matches a pattern at the start of the string. Returns a match object on success, otherwise None.

2)  re.search()

Scans the entire string and returns the first successful match.
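The practical difference between re.match() and re.search() is easy to miss; a minimal sketch (the sample string is invented here):

```python
import re

text = "say hello"

# re.match() only matches at the very start of the string, so this fails.
print(re.match(r'hello', text))           # None

# re.search() scans the whole string and finds the first occurrence.
print(re.search(r'hello', text).group())  # hello
```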

3) re.findall()

Finds all substrings matched by the pattern and returns them as a list.

4) re.finditer()

Like re.findall(), but returns an iterator whose elements are match objects.

5) re.sub()

Replaces the matches in a string.

6) re.split()

Splits the string at every match and returns the resulting list.

2 Regular Expression Applications

2.1 Basic Syntax Examples

Source code

import re

# Basic matching example
text = "Hello, my email is example@email.com and phone is 123-456-7890"

# Find email addresses
email_pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
emails = re.findall(email_pattern, text)
print("Emails found:", emails)

# Find phone numbers
phone_pattern = r'\d{3}-\d{3}-\d{4}'
phones = re.findall(phone_pattern, text)
print("Phones found:", phones)

Output:

Emails found: ['example@email.com']
Phones found: ['123-456-7890']


2.2 Metacharacters in Detail

1) Character classes

Source code

import re

def demonstrate_character_classes():
    """Demonstrate character classes"""
    text = "abc123 XYZ!@#"

    patterns = {
        r'\d': 'digits',  # [0-9]
        r'\D': 'non-digits',  # [^0-9]
        r'\w': 'word characters',  # [a-zA-Z0-9_]
        r'\W': 'non-word characters',  # [^a-zA-Z0-9_]
        r'\s': 'whitespace',  # [ \t\n\r\f\v]
        r'\S': 'non-whitespace',  # [^ \t\n\r\f\v]
        r'[a-z]': 'lowercase letters',  # custom character class
        r'[^0-9]': 'non-digits',  # negated character class
    }

    for pattern, description in patterns.items():
        matches = re.findall(pattern, text)
        print(f"{description} ({pattern}): {matches}")

demonstrate_character_classes()

Output

digits (\d): ['1', '2', '3']
non-digits (\D): ['a', 'b', 'c', ' ', 'X', 'Y', 'Z', '!', '@', '#']
word characters (\w): ['a', 'b', 'c', '1', '2', '3', 'X', 'Y', 'Z']
non-word characters (\W): [' ', '!', '@', '#']
whitespace (\s): [' ']
non-whitespace (\S): ['a', 'b', 'c', '1', '2', '3', 'X', 'Y', 'Z', '!', '@', '#']
lowercase letters ([a-z]): ['a', 'b', 'c']
non-digits ([^0-9]): ['a', 'b', 'c', ' ', 'X', 'Y', 'Z', '!', '@', '#']

2) Quantifiers

Source code

def demonstrate_quantifiers():
    """Demonstrate quantifiers"""
    text = "a aa aaa aaaa b bb bbb"

    patterns = {
        r'a?': '0 or 1 a',
        r'a+': '1 or more a',
        r'a*': '0 or more a',
        r'a{2}': 'exactly 2 a',
        r'a{2,}': '2 or more a',
        r'a{2,4}': '2 to 4 a',
    }

    for pattern, description in patterns.items():
        matches = re.findall(pattern, text)
        print(f"{description} ({pattern}): {matches}")

demonstrate_quantifiers()

Output

0 or 1 a (a?): ['a', '', 'a', 'a', '', 'a', 'a', 'a', '', 'a', 'a', 'a', 'a', '', '', '', '', '', '', '', '', '', '']
1 or more a (a+): ['a', 'aa', 'aaa', 'aaaa']
0 or more a (a*): ['a', '', 'aa', '', 'aaa', '', 'aaaa', '', '', '', '', '', '', '', '', '', '']
exactly 2 a (a{2}): ['aa', 'aa', 'aa', 'aa']
2 or more a (a{2,}): ['aa', 'aaa', 'aaaa']
2 to 4 a (a{2,4}): ['aa', 'aaa', 'aaaa']

3) Anchors and boundaries

def demonstrate_anchors():
    """Demonstrate anchors"""
    lines = [
        "start of line",
        "middle of text",
        "end of line"
    ]

    # Match at the start of the line
    start_pattern = r'^s\w+'
    # Match at the end of the line
    end_pattern = r'\w+line$'
    # Word boundary
    word_boundary = r'\bof\b'

    for line in lines:
        start_match = re.search(start_pattern, line)
        end_match = re.search(end_pattern, line)
        word_match = re.search(word_boundary, line)

        print(f"Line: '{line}'")
        print(f"  Start match: {start_match.group() if start_match else 'None'}")
        print(f"  End match: {end_match.group() if end_match else 'None'}")
        print(f"  Word boundary: {word_match.group() if word_match else 'None'}")
        print()

demonstrate_anchors()

Output

Line: 'start of line'
  Start match: start
  End match: None
  Word boundary: of

Line: 'middle of text'
  Start match: None
  End match: None
  Word boundary: of

Line: 'end of line'
  Start match: None
  End match: None
  Word boundary: of

2.3 Grouping and Capturing

1) Group types

Source code

def demonstrate_groups():
    """Demonstrate groups"""
    text = "John Doe, Jane Smith, Bob Johnson"

    # Capturing groups
    capture_pattern = r'(\w+)\s(\w+)'
    capture_matches = re.findall(capture_pattern, text)
    print("Capture groups:", capture_matches)

    # Non-capturing group
    non_capture_pattern = r'(?:\w+)\s(\w+)'
    non_capture_matches = re.findall(non_capture_pattern, text)
    print("Non-capture groups (only last names):", non_capture_matches)

    # Named groups
    named_pattern = r'(?P<first>\w+)\s(?P<last>\w+)'
    named_matches = re.finditer(named_pattern, text)

    print("Named groups:")
    for match in named_matches:
        print(f"  Full: {match.group()}")
        print(f"  First: {match.group('first')}, Last: {match.group('last')}")


demonstrate_groups()

Output

Capture groups: [('John', 'Doe'), ('Jane', 'Smith'), ('Bob', 'Johnson')]
Non-capture groups (only last names): ['Doe', 'Smith', 'Johnson']
Named groups:
  Full: John Doe
  First: John, Last: Doe
  Full: Jane Smith
  First: Jane, Last: Smith
  Full: Bob Johnson
  First: Bob, Last: Johnson

2) Backreferences

Source code

def demonstrate_backreferences():
    """Demonstrate backreferences"""
    text = "hello hello world world test test"

    # Find repeated words
    duplicate_pattern = r'\b(\w+)\s+\1\b'
    duplicates = re.findall(duplicate_pattern, text)
    print("Duplicate words:", duplicates)

    # Use a backreference in the replacement
    html_text = "<b>bold</b> and <i>italic</i>"
    replacement_pattern = r'<(\w+)>(.*?)</\1>'
    replaced = re.sub(replacement_pattern, r'[\1]: \2', html_text)
    print("After replacement:", replaced)

demonstrate_backreferences()

Output

Duplicate words: ['hello', 'world', 'test']
After replacement: [b]: bold and [i]: italic

2.4 Advanced Features

1) Lookahead and lookbehind

Source code

def demonstrate_lookaround():
    """Demonstrate lookaround assertions"""
    text = "apple $10 orange $20 banana $30"

    # Positive lookahead - digits followed by $
    lookahead_pattern = r'\d+(?=\$)'
    lookahead_matches = re.findall(lookahead_pattern, text)
    print("Positive lookahead (numbers before $):", lookahead_matches)

    # Negative lookahead - digits not followed by $
    negative_lookahead_pattern = r'\d+(?!\$)'
    negative_matches = re.findall(negative_lookahead_pattern, text)
    print("Negative lookahead:", negative_matches)

    # Positive lookbehind - digits preceded by $
    lookbehind_pattern = r'(?<=\$)\d+'
    lookbehind_matches = re.findall(lookbehind_pattern, text)
    print("Positive lookbehind (numbers after $):", lookbehind_matches)

    # Negative lookbehind - digits not preceded by $
    negative_lookbehind_pattern = r'(?<!\$)\d+'
    negative_lookbehind_matches = re.findall(negative_lookbehind_pattern, text)
    print("Negative lookbehind:", negative_lookbehind_matches)

demonstrate_lookaround()

Output

Positive lookahead (numbers before $): []
Negative lookahead: ['10', '20', '30']
Positive lookbehind (numbers after $): ['10', '20', '30']
Negative lookbehind: ['0', '0', '0']

Note that in this text every $ comes before its number, so the positive lookahead finds nothing, and the negative lookbehind matches only the trailing '0' of each price: the leading digits are excluded because they immediately follow $.

2) Conditional matching

def demonstrate_conditional_matching():
    """Demonstrate conditional matching"""
    text = """
    <div>content</div>
    <span>other content</span>
    <div class="special">special content</div>
    """

    # Conditional match: if the tag has class="special", match the special pattern
    # This example is fairly complex; in practice it is often easier to process in steps
    pattern = r'<(\w+)(?:\s+class="special")?>(.*?)</\1>'
    matches = re.findall(pattern, text)
    
    print("Conditional matches:")
    for tag, content in matches:
        print(f"  Tag: {tag}, Content: '{content.strip()}'")

demonstrate_conditional_matching()

Output

Conditional matches:
  Tag: div, Content: 'content'
  Tag: span, Content: 'other content'
  Tag: div, Content: 'special content'

3 The Python re Module

3.1 Demonstrating the Main Functions

Test code:

def demonstrate_re_functions():
    """Demonstrate the main functions of the re module"""
    text = "The quick brown fox jumps over the lazy dog. The dog was lazy."

    # 1. re.search() - find the first match
    first_match = re.search(r'\bfox\b', text)
    print(f"re.search(): {first_match.group() if first_match else 'Not found'}")

    # 2. re.match() - match from the start of the string
    start_match = re.match(r'^The', text)
    print(f"re.match(): {start_match.group() if start_match else 'Not found'}")

    # 3. re.findall() - find all matches
    all_matches = re.findall(r'\b\w{3}\b', text)  # all 3-letter words
    print(f"re.findall() 3-letter words: {all_matches}")

    # 4. re.finditer() - return an iterator
    print("re.finditer():")
    for match in re.finditer(r'\b\w{4}\b', text):  # all 4-letter words
        print(f"  Found '{match.group()}' at position {match.start()}-{match.end()}")

    # 5. re.sub() - substitution
    replaced = re.sub(r'\bdog\b', 'cat', text)
    print(f"re.sub() result: {replaced}")

    # 6. re.split() - splitting
    split_result = re.split(r'\s+', text)  # split on whitespace
    print(f"re.split() first 5 words: {split_result[:5]}")

demonstrate_re_functions()

Output:

re.search(): fox
re.match(): The
re.findall() 3-letter words: ['The', 'fox', 'the', 'dog', 'The', 'dog', 'was']
re.finditer():
  Found 'over' at position 26-30
  Found 'lazy' at position 35-39
  Found 'lazy' at position 57-61
re.sub() result: The quick brown fox jumps over the lazy cat. The cat was lazy.
re.split() first 5 words: ['The', 'quick', 'brown', 'fox', 'jumps']

3.2 Compiling Regular Expressions

Test code:

def demonstrate_compiled_regex():
    """Demonstrate compiled regular expressions"""
    # Compile the pattern (better performance, especially when reused)
    email_pattern = re.compile(r'''
        \b
        [A-Za-z0-9._%+-]+   # user name
        @                   # the @ sign
        [A-Za-z0-9.-]+      # domain
        \.[A-Za-z]{2,}      # top-level domain
        \b
    ''', re.VERBOSE)

    text = """
    Contact us at: 
    john.doe@company.com, 
    jane_smith123@sub.domain.co.uk,
    invalid-email@com
    """

    # Use the compiled pattern
    valid_emails = email_pattern.findall(text)
    print("Valid emails:", valid_emails)

    # Compile with multiple flags
    multi_flag_pattern = re.compile(r'^hello', re.IGNORECASE | re.MULTILINE)
    multi_text = "Hello world\nhello there\nHELLO everyone"
    multi_matches = multi_flag_pattern.findall(multi_text)
    print("Multi-flag matches:", multi_matches)

demonstrate_compiled_regex()

Output:

Valid emails: ['john.doe@company.com', 'jane_smith123@sub.domain.co.uk']
Multi-flag matches: ['Hello', 'hello', 'HELLO']

3.3 A Collection of Common Patterns

Source code file

class CommonRegexPatterns:
    """Common regular expression patterns"""

    # Email validation
    EMAIL = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'

    # Mobile number (mainland China)
    PHONE_CN = r'^1[3-9]\d{9}$'

    # URL
    URL = r'^https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+'

    # IP address
    IP_V4 = r'^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
    IP_V6 = r'^(?:[A-F0-9]{1,4}:){7}[A-F0-9]{1,4}$'

    # National ID number (mainland China)
    ID_CARD = r'^[1-9]\d{5}(18|19|20)\d{2}((0[1-9])|(1[0-2]))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]$'

    # Date (YYYY-MM-DD)
    DATE = r'^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$'

    # Time (HH:MM:SS)
    TIME = r'^([01]?[0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]$'

    # Chinese characters
    CHINESE_CHAR = r'^[\u4e00-\u9fa5]+$'

    # Number (integer or decimal)
    NUMBER = r'^-?\d+(?:\.\d+)?$'

def validate_with_patterns():
    """Validate with the common patterns"""
    test_cases = {
        'email': [
            'test@example.com',
            'invalid-email',
            'user@domain.co.uk'
        ],
        'phone': [
            '13812345678',
            '12345678901',
            '19876543210'
        ],
        'date': [
            '2023-12-25',
            '2023-13-01',
            '1999-02-29'
        ]
    }
    
    patterns = {
        'email': CommonRegexPatterns.EMAIL,
        'phone': CommonRegexPatterns.PHONE_CN,
        'date': CommonRegexPatterns.DATE
    }
    
    for data_type, cases in test_cases.items():
        pattern = patterns[data_type]
        print(f"\nValidating {data_type}:")
        for case in cases:
            is_valid = bool(re.match(pattern, case))
            print(f"  '{case}': {'✓ Valid' if is_valid else '✗ Invalid'}")

validate_with_patterns()

Output:

Validating email:
  'test@example.com': ✓ Valid
  'invalid-email': ✗ Invalid
  'user@domain.co.uk': ✓ Valid

Validating phone:
  '13812345678': ✓ Valid
  '12345678901': ✗ Invalid
  '19876543210': ✓ Valid

Validating date:
  '2023-12-25': ✓ Valid
  '2023-13-01': ✗ Invalid
  '1999-02-29': ✓ Valid

3.4 Performance Optimization Tips

Source code file

import time

def demonstrate_performance():
    """Demonstrate performance optimization"""

    # Test text
    large_text = "test " * 10000 + "target" + " test" * 10000

    # Method 1: call the module-level re functions (pattern looked up each call)
    start_time = time.time()
    for _ in range(100):
        re.search(r'target', large_text)
    direct_time = time.time() - start_time

    # Method 2: use a pre-compiled pattern
    compiled_pattern = re.compile(r'target')
    start_time = time.time()
    for _ in range(100):
        compiled_pattern.search(large_text)
    compiled_time = time.time() - start_time

    print(f"Direct search time: {direct_time:.4f}s")
    print(f"Compiled search time: {compiled_time:.4f}s")
    print(f"Performance improvement: {direct_time / compiled_time:.2f}x")

    # Avoid catastrophic backtracking
    print("\nAvoiding catastrophic backtracking:")

    # Bad pattern (can cause catastrophic backtracking)
    bad_pattern = r'(a+)+b'
    # Good pattern
    good_pattern = r'a+b'

    test_string = "aaaaaaaaaaaaaaaaaaaaaaaa!"

    try:
        start_time = time.time()
        re.match(bad_pattern, test_string)
        bad_time = time.time() - start_time
        print(f"Bad pattern time: {bad_time:.4f}s")
    except Exception:
        print("Bad pattern caused timeout/error")

    start_time = time.time()
    re.match(good_pattern, test_string)
    good_time = time.time() - start_time
    print(f"Good pattern time: {good_time:.4f}s")


demonstrate_performance()

Output:

Direct search time: 0.0091s
Compiled search time: 0.0060s
Performance improvement: 1.53x

Avoiding catastrophic backtracking:
Bad pattern time: 0.8640s
Good pattern time: 0.0000s

4 Practical Applications

4.1 Best-Practices Demo

Source code file

def regex_best_practices():
    """Regular expression best practices"""

    # 1. Use raw strings
    print("1. Use raw strings:")
    bad_string = "\\section"  # the backslash must be escaped
    good_string = r"\section"  # raw string, no escaping needed

    print(f"   Bad: {bad_string}")
    print(f"   Good: {good_string}")

    # 2. Compile patterns you reuse
    print("\n2. Compile patterns you reuse:")
    # Bad: recompile on every call
    # Good: compile once up front

    # 3. Use non-greedy matching
    print("\n3. Use non-greedy matching:")
    html_text = "<div>content</div><div>more</div>"

    greedy_pattern = r'<div>.*</div>'  # greedy
    non_greedy_pattern = r'<div>.*?</div>'  # non-greedy

    greedy_match = re.search(greedy_pattern, html_text)
    non_greedy_matches = re.findall(non_greedy_pattern, html_text)

    print(f"   Greedy: {greedy_match.group() if greedy_match else 'None'}")
    print(f"   Non-greedy: {non_greedy_matches}")

    # 4. Use character classes instead of alternation
    print("\n4. Use character classes:")
    bad_pattern = r'[0123456789]'  # verbose
    good_pattern = r'[0-9]'  # concise
    better_pattern = r'\d'  # better

    test_text = "abc123"
    print(f"   Bad pattern matches: {re.findall(bad_pattern, test_text)}")
    print(f"   Good pattern matches: {re.findall(good_pattern, test_text)}")
    print(f"   Better pattern matches: {re.findall(better_pattern, test_text)}")

regex_best_practices()

Output:

1. Use raw strings:
   Bad: \section
   Good: \section

2. Compile patterns you reuse:

3. Use non-greedy matching:
   Greedy: <div>content</div><div>more</div>
   Non-greedy: ['<div>content</div>', '<div>more</div>']

4. Use character classes:
   Bad pattern matches: ['1', '2', '3']
   Good pattern matches: ['1', '2', '3']
   Better pattern matches: ['1', '2', '3']

4.2 Log Analysis

Source code file

def log_analysis_example():
    """Log analysis example"""

    log_data = """
    2023-12-01 10:30:15 INFO User john_doe logged in from 192.168.1.100
    2023-12-01 10:35:22 ERROR Database connection failed
    2023-12-01 10:40:05 WARNING High memory usage detected (85%)
    2023-12-01 10:45:30 INFO User jane_smith accessed /api/data
    2023-12-01 10:50:17 ERROR File not found: /var/www/image.jpg
    """

    # Parse log entries
    log_pattern = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)'

    print("Log Analysis:")
    print("-" * 50)

    for match in re.finditer(log_pattern, log_data):
        timestamp, level, message = match.groups()

        # Color the output by log level
        if level == 'ERROR':
            level_display = f"\033[91m{level}\033[0m"  # red
        elif level == 'WARNING':
            level_display = f"\033[93m{level}\033[0m"  # yellow
        else:
            level_display = f"\033[92m{level}\033[0m"  # green

        print(f"{timestamp} {level_display} {message}")

    # Count log levels
    level_pattern = r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} (\w+)'
    levels = re.findall(level_pattern, log_data)
    
    from collections import Counter
    level_counts = Counter(levels)
    
    print("\nLog Level Statistics:")
    for level, count in level_counts.items():
        print(f"  {level}: {count}")

log_analysis_example()

Output:

Log Analysis:
--------------------------------------------------
2023-12-01 10:30:15 INFO User john_doe logged in from 192.168.1.100
2023-12-01 10:35:22 ERROR Database connection failed
2023-12-01 10:40:05 WARNING High memory usage detected (85%)
2023-12-01 10:45:30 INFO User jane_smith accessed /api/data
2023-12-01 10:50:17 ERROR File not found: /var/www/image.jpg

Log Level Statistics:
  INFO: 2
  ERROR: 2
  WARNING: 1

4.3 Data Extraction and Cleaning

Source code file

def data_cleaning_example():
    """Data cleaning example"""

    dirty_data = """
    Names: John Doe, Jane Smith, Bob Johnson
    Emails: john@test.com, jane@example.org, invalid-email
    Phones: 123-456-7890, 555.123.4567, (999) 888-7777, invalid-phone
    Dates: 2023/12/01, 01-12-2023, 2023.12.01, invalid-date
    """

    # Define the cleaning rules
    cleaning_rules = {
        'emails': CommonRegexPatterns.EMAIL,  # anchored with ^ and $, so it matches nothing inside running text
        'phones': r'\b\d{3}[-.)]\d{3}[-.]\d{4}\b',
        'dates': r'\b\d{4}[-/.]\d{2}[-/.]\d{2}\b',
        'names': r'\b[A-Z][a-z]+ [A-Z][a-z]+\b'
    }

    print("Data Cleaning Results:")
    print("-" * 40)

    for data_type, pattern in cleaning_rules.items():
        matches = re.findall(pattern, dirty_data)
        print(f"{data_type.capitalize()}: {matches}")


data_cleaning_example()

Output:

Data Cleaning Results:
----------------------------------------
Emails: []
Phones: ['123-456-7890', '555.123.4567']
Dates: ['2023/12/01', '2023.12.01']
Names: ['John Doe', 'Jane Smith', 'Bob Johnson']

Summary

This concludes the introduction to the concepts of regular expressions and their practical application in Python. Hopefully the examples above help you apply regular expressions to your own text-processing work.
