String Normalization in Python with unicodedata

Updated: 2023-06-08 15:37:06  Author: 古明地覺的編程教室

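This article looks at unicodedata, a module in Python's standard library that exposes the Unicode Character Database and is built for working with Unicode strings. Let's walk through how to use it.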

unicodedata.normalize

normalize converts a Unicode string into a normalized form. This matters because some characters look as if they have length 1 but actually don't. For example:

s1 = "é"  # precomposed: the single code point U+00E9
s2 = "é"  # typed as "e" followed by the combining acute accent U+0301
s3 = "e\u0301"
print(len(s1))  # 1
print(len(s2))  # 2
print(len(s3))  # 2
print(s1 == s2)  # False
print(s2 == s3)  # True
print(s1, s2, s3)  # é é é

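The output may come as a surprise: s2 looks like it has length 1, but its length is actually 2. That is because a Python 3 string stores a sequence of code points, and some characters can be represented in more than one way. Take é: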

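Precomposed character: s1 in the code. It is a single Unicode code point (U+00E9) that represents a lowercase e with an acute accent;
Decomposed sequence: s3 in the code. The lowercase letter e and the combining acute accent (\u0301) are kept separate, so two Unicode code points are used. s2 looks just like s1, but it is represented the same way as s3;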

# Two code points: "e" followed by the combining acute accent
s = "é"
print(s[0] == "e")  # True
print(s[1] == "\u0301")  # True
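
To see exactly which code points a string contains, a small helper can list them. This is only an illustrative sketch: the name describe is made up for this example, and it relies on unicodedata.name, which is covered later in the article.

```python
import unicodedata

def describe(text):
    """Return a (hex code point, Unicode name) pair for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# Precomposed form: one code point
print(describe("\u00e9"))
# [('U+00E9', 'LATIN SMALL LETTER E WITH ACUTE')]

# Decomposed form: base letter plus combining mark
print(describe("e\u0301"))
# [('U+0065', 'LATIN SMALL LETTER E'), ('U+0301', 'COMBINING ACUTE ACCENT')]
```

Printing the code points this way makes it obvious why two strings that render identically can still compare unequal.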

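Handling these cases by hand is a headache, and that is what normalize is for: it replaces a character written as several code points with its single-code-point equivalent, where one exists. The string looks the same before and after normalization, but it becomes shorter.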

import unicodedata
s1 = "lové"  # ends with the precomposed é (U+00E9)
s2 = "lové"  # ends with "e" + combining acute accent (U+0301)
s3 = "love\u0301"
print(s1 == s2 == s3)  # False
print(len(s1), len(s2), len(s3))  # 4 5 5
# Normalize the strings: the first argument is the normalization form,
# the second is the string
# "NFC": replace characters written as several code points with the
# equivalent single-code-point form
normalize_s2 = unicodedata.normalize("NFC", s2)
normalize_s3 = unicodedata.normalize("NFC", s3)
print(
    s1 == normalize_s2 == normalize_s3
)  # True
print(s2, normalize_s2)  # lové lové
print(len(s2), len(normalize_s2))  # 5 4

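Besides NFC there is also NFD, which does the opposite: a character represented by a single code point is expanded into the equivalent sequence of several code points.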

import unicodedata
s1 = "lové"  # ends with the precomposed é (U+00E9)
s2 = "lové"  # ends with "e" + combining acute accent (U+0301)
s3 = "love\u0301"
print(len(s1), len(s2), len(s3))  # 4 5 5
# The é in s1 is a single code point; turn it into the equivalent two code points
normalize_s1 = unicodedata.normalize("NFD", s1)
print(len(normalize_s1))  # 5
print(s1 == s3)  # False
print(normalize_s1 == s3)  # True

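That is Unicode normalization in a nutshell. Some Unicode characters can be represented in more than one way, as a single code point or as a sequence of code points, yet they all look the same. NFC replaces a multi-code-point sequence with the single-code-point form, and NFD replaces the single-code-point form with the multi-code-point sequence.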
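
In practice this usually means normalizing both sides before comparing strings. The sketch below is one way to do that; same_text is a hypothetical helper name, and unicodedata.is_normalized is only available on Python 3.8 and later.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "lov\u00e9"      # precomposed é
decomposed = "love\u0301"   # e + combining acute accent

print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True

# is_normalized (Python 3.8+) checks a string without building a new one
print(unicodedata.is_normalized("NFC", composed))    # True
print(unicodedata.is_normalized("NFC", decomposed))  # False
```

Which form to pick matters less than applying the same one to both sides.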

Note, however, that if a character has only one representation, normalization leaves it unchanged.

import unicodedata
# Each of these six characters has only one representation
s = "satori"
nfc_s = unicodedata.normalize("NFC", s)
nfd_s = unicodedata.normalize("NFD", s)
print(s == nfc_s == nfd_s)  # True
# This emoji likewise has only one representation, and it takes two code points
s = "\u2764\ufe0f"  # ❤️ (heart + variation selector)
nfc_s = unicodedata.normalize("NFC", s)
nfd_s = unicodedata.normalize("NFD", s)
print(s == nfc_s == nfd_s)  # True

So NFC amounts to composition, and NFD amounts to decomposition.

Besides NFC and NFD there are also NFKC and NFKD. What is the difference? In plain terms:

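With NFC and NFD, the characters before and after normalization may use different code points, but at least they look the same; the two forms are visually equivalent. This kind of replacement is called canonical replacement.
NFKC and NFKD, on the other hand, perform compatibility replacement: they care more about what the text means than about how it looks.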

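import unicodedata
print(unicodedata.normalize("NFKD", "㍿"))  # 株式会社
print(unicodedata.normalize("NFKD", "㊥"))  # 中
print(
    unicodedata.normalize("NFKD", "①②③④⑤")
)  # 12345
# Normalizing the strings above with NFC or NFD leaves them as they were.
# '㍿' and '株式会社' are clearly not the same thing; they don't even look alike,
# so they cannot be different representations of the same character.
# To a human reader, though, they are the same thing: semantically equivalent.
# That is why NFKD is a compatibility replacement: it replaces by meaning,
# e.g. fullwidth to halfwidth, or a composite character split into
# several independent characters.
comma1 = "，"
comma2 = ","
print(
    unicodedata.normalize("NFD", comma1) == comma2
)  # False
print(
    unicodedata.normalize("NFKD", comma1) == comma2
)  # True
# comma1 is the Chinese (fullwidth) comma, comma2 is the English (halfwidth) comma.
# NFD leaves comma1 unchanged, because fullwidth and halfwidth forms
# are not canonically the same character.
# A human, however, sees at a glance that both are commas,
# so NFKD converts the Chinese comma into the English comma.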

To sum up, the normalization forms differ as follows:

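NFC: some characters can be written in more than one way; NFC converts a character written as several code points into its single-code-point form. The character looks the same before and after;
NFD: the opposite of NFC; it converts a character written as one code point into the multi-code-point form. The character also looks the same before and after;
NFKD: compatibility decomposition based on meaning (fullwidth to halfwidth, composite characters split apart). The appearance may change, but the real-world meaning does not. Examples: ㊥ and 中, ⽕ (U+2F55, KANGXI RADICAL FIRE) and 火;
NFKC: NFKD stops after compatibility decomposition, whereas NFKC then applies canonical composition to the result;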

In most cases NFC is all we need; NFKD is also handy for things like circled digits.
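
The difference between NFKD and NFKC is easiest to see on a string that contains both a compatibility character and an accented letter. The sample string below is just an assumption chosen for illustration.

```python
import unicodedata

# "ﬁancé" written with the U+FB01 ligature and a precomposed é (U+00E9)
s = "\ufb01anc\u00e9"

nfkd = unicodedata.normalize("NFKD", s)
nfkc = unicodedata.normalize("NFKC", s)

# NFKD splits the ligature and the é; NFKC splits the ligature, then recomposes é
print(len(s), len(nfkd), len(nfkc))  # 5 7 6
print(nfkd == "fiance\u0301")  # True
print(nfkc == "fianc\u00e9")   # True

# Fullwidth letters and digits are folded to their ASCII counterparts
print(unicodedata.normalize("NFKC", "ＡＢＣ１２３"))  # ABC123
```

Both forms replace the ligature with plain "fi"; they differ only in whether the accent ends up composed or decomposed.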

unicodedata.category

This function returns a character's general category as a two-letter code. The categories are:

'Lu': Letter, uppercase
'Ll': Letter, lowercase
'Lt': Letter, titlecase
'Lm': Letter, modifier
'Lo': Letter, other
'Mn': Mark, nonspacing
'Mc': Mark, spacing combining
'Me': Mark, enclosing
'Nd': Number, decimal digit
'Nl': Number, letter
'No': Number, other
'Pc': Punctuation, connector
'Pd': Punctuation, dash
'Ps': Punctuation, open
'Pe': Punctuation, close
'Pi': Punctuation, initial quote
'Pf': Punctuation, final quote
'Po': Punctuation, other
'Sm': Symbol, math
'Sc': Symbol, currency
'Sk': Symbol, modifier
'So': Symbol, other
'Zs': Separator, space
'Zl': Separator, line
'Zp': Separator, paragraph
'Cc': Other, control
'Cf': Other, format
'Cs': Other, surrogate
'Co': Other, private use
'Cn': Other, not assigned

For example:

import unicodedata

print(unicodedata.category('A'))  # Lu
print(unicodedata.category('a'))  # Ll
print(unicodedata.category('1'))  # Nd
print(unicodedata.category('$'))  # Sc
print(unicodedata.category(' '))  # Zs

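'A' is an uppercase letter, so its category is 'Lu'; 'a' is a lowercase letter, so its category is 'Ll'; '1' is a decimal digit, so its category is 'Nd'; '$' is a currency symbol, so its category is 'Sc'; ' ' is a space separator, so its category is 'Zs'.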
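
A common use of category, combined with normalize, is stripping accents: decompose with NFD, drop every nonspacing mark ('Mn'), then recompose. The following is a minimal sketch of that idea; strip_accents is a made-up name, and filtering on 'Mn' alone is a simplifying assumption that may be too aggressive for scripts where such marks carry meaning.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks of category 'Mn' from text."""
    # NFD separates base letters from their combining marks
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing marks
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Recompose whatever can still be composed
    return unicodedata.normalize("NFC", kept)

print(strip_accents("lov\u00e9 caf\u00e9"))  # love cafe
print(unicodedata.category("\u0301"))        # Mn
```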

unicodedata.lookup

Some Unicode characters have names, and a character can be looked up by its name.

import unicodedata
print(
    unicodedata.lookup("LATIN SMALL LETTER A"),
    unicodedata.lookup("COPYRIGHT SIGN"),
    unicodedata.lookup("PEACH"),
)  # a © 🍑

If the given name is not a valid Unicode character name, a KeyError is raised.
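
When the name comes from user input, the KeyError can be turned into a fallback value. A small sketch, where find_char is a hypothetical wrapper and not part of unicodedata:

```python
import unicodedata

def find_char(name, default=None):
    """Return the character with the given Unicode name, or default if there is none."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        # lookup raises KeyError for names that are not in the database
        return default

print(find_char("COPYRIGHT SIGN"))             # ©
print(find_char("NOT A REAL CHARACTER NAME"))  # None
```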

unicodedata.name

The inverse of lookup: it returns the name of a character.

import unicodedata
print(unicodedata.name("z"))  # LATIN SMALL LETTER Z
print(unicodedata.name("@"))  # COMMERCIAL AT
print(unicodedata.name("🍑"))  # PEACH

This one is simple. If the character has no name, a ValueError is raised; alternatively, a default value can be passed as the second argument.
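
Control characters such as the newline are a typical case of characters without a name. The sketch below shows the default argument in action; safe_name and the "<no name>" placeholder are assumptions made for this example.

```python
import unicodedata

def safe_name(char, default="<no name>"):
    """Return the Unicode name of char, or default when it has none."""
    return unicodedata.name(char, default)

print(safe_name("z"))   # LATIN SMALL LETTER Z
print(safe_name("\n"))  # <no name>

# For a named character, lookup undoes name
print(unicodedata.lookup(unicodedata.name("@")) == "@")  # True
```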

unicodedata.numeric

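Converts a Unicode character to its numeric value, returned as a float. If the character has no numeric value, the default is returned when one is given; otherwise a ValueError is raised.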

import unicodedata

print(unicodedata.numeric("零"))  # 0.0
print(unicodedata.numeric("〇"))  # 0.0
print(unicodedata.numeric("一"))  # 1.0
print(unicodedata.numeric("貳"))  # 2.0
print(unicodedata.numeric("叁"))  # 3.0
print(unicodedata.numeric("四"))  # 4.0
print(unicodedata.numeric("伍"))  # 5.0
print(unicodedata.numeric("⑥"))  # 6.0
print(unicodedata.numeric("漆"))  # 7.0
print(unicodedata.numeric("捌"))  # 8.0
print(unicodedata.numeric("玖"))  # 9.0
print(unicodedata.numeric("拾"))  # 10.0


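text = "加 v 壹捌⑤壹零貳 捌⑥捌〇②，看 Python"

def chr_to_num(char):
    try:
        value = unicodedata.numeric(char)
    except ValueError:
        # Not a numeric character: keep it unchanged
        return char
    # Write whole numbers without the trailing ".0" (so 拾 becomes "10", not "1");
    # leave fractions such as ½ untouched
    return str(int(value)) if value.is_integer() else char

print(
    "".join(map(chr_to_num, text))
)  # 加 v 185102 86802，看 Python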
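
The values returned by numeric are not always single digits, and the default argument avoids the try/except when a fallback is acceptable. A short sketch; numeric_or_none is a hypothetical helper name.

```python
import unicodedata

def numeric_or_none(char):
    """Return the numeric value of char as a float, or None if it has none."""
    return unicodedata.numeric(char, None)

print(numeric_or_none("\u00bd"))  # ½ -> 0.5
print(numeric_or_none("\u216b"))  # Ⅻ (ROMAN NUMERAL TWELVE) -> 12.0
print(numeric_or_none("a"))       # None
```

Because fractions and multi-digit values are possible, code that maps characters to digits should decide explicitly what to do with them.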

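There are a few more functions in the module that I haven't covered because, in my view, they are rarely needed. The most useful one here is probably normalize. For more details, see the official documentation.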
