Five Ways to Implement Sensitive-Word Filtering in Python
1. replace substitution
replace is the simplest form of string substitution: when a piece of text may contain a sensitive word, we simply call str.replace to substitute the word with asterisks.
Drawback:
Works fine while the text and the word list are small, but efficiency drops off as they grow.
Example code:
text = '我是一個來自星星的超人,具有超人本領!'
text = text.replace("超人", '*' * len("超人")).replace("星星", '*' * len("星星"))
print(text)  # 我是一個來自**的**,具有**本領!

With multiple sensitive words, put them in a list and replace them one by one.
Example code:
text = '我是一個來自星星的超人,具有超人本領!'
words = ['超人', '星星']
for word in words:
    text = text.replace(word, '*' * len(word))
print(text)  # 我是一個來自**的**,具有**本領!
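The loop above can also be folded into a single expression; a minimal sketch using functools.reduce (equivalent to the loop, not something the article requires):

```python
from functools import reduce

def mask_all(text, words):
    # Fold the word list over the text, replacing each word
    # with an equal-length run of '*'.
    return reduce(lambda t, w: t.replace(w, '*' * len(w)), words, text)

print(mask_all('我是一個來自星星的超人,具有超人本領!', ['超人', '星星']))
# 我是一個來自**的**,具有**本領!
```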

2. Regular expressions
Regular expressions are a simple and effective way to match and filter sensitive words quickly. Here we mainly rely on "|", which matches any one of several alternative patterns.
Example code:
import re

def filter_words(text, words):
    # Join the words into a single alternation pattern: '超人|星星'
    pattern = '|'.join(words)
    return re.sub(pattern, '***', text)

if __name__ == '__main__':
    text = '我是一個來自星星的超人,具有超人本領!'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一個來自***的***,具有***本領!
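Two refinements worth considering (these are additions to the snippet above, not part of it): re.escape keeps words containing regex metacharacters from being misread as patterns, and passing a callable to re.sub makes the mask length track each matched word:

```python
import re

def filter_words(text, words):
    # Escape each word so regex metacharacters are matched literally,
    # then join the alternatives with '|'.
    pattern = '|'.join(map(re.escape, words))
    # re.sub accepts a callable; m.group() is the matched word,
    # so the mask length tracks the word length.
    return re.sub(pattern, lambda m: '*' * len(m.group()), text)

print(filter_words('我是一個來自星星的超人,具有超人本領!', ['超人', '星星']))
# 我是一個來自**的**,具有**本領!
```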

3. The third-party library ahocorasick
Install the library:
pip install pyahocorasick

Example code:
import ahocorasick

def filter_words(text, words):
    A = ahocorasick.Automaton()
    for index, word in enumerate(words):
        A.add_word(word, (index, word))
    A.make_automaton()
    result = []
    # A.iter yields (end_index, value) for every match in text
    for end_index, (insert_order, original_value) in A.iter(text):
        start_index = end_index - len(original_value) + 1
        result.append((start_index, end_index))
    # Replace from right to left so earlier indices stay valid
    for start_index, end_index in result[::-1]:
        text = text[:start_index] + '*' * (end_index - start_index + 1) + text[end_index + 1:]
    return text

if __name__ == '__main__':
    text = '我是一個來自星星的超人,具有超人本領!'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一個來自**的**,具有**本領!
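One caveat that applies to this method, the trie, and the DFA alike: if the word list contains overlapping entries (say '超' and '超人'), the collected (start, end) spans can overlap, and masking them one by one then garbles offsets. A defensive sketch that merges spans before masking (hypothetical helper names, not part of the article's code):

```python
def merge_intervals(intervals):
    # Merge overlapping (start, end) pairs; both ends are inclusive.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def mask(text, intervals):
    # Replace right-to-left so earlier offsets stay valid.
    for start, end in reversed(merge_intervals(intervals)):
        text = text[:start] + '*' * (end - start + 1) + text[end + 1:]
    return text

print(mask('abcdef', [(1, 2), (2, 4)]))  # a****f
```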

4. Trie (prefix tree)
A trie is an efficient structure that lets us match and filter sensitive words quickly.
Example code:
class TreeNode:
    def __init__(self):
        self.children = {}
        self.is_end = False

class Tree:
    def __init__(self):
        self.root = TreeNode()

    def insert(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                node.children[char] = TreeNode()
            node = node.children[char]
        node.is_end = True

    def search(self, word):
        node = self.root
        for char in word:
            if char not in node.children:
                return False
            node = node.children[char]
        return node.is_end

def filter_words(text, words):
    tree = Tree()
    for word in words:
        tree.insert(word)
    result = []
    # Try to match a word starting at every position i
    for i in range(len(text)):
        node = tree.root
        for j in range(i, len(text)):
            if text[j] not in node.children:
                break
            node = node.children[text[j]]
            if node.is_end:
                result.append((i, j))
    # Replace from right to left so earlier indices stay valid
    for start_index, end_index in result[::-1]:
        text = text[:start_index] + '*' * (end_index - start_index + 1) + text[end_index + 1:]
    return text

if __name__ == '__main__':
    text = '我是一個來自星星的超人,具有超人本領!'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一個來自**的**,具有**本領!
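The TreeNode/Tree pair can also be collapsed into plain nested dicts, a common lightweight way to write the same trie; a minimal sketch (illustrative only, with a sentinel key standing in for is_end):

```python
END = '\0'  # sentinel key marking the end of a complete word

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for char in word:
            # setdefault walks to the child node, creating it if missing
            node = node.setdefault(char, {})
        node[END] = True  # a word ends at this node
    return root

def contains(trie, word):
    node = trie
    for char in word:
        if char not in node:
            return False
        node = node[char]
    return END in node

trie = build_trie(['超人', '星星'])
print(contains(trie, '超人'))  # True
print(contains(trie, '超'))    # False: a prefix only, not a full word
```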

5. The DFA algorithm
The DFA, short for Deterministic Finite Automaton, is another efficient approach. Its basic idea is to detect sensitive words through state transitions: the text only needs to be scanned once to detect every sensitive word. (The example below also builds Aho-Corasick-style fail links so the scan never has to back up.)
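The state-transition idea can be sketched with a plain transition table before reading the full example; this simplified single-pass scan omits the fail links and simply restarts after each match (all names here are illustrative):

```python
def build_machine(words):
    # Each dict is a state; keys are characters, the 'end' key marks a match.
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node['end'] = True
    return root

def scan(text, machine):
    # Scan once; at each position, walk the machine as far as it matches.
    spans = []
    i = 0
    while i < len(text):
        node, j, hit = machine, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if 'end' in node:
                hit = j  # longest match ending just before j
        if hit:
            spans.append((i, hit - 1))
            i = hit  # jump past the match
        else:
            i += 1
    return spans

text = '我是一個來自星星的超人,具有超人本領!'
spans = scan(text, build_machine(['超人', '星星']))
for s, e in reversed(spans):
    text = text[:s] + '*' * (e - s + 1) + text[e + 1:]
print(text)  # 我是一個來自**的**,具有**本領!
```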
Example code:
class DFA:
    def __init__(self, words):
        self.words = words
        self.build()

    def build(self):
        self.transitions = {}
        self.fails = {}
        self.outputs = {}
        state = 0
        # Phase 1: build the goto (trie) transitions
        for word in self.words:
            current_state = 0
            for char in word:
                next_state = self.transitions.get((current_state, char), None)
                if next_state is None:
                    state += 1
                    self.transitions[(current_state, char)] = state
                    current_state = state
                else:
                    current_state = next_state
            # A state may accept several words, so keep them in a list
            self.outputs.setdefault(current_state, []).append(word)
        # Phase 2: build the fail links breadth-first (Aho-Corasick style)
        queue = []
        for (start_state, char), next_state in self.transitions.items():
            if start_state == 0:
                queue.append(next_state)
                self.fails[next_state] = 0
        while queue:
            r_state = queue.pop(0)
            for (state, char), next_state in self.transitions.items():
                if state == r_state:
                    queue.append(next_state)
                    fail_state = self.fails[state]
                    while (fail_state, char) not in self.transitions and fail_state != 0:
                        fail_state = self.fails[fail_state]
                    self.fails[next_state] = self.transitions.get((fail_state, char), 0)
                    # Words that match at the fail state also end here
                    if self.fails[next_state] in self.outputs:
                        self.outputs.setdefault(next_state, []).extend(self.outputs[self.fails[next_state]])

    def search(self, text):
        state = 0
        result = []
        for i, char in enumerate(text):
            # Follow fail links until a transition exists (or we reach the root)
            while (state, char) not in self.transitions and state != 0:
                state = self.fails[state]
            state = self.transitions.get((state, char), 0)
            for word in self.outputs.get(state, ()):
                result.append((i - len(word) + 1, i))
        return result

def filter_words(text, words):
    dfa = DFA(words)
    result = dfa.search(text)
    # Replace from right to left so earlier indices stay valid
    for start_index, end_index in result[::-1]:
        text = text[:start_index] + '*' * (end_index - start_index + 1) + text[end_index + 1:]
    return text

if __name__ == '__main__':
    text = '我是一個來自星星的超人,具有超人本領!'
    words = ['超人', '星星']
    res = filter_words(text, words)
    print(res)  # 我是一個來自**的**,具有**本領!

This concludes the article on five ways to implement sensitive-word filtering in Python; for more on Python sensitive-word filtering, search 腳本之家's earlier articles.