快捷導(dǎo)航

python 實(shí)現(xiàn)敏感詞過(guò)濾的方法

更新時(shí)間：2019年01月21日 10:08:57 作者：isoleo

今天小編就為大家分享一篇python 實(shí)現(xiàn)敏感詞過(guò)濾的方法，具有很好的參考價(jià)值，希望對(duì)大家有所幫助。一起跟隨小編過(guò)來(lái)看看吧

如下所示：

#!/usr/bin/python2.6  
# -*- coding: utf-8 -*- 
import time 
class Node(object): 
  def __init__(self): 
    self.children = None 
 
# The encode of word is UTF-8 
def add_word(root,word): 
  node = root 
  for i in range(len(word)): 
    if node.children == None: 
      node.children = {} 
      node.children[word[i]] = Node() 
 
    elif word[i] not in node.children: 
      node.children[word[i]] = Node() 
 
    node = node.children[word[i]] 
 
def init(path): 
  root = Node() 
  fp = open(path,'r') 
  for line in fp: 
    line = line[0:-1] 
    #print len(line) 
    #print line 
    #print type(line) 
    add_word(root,line) 
  fp.close() 
  return root 
 
# The encode of word is UTF-8 
# The encode of message is UTF-8 
def is_contain(message, root): 
  for i in range(len(message)): 
    p = root 
    j = i 
    while (j<len(message) and p.children!=None and message[j] in p.children): 
      p = p.children[message[j]] 
      j = j + 1 
 
    if p.children==None: 
      #print '---word---',message[i:j] 
      return True 
   
  return False 
 
 
 
def dfa(): 
  print '----------------dfa-----------' 
  root = init('/tmp/word.txt') 
 
  #message = '不顧' 
  print '***message***',len(message) 
  start_time = time.time() 
  for i in range(1000): 
    res = is_contain(message,root) 
    #print res 
  end_time = time.time() 
  print (end_time - start_time)  
 
def is_contain2(message,word_list): 
  for item in word_list: 
    if message.find(item)!=-1: 
      return True 
  return False 
 
def normal(): 
  print '------------normal--------------' 
  path = '/tmp/word.txt' 
  fp = open(path,'r') 
  word_list = [] 
  print '***message***',len(message) 
  for line in fp: 
    line = line[0:-1] 
    word_list.append(line) 
  fp.close() 
  print 'The count of word:',len(word_list) 
  start_time = time.time() 
  for i in range(1000): 
    res = is_contain2(message,word_list) 
    #print res 
  end_time = time.time() 
  print (end_time - start_time)  
 
 
if __name__ == '__main__': 
  dfa() 
  normal()

測(cè)試結(jié)果:

1) 敏感詞 100個(gè)

----------------dfa-----------
***message*** 224
0.325479984283
------------normal--------------
***message*** 224
The count of word: 100
0.107350111008

2) 敏感詞 1000 個(gè)

----------------dfa-----------
***message*** 224
0.324251890182
------------normal--------------
***message*** 224
The count of word: 1000
1.05939006805

從上面的實(shí)驗(yàn)我們可以看出，在DFA 算法只有在敏感詞較多的情況下，才有意義。在百來(lái)個(gè)敏感詞的情況下，甚至不如普通算法

下面從理論上推導(dǎo)時(shí)間復(fù)雜度，為了方便分析，首先假定消息文本是等長(zhǎng)的,長(zhǎng)度為lenA；每個(gè)敏感詞的長(zhǎng)度相同，長(zhǎng)度為lenB，敏感詞的個(gè)數(shù)是m。

1) DFA算法的核心是構(gòu)建一棵多叉樹(shù)，由于我們已經(jīng)假設(shè)，敏感詞的長(zhǎng)度相同，所以樹(shù)的最大深度為lenB，那么我們可以說(shuō)從消息文本的某個(gè)位置（字節(jié))開(kāi)始的某個(gè)子串是否在敏感詞樹(shù)中，最多只用經(jīng)過(guò)lenB次匹配.也就是說(shuō)判斷一個(gè)消息文本中是否有敏感詞的時(shí)間復(fù)雜度是lenA * lenB

2) 再來(lái)看看普通做法，是使用for循環(huán)，對(duì)每一個(gè)敏感詞，依次在消息文本中進(jìn)行查找，假定字符串是使用KMP算法,KMP算法的時(shí)間復(fù)雜度是O(lenA + lenB)

那么對(duì)m個(gè)敏感詞查找的時(shí)間復(fù)雜度是 (lenA + lenB ) * m

綜上所述，DFA 算法的時(shí)間復(fù)雜度基本上是與敏感詞的個(gè)數(shù)無(wú)關(guān)的。

以上這篇python 實(shí)現(xiàn)敏感詞過(guò)濾的方法就是小編分享給大家的全部?jī)?nèi)容了，希望能給大家一個(gè)參考，也希望大家多多支持腳本之家。