欧美bbbwbbbw肥妇,免费乱码人妻系列日韩,一级黄片

java使用Nagao算法實現(xiàn)新詞發(fā)現(xiàn)、熱門詞的挖掘

 更新時間:2015年07月29日 09:50:52   投稿:hebedich  
這篇文章主要介紹了java使用Nagao算法實現(xiàn)新詞發(fā)現(xiàn)、熱門詞的挖掘的思路和詳細代碼,需要的朋友可以參考下

采用Nagao算法統(tǒng)計各個子字符串的頻次,然后基于這些頻次統(tǒng)計每個字符串的詞頻、左右鄰個數(shù)、左右熵、交互信息(內(nèi)部凝聚度)。

名詞解釋:

  Nagao算法:一種快速的統(tǒng)計文本里所有子字符串頻次的算法。詳細算法可見http://www.doc88.com/p-664123446503.html
  詞頻:該字符串在文檔中出現(xiàn)的次數(shù)。出現(xiàn)次數(shù)越多越重要。
  左右鄰個數(shù):文檔中該字符串的左邊和右邊出現(xiàn)的不同的字的個數(shù)。左右鄰越多,說明字符串成詞概率越高。
  左右熵:文檔中該字符串的左邊和右邊出現(xiàn)的不同的字的數(shù)量分布的熵。類似上面的指標,有一定區(qū)別。
  交互信息:每次將某字符串分成兩部分,左半部分字符串和右半部分字符串,計算其同時出現(xiàn)的概率除于其各自獨立出現(xiàn)的概率,最后取所有的劃分里面概率最小值。這個值越大,說明字符串內(nèi)部凝聚度越高,越可能成詞。

算法具體流程:

1.  將輸入文件逐行讀入,按照非漢字([^\u4E00-\u9FA5]+)以及停詞“的很了么呢是嘛個都也比還這于不與才上用就好在和對挺去后沒說”,
分成一個個字符串,代碼如下:
String[] phrases = line.split("[^\u4E00-\u9FA5]+|["+stopwords+"]");
停用詞可以修改。
2.  獲取所有切分后的字符串的左子串和右子串,分別加入左、右PTable
3.  對PTable排序,并計算LTable。LTable記錄的是,排序后的PTable中,下一個子串同上一個子串具有相同字符的數(shù)量
4.  遍歷PTable和LTable,即可得到所有子字符串的詞頻、左右鄰
5.  根據(jù)所有子字符串的詞頻、左右鄰結(jié)果,輸出字符串的詞頻、左右鄰個數(shù)、左右熵、交互信息

1.  NagaoAlgorithm.java

package com.algo.word;
 
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
 
public class NagaoAlgorithm {
   
  private int N;
   
  private List<String> leftPTable;
  private int[] leftLTable;
  private List<String> rightPTable;
  private int[] rightLTable;
  private double wordNumber;
   
  private Map<String, TFNeighbor> wordTFNeighbor;
   
  private final static String stopwords = "的很了么呢是嘛個都也比還這于不與才上用就好在和對挺去后沒說";
   
  private NagaoAlgorithm(){
    //default N = 5
    N = 5;
    leftPTable = new ArrayList<String>();
    rightPTable = new ArrayList<String>();
    wordTFNeighbor = new HashMap<String, TFNeighbor>();
  }
  //reverse phrase
  private String reverse(String phrase) {
    StringBuilder reversePhrase = new StringBuilder();
    for (int i = phrase.length() - 1; i >= 0; i--)
      reversePhrase.append(phrase.charAt(i));
    return reversePhrase.toString();
  }
  //co-prefix length of s1 and s2
  private int coPrefixLength(String s1, String s2){
    int coPrefixLength = 0;
    for(int i = 0; i < Math.min(s1.length(), s2.length()); i++){
      if(s1.charAt(i) == s2.charAt(i))  coPrefixLength++;
      else break;
    }
    return coPrefixLength;
  }
  //add substring of line to pTable
  private void addToPTable(String line){
    //split line according to consecutive none Chinese character
    String[] phrases = line.split("[^\u4E00-\u9FA5]+|["+stopwords+"]");
    for(String phrase : phrases){
      for(int i = 0; i < phrase.length(); i++)
        rightPTable.add(phrase.substring(i));
      String reversePhrase = reverse(phrase);
      for(int i = 0; i < reversePhrase.length(); i++)
        leftPTable.add(reversePhrase.substring(i));
      wordNumber += phrase.length();
    }
  }
   
  //count lTable
  private void countLTable(){
    Collections.sort(rightPTable);
    rightLTable = new int[rightPTable.size()];
    for(int i = 1; i < rightPTable.size(); i++)
      rightLTable[i] = coPrefixLength(rightPTable.get(i-1), rightPTable.get(i));
     
    Collections.sort(leftPTable);
    leftLTable = new int[leftPTable.size()];
    for(int i = 1; i < leftPTable.size(); i++)
      leftLTable[i] = coPrefixLength(leftPTable.get(i-1), leftPTable.get(i));
     
    System.out.println("Info: [Nagao Algorithm Step 2]: having sorted PTable and counted left and right LTable");
  }
  //according to pTable and lTable, count statistical result: TF, neighbor distribution
  private void countTFNeighbor(){
    //get TF and right neighbor
    for(int pIndex = 0; pIndex < rightPTable.size(); pIndex++){
      String phrase = rightPTable.get(pIndex);
      for(int length = 1 + rightLTable[pIndex]; length <= N && length <= phrase.length(); length++){
        String word = phrase.substring(0, length);
        TFNeighbor tfNeighbor = new TFNeighbor();
        tfNeighbor.incrementTF();
        if(phrase.length() > length)
          tfNeighbor.addToRightNeighbor(phrase.charAt(length));
        for(int lIndex = pIndex+1; lIndex < rightLTable.length; lIndex++){
          if(rightLTable[lIndex] >= length){
            tfNeighbor.incrementTF();
            String coPhrase = rightPTable.get(lIndex);
            if(coPhrase.length() > length)
              tfNeighbor.addToRightNeighbor(coPhrase.charAt(length));
          }
          else break;
        }
        wordTFNeighbor.put(word, tfNeighbor);
      }
    }
    //get left neighbor
    for(int pIndex = 0; pIndex < leftPTable.size(); pIndex++){
      String phrase = leftPTable.get(pIndex);
      for(int length = 1 + leftLTable[pIndex]; length <= N && length <= phrase.length(); length++){
        String word = reverse(phrase.substring(0, length));
        TFNeighbor tfNeighbor = wordTFNeighbor.get(word);
        if(phrase.length() > length)
          tfNeighbor.addToLeftNeighbor(phrase.charAt(length));
        for(int lIndex = pIndex + 1; lIndex < leftLTable.length; lIndex++){
          if(leftLTable[lIndex] >= length){
            String coPhrase = leftPTable.get(lIndex);
            if(coPhrase.length() > length)
              tfNeighbor.addToLeftNeighbor(coPhrase.charAt(length));
          }
          else break;
        }
      }
    }
    System.out.println("Info: [Nagao Algorithm Step 3]: having counted TF and Neighbor");
  }
  //according to wordTFNeighbor, count MI of word
  private double countMI(String word){
    if(word.length() <= 1)  return 0;
    double coProbability = wordTFNeighbor.get(word).getTF()/wordNumber;
    List<Double> mi = new ArrayList<Double>(word.length());
    for(int pos = 1; pos < word.length(); pos++){
      String leftPart = word.substring(0, pos);
      String rightPart = word.substring(pos);
      double leftProbability = wordTFNeighbor.get(leftPart).getTF()/wordNumber;
      double rightProbability = wordTFNeighbor.get(rightPart).getTF()/wordNumber;
      mi.add(coProbability/(leftProbability*rightProbability));
    }
    return Collections.min(mi);
  }
  //save TF, (left and right) neighbor number, neighbor entropy, mutual information
  private void saveTFNeighborInfoMI(String out, String stopList, String[] threshold){
    try {
      //read stop words file
      Set<String> stopWords = new HashSet<String>();
      BufferedReader br = new BufferedReader(new FileReader(stopList));
      String line;
      while((line = br.readLine()) != null){
        if(line.length() > 1)
          stopWords.add(line);
      }
      br.close();
      //output words TF, neighbor info, MI
      BufferedWriter bw = new BufferedWriter(new FileWriter(out));
      for(Map.Entry<String, TFNeighbor> entry : wordTFNeighbor.entrySet()){
        if( entry.getKey().length() <= 1 || stopWords.contains(entry.getKey()) ) continue;
        TFNeighbor tfNeighbor = entry.getValue();
         
         
        int tf, leftNeighborNumber, rightNeighborNumber;
        double mi;
        tf = tfNeighbor.getTF();
        leftNeighborNumber = tfNeighbor.getLeftNeighborNumber();
        rightNeighborNumber = tfNeighbor.getRightNeighborNumber();
        mi = countMI(entry.getKey());
        if(tf > Integer.parseInt(threshold[0]) && leftNeighborNumber > Integer.parseInt(threshold[1]) && 
            rightNeighborNumber > Integer.parseInt(threshold[2]) && mi > Integer.parseInt(threshold[3]) ){
          StringBuilder sb = new StringBuilder();
          sb.append(entry.getKey());
          sb.append(",").append(tf);
          sb.append(",").append(leftNeighborNumber);
          sb.append(",").append(rightNeighborNumber);
          sb.append(",").append(tfNeighbor.getLeftNeighborEntropy());
          sb.append(",").append(tfNeighbor.getRightNeighborEntropy());
          sb.append(",").append(mi).append("\n");
          bw.write(sb.toString());
        }
      }
      bw.close();
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    System.out.println("Info: [Nagao Algorithm Step 4]: having saved to file");
  }
  //apply nagao algorithm to input file
  public static void applyNagao(String[] inputs, String out, String stopList){
    NagaoAlgorithm nagao = new NagaoAlgorithm();
    //step 1: add phrases to PTable
    String line;
    for(String in : inputs){
      try {
        BufferedReader br = new BufferedReader(new FileReader(in));
        while((line = br.readLine()) != null){
          nagao.addToPTable(line);
        }
        br.close();
      } catch (IOException e) {
        throw new RuntimeException();
      }
    }
    System.out.println("Info: [Nagao Algorithm Step 1]: having added all left and right substrings to PTable");
    //step 2: sort PTable and count LTable
    nagao.countLTable();
    //step3: count TF and Neighbor
    nagao.countTFNeighbor();
    //step4: save TF NeighborInfo and MI
    nagao.saveTFNeighborInfoMI(out, stopList, "20,3,3,5".split(","));
  }
  public static void applyNagao(String[] inputs, String out, String stopList, int n, String filter){
    NagaoAlgorithm nagao = new NagaoAlgorithm();
    nagao.setN(n);
    String[] threshold = filter.split(",");
    if(threshold.length != 4){
      System.out.println("ERROR: filter must have 4 numbers, seperated with ',' ");
      return;
    }
    //step 1: add phrases to PTable
    String line;
    for(String in : inputs){
      try {
        BufferedReader br = new BufferedReader(new FileReader(in));
        while((line = br.readLine()) != null){
          nagao.addToPTable(line);
        }
        br.close();
      } catch (IOException e) {
        throw new RuntimeException();
      }
    }
    System.out.println("Info: [Nagao Algorithm Step 1]: having added all left and right substrings to PTable");
    //step 2: sort PTable and count LTable
    nagao.countLTable();
    //step3: count TF and Neighbor
    nagao.countTFNeighbor();
    //step4: save TF NeighborInfo and MI
    nagao.saveTFNeighborInfoMI(out, stopList, threshold);
  }
  private void setN(int n){
    N = n;
  }
   
  public static void main(String[] args) {
    String[] ins = {"E://test//ganfen.txt"};
    applyNagao(ins, "E://test//out.txt", "E://test//stoplist.txt");
  }
 
}

2. TFNeighbor.java

package com.algo.word;
 
import java.util.HashMap;
import java.util.Map;
 
public class TFNeighbor {
 
  private int tf;
  private Map<Character, Integer> leftNeighbor;
  private Map<Character, Integer> rightNeighbor;
   
  TFNeighbor(){
    leftNeighbor = new HashMap<Character, Integer>();
    rightNeighbor = new HashMap<Character, Integer>();
  }
  //add word to leftNeighbor
  public void addToLeftNeighbor(char word){
    //leftNeighbor.put(word, 1 + leftNeighbor.getOrDefault(word, 0));
    Integer number = leftNeighbor.get(word);
    leftNeighbor.put(word, number == null? 1: 1+number);
  }
  //add word to rightNeighbor
  public void addToRightNeighbor(char word){
    //rightNeighbor.put(word, 1 + rightNeighbor.getOrDefault(word, 0));
    Integer number = rightNeighbor.get(word);
    rightNeighbor.put(word, number == null? 1: 1+number);
  }
  //increment tf
  public void incrementTF(){
    tf++;
  }
  public int getLeftNeighborNumber(){
    return leftNeighbor.size();
  }
  public int getRightNeighborNumber(){
    return rightNeighbor.size();
  }
  public double getLeftNeighborEntropy(){
    double entropy = 0;
    int sum = 0;
    for(int number : leftNeighbor.values()){
      entropy += number*Math.log(number);
      sum += number;
    }
    if(sum == 0)  return 0;
    return Math.log(sum) - entropy/sum;
  }
  public double getRightNeighborEntropy(){
    double entropy = 0;
    int sum = 0;
    for(int number : rightNeighbor.values()){
      entropy += number*Math.log(number);
      sum += number;
    }
    if(sum == 0)  return 0;
    return Math.log(sum) - entropy/sum;
  }
  public int getTF(){
    return tf;
  }
}

3. Main.java

package com.algo.word;
 
public class Main {
 
  public static void main(String[] args) {
     
    //if 3 arguments, first argument is input files splitting with ','
    //second argument is output file
    //output 7 columns split with ',' , like below:
    //word, term frequency, left neighbor number, right neighbor number, left neighbor entropy, right neighbor entropy, mutual information
    //third argument is stop words list
    if(args.length == 3)
      NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2]);
     
    //if 4 arguments, forth argument is the NGram parameter N
    //5th argument is threshold of output words, default is "20,3,3,5"
    //output TF > 20 && (left | right) neighbor number > 3 && MI > 5
    else if(args.length == 5)
      NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2], Integer.parseInt(args[3]), args[4]);
     
     
  }
 
}

以上所述就是本文的全部內(nèi)容了,希望大家能夠喜歡。

相關(guān)文章

  • Java初始化塊及執(zhí)行過程解析

    Java初始化塊及執(zhí)行過程解析

    這篇文章主要介紹了Java初始化塊及執(zhí)行過程解析,文中通過示例代碼介紹的非常詳細,對大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價值,需要的朋友可以參考下
    2019-09-09
  • 淺聊一下Spring中Bean的配置細節(jié)

    淺聊一下Spring中Bean的配置細節(jié)

    我們知道,當寫完一個普通的 Java 類后,想讓 Spring IoC 容器在創(chuàng)建類的實例對象時使用構(gòu)造方法完成實例對象的依賴注入,那么就需要在配置元數(shù)據(jù)中寫好類的 Bean 定義,包括各種標簽的屬性。所以本文我們來說說這其中的配置細節(jié),需要的朋友可以參考下
    2023-07-07
  • springboot log4j2不能打印框架錯誤日志的解決方案

    springboot log4j2不能打印框架錯誤日志的解決方案

    這篇文章主要介紹了springboot log4j2不能打印框架錯誤日志的解決方案,具有很好的參考價值,希望對大家有所幫助。如有錯誤或未考慮完全的地方,望不吝賜教
    2021-08-08
  • mybatis如何實現(xiàn)繼承映射

    mybatis如何實現(xiàn)繼承映射

    這篇文章主要介紹了mybatis如何實現(xiàn)繼承映射的操作,具有很好的參考價值,希望對大家有所幫助。如有錯誤或未考慮完全的地方,望不吝賜教
    2021-08-08
  • IntelliJ IDEA2021.1 配置大全(超詳細教程)

    IntelliJ IDEA2021.1 配置大全(超詳細教程)

    這篇文章主要介紹了IntelliJ IDEA2021.1 配置大全(超詳細教程),需要的朋友可以參考下
    2021-04-04
  • Java單例模式的講解

    Java單例模式的講解

    今天小編就為大家分享一篇關(guān)于Java單例模式的講解,小編覺得內(nèi)容挺不錯的,現(xiàn)在分享給大家,具有很好的參考價值,需要的朋友一起跟隨小編來看看吧
    2019-01-01
  • Springboot事務(wù)失效的幾種情況解讀

    Springboot事務(wù)失效的幾種情況解讀

    這篇文章主要介紹了Springboot事務(wù)失效的幾種情況解讀,因為Spring AOP默認使用動態(tài)代理,會給被代理的類生成一個代理類,事務(wù)相關(guān)的操作都通過代理來完成,使用內(nèi)部方法調(diào)用時,使用的是實例調(diào)用,沒有通過代理類調(diào)用方法,因此事務(wù)不會檢測到失敗,需要的朋友可以參考下
    2023-10-10
  • java用戶管理注冊功能 含前后臺代碼

    java用戶管理注冊功能 含前后臺代碼

    這篇文章主要介紹了java用戶管理注冊功能,含前端和后臺代碼,具有一定的參考價值,感興趣的小伙伴們可以參考一下
    2016-10-10
  • mybatis if test判斷BigDecimal遇到的坑及解決

    mybatis if test判斷BigDecimal遇到的坑及解決

    這篇文章主要介紹了mybatis if test判斷BigDecimal遇到的坑及解決方案,具有很好的參考價值,希望對大家有所幫助。如有錯誤或未考慮完全的地方,望不吝賜教
    2023-03-03
  • 深入淺析MyBatis foreach標簽

    深入淺析MyBatis foreach標簽

    Mybatis foreach 標簽用于循環(huán)語句,它很好的支持了數(shù)據(jù)和 List、set 接口的集合,并對此提供遍歷的功能,本文給大家介紹MyBatis foreach標簽的相關(guān)知識,感興趣的朋友一起看看吧
    2021-09-09

最新評論