
Chinese Word Segmentation with IKAnalyzer and Lucene (Worked Example)

Updated: 2017-10-13 09:03:51 Author: funnyboy0128


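1. Introduction

As word segmentation is used more and more widely in information retrieval, the technique is familiar to most developers. English text is relatively simple to segment: splitting the text into words, removing stop words, and stemming is basically enough. Chinese is harder, because the complexity of its semantics means there is no equally simple procedure, so segmentation is usually done with a dedicated tool. Commonly used ones include the Paoding analyzer and IKAnalyzer. Here we use a simple demo to walk through the basic usage of IKAnalyzer. IKAnalyzer is an open-source word segmentation toolkit written in Java. It is independent of the Lucene project, and it also provides a default implementation for Lucene.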
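The three English steps mentioned above (split into words, remove stop words, stem) can be sketched with the standard library alone. This is only an illustration: the class name SimpleEnglishTokenizer, the tiny stop word list, and the suffix-stripping rule are all made up for this example, and the "stemmer" is far cruder than a real algorithm such as Porter's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy English pipeline: split into words, drop stop words, strip a few suffixes.
class SimpleEnglishTokenizer {
    // Hypothetical, deliberately tiny stop word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "are", "of", "and", "to"));

    // Very naive stemming: strips at most one common suffix (not the Porter algorithm).
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase, then split on anything that is not an ASCII letter or digit.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(stem(raw));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [analyzer, search, index, document]
        System.out.println(tokenize("The analyzer is searching indexed documents"));
    }
}
```

Nothing comparable works for Chinese, because the text has no spaces to split on. That is the gap a dictionary-based tool such as IKAnalyzer fills.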

2. Simple Chinese word segmentation with IKAnalyzer and Lucene

We walk through a basic demo. The steps are as follows:

step1: Prepare the required jar dependencies, lucene-core-5.1.0.jar and ik.jar, then create a new project and add them to it. The project structure is as follows:

IkDemo
 -src
   -con.funnyboy.ik
   -IKAnalyzer.cfg.xml
   -stopword.dic
   -ext.dic
 -Reference Libraries
   -lucene-core-5.1.0.jar
   -ik.jar

IKAnalyzer.cfg.xml configures the extension dictionary and the stop word dictionary. Its contents are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">ext.dic;</entry>
    <entry key="ext_stopwords">stopword.dic;</entry>
</properties>

Here ext.dic holds your own extension dictionary, and stopword.dic holds your own extension stop word dictionary.
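The DOCTYPE in this file is the one used by Java's XML properties format, so the file can be inspected with java.util.Properties.loadFromXML. The sketch below parses an inline copy of the configuration and splits each entry on ";". The class name IkConfigReader and its helper methods are hypothetical, and the sketch only shows how the file format can be read; it makes no claim about how IKAnalyzer loads it internally.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Reads an IKAnalyzer.cfg.xml-style document with the JDK's XML properties parser.
class IkConfigReader {
    // Inline copy of the configuration shown above (comment text in English).
    static final String SAMPLE_XML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
            + "<properties>\n"
            + "  <comment>IK Analyzer extension configuration</comment>\n"
            + "  <entry key=\"ext_dict\">ext.dic;</entry>\n"
            + "  <entry key=\"ext_stopwords\">stopword.dic;</entry>\n"
            + "</properties>\n";

    // Parses the XML text into a Properties object.
    static Properties parse(String xml) {
        Properties props = new Properties();
        try {
            props.loadFromXML(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits a ';'-separated entry value into trimmed, non-empty file names.
    static List<String> dictionaryFiles(Properties props, String key) {
        List<String> files = new ArrayList<>();
        for (String part : props.getProperty(key, "").split(";")) {
            String name = part.trim();
            if (!name.isEmpty()) {
                files.add(name);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE_XML);
        System.out.println("ext_dict      = " + dictionaryFiles(props, "ext_dict"));
        System.out.println("ext_stopwords = " + dictionaryFiles(props, "ext_stopwords"));
    }
}
```

A quick check like this can help confirm that the keys and file names in the configuration are spelled as intended before debugging the analyzer itself.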

step2: Verify the setup with Java code

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class MyIkTest {
    public static String str = "中國人民銀行我是中國人";

    public static void main(String[] args) {
        MyIkTest test = new MyIkTest();
        test.wordCount("", str);
    }

    // Segments content with IKAnalyzer and prints each term with its count, most frequent first.
    private void wordCount(String fieldName, String content) {
        // true: maximum-word-length segmentation; false: finest-grained segmentation
        try (Analyzer analyzer = new IKAnalyzer(true);
             TokenStream ts = analyzer.tokenStream(fieldName, new StringReader(content))) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset(); // must be called before the first incrementToken()

            // Count how often each term occurs.
            Map<String, Integer> counts = new HashMap<String, Integer>();
            while (ts.incrementToken()) {
                String word = term.toString();
                Integer count = counts.get(word);
                counts.put(word, count == null ? 1 : count + 1);
            }
            ts.end(); // signal end of stream before it is closed

            // Sort the entries by descending count.
            List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(counts.entrySet());
            Collections.sort(list, new Comparator<Entry<String, Integer>>() {
                public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                    return o2.getValue().compareTo(o1.getValue());
                }
            });
            for (Entry<String, Integer> entry : list) {
                System.err.println(entry.getKey() + "[" + entry.getValue() + "]");
            }
        } catch (IOException e) {
            e.printStackTrace(); // report analysis failures instead of ignoring them
        }
    }
}

Running the program produces the following output:

中國人民銀行[1]
中國人[1]
我[1]

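3. Configuration notes

a. How to customize the extension dictionary and the stop word dictionary

IKAnalyzer.cfg.xml defines the extension dictionary and the stop word dictionary. If there is more than one of either, separate the file names with ;. The extension dictionary lets users have text segmented according to words they define themselves. For example, personal names are not covered by the default dictionary, so to have them recognized as words you add them to ext.dic. Stop words are words that occur very frequently but carry no real meaning for segmentation, such as the modal particles 嗎 and 乎. Users can also define their own stop words in stopword.dic.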
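To make the effect of the two dictionaries concrete, here is a toy dictionary-based segmenter. It uses plain forward maximum matching, which is a swapped-in textbook technique and not IKAnalyzer's actual algorithm. The class ToyDictionarySegmenter, the sample sentence, and the name 張小明 are all hypothetical; the in-memory word sets merely stand in for the built-in dictionary, ext.dic, and stopword.dic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy segmenter: forward maximum matching over an in-memory dictionary.
class ToyDictionarySegmenter {
    private final Set<String> dictionary = new HashSet<>();
    private final Set<String> stopWords = new HashSet<>();
    private int maxWordLength = 1;

    void addWords(String... words) {
        for (String w : words) {
            dictionary.add(w);
            maxWordLength = Math.max(maxWordLength, w.length());
        }
    }

    void addStopWords(String... words) {
        stopWords.addAll(Arrays.asList(words));
    }

    // At each position take the longest dictionary word, falling back to a
    // single character; tokens that are stop words are dropped from the result.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            String token = text.substring(pos, end);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToyDictionarySegmenter seg = new ToyDictionarySegmenter();
        seg.addWords("中國人");      // stands in for the built-in dictionary
        seg.addStopWords("嗎");      // stands in for an entry in stopword.dic
        String text = "張小明是中國人嗎";

        // The name is unknown, so it falls apart into single characters.
        System.out.println(seg.segment(text)); // [張, 小, 明, 是, 中國人]

        seg.addWords("張小明");      // stands in for an entry added to ext.dic
        System.out.println(seg.segment(text)); // [張小明, 是, 中國人]
    }
}
```

In both runs the particle 嗎 never appears in the output, which is the role of the stop word dictionary, and the name is kept whole only once it has been added as an extension word.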

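b. The difference between maximum-word-length and finest-grained segmentation

The IKAnalyzer constructor accepts a flag that selects between maximum-word-length segmentation and finest-grained segmentation: true selects maximum-word-length segmentation, and the default is finest-grained segmentation. The results for "中國人民銀行我是中國人" in each mode are as follows:

Maximum-word-length segmentation result:

中國人民銀行[1]
中國人[1]
我[1]

Finest-grained segmentation result: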

國人[2]
中國人[2]
中國[2]
人民[1]
中國人民銀行[1]
我[1]
人民銀行[1]
中國人民[1]
銀行[1]
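The difference between the two modes can be imitated with a toy model: finest-grained emits every dictionary word found at any position, while maximum-word-length keeps only the longest word at each position and then jumps past it. This is a simplified sketch and not IKAnalyzer's implementation. The class ToySegmentationModes and its mini dictionary are hypothetical, chosen so that the toy reproduces the terms listed above for the sample sentence. Characters that start no dictionary word (here 是) are simply skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the two segmentation modes over a tiny in-memory dictionary.
class ToySegmentationModes {
    // Hypothetical mini dictionary covering the sample sentence.
    static final Set<String> DICTIONARY = new HashSet<>(Arrays.asList(
            "中國", "國人", "中國人", "人民", "中國人民",
            "人民銀行", "銀行", "中國人民銀行", "我"));
    static final int MAX_LEN = 6; // length of the longest dictionary word

    // Finest-grained: emit every dictionary word that starts at any position.
    static List<String> finestGrained(String text) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            int limit = Math.min(text.length(), start + MAX_LEN);
            for (int end = start + 1; end <= limit; end++) {
                String candidate = text.substring(start, end);
                if (DICTIONARY.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    // Maximum word length: keep only the longest word at each position, then
    // continue after it; positions where no word starts are skipped.
    static List<String> maxWordLength(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + MAX_LEN);
            while (end > pos && !DICTIONARY.contains(text.substring(pos, end))) {
                end--;
            }
            if (end == pos) { // no dictionary word starts here
                pos++;
                continue;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "中國人民銀行我是中國人";
        System.out.println(maxWordLength(text)); // [中國人民銀行, 我, 中國人]
        System.out.println(finestGrained(text)); // 12 overlapping terms
    }
}
```

Counting the 12 terms from finestGrained gives 中國, 中國人, and 國人 twice each and the remaining six terms once each, the same counts as in the finest-grained listing above.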
