Extracting Keywords from Mixed Chinese-English Text with C# and Jieba.NET

Updated: 2025-03-12 09:01:02  Author: 老胖閑聊

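Jieba.NET is a word segmentation library for C#. It is a .NET port of jieba, an efficient Chinese word segmentation tool originally written in Python, and it supports full mode, accurate mode, and search engine mode. This article shows how to use C# and Jieba.NET to extract keywords from text that mixes Chinese and English.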

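Implementation Steps

Create a Windows Forms application
Add the following controls:
- TextBox: text input (multi-line)
- Button: triggers segmentation
- ListBox: displays the keywords and their frequencies
Install the NuGet package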

Install-Package jieba.NET

Complete Code

using System;
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;
using JiebaNet.Segmenter;
using System.Text.RegularExpressions;

public partial class MainForm : Form
{
    private TextBox inputBox;
    private Button analyzeButton;
    private ListBox resultList;
    private HashSet<string> stopWords;

    public MainForm()
    {
        InitializeComponent();
        InitializeStopWords(); // Initialize the stop word list
    }

    private void InitializeStopWords()
    {
        // Chinese and English stop words (a small sample)
        stopWords = new HashSet<string>
        {
            // Chinese stop words
            "的", "了", "在", "是", "我", "和", "有", "就", "不", "人",
            // English stop words
            "a", "an", "the", "is", "are", "and", "in", "on", "at"
        };
    }

    private void AnalyzeButton_Click(object sender, EventArgs e)
    {
        string inputText = inputBox.Text.Trim();
        if (string.IsNullOrEmpty(inputText))
        {
            MessageBox.Show("Please enter some text!");
            return;
        }

        // Segment the text with Jieba (handles mixed Chinese and English)
        var segmenter = new JiebaSegmenter();
        var segments = segmenter.Cut(inputText);

        // Split each segment into English/digit runs and Chinese runs with a regex
        var allWords = new List<string>();
        foreach (var seg in segments)
        {
            // Handle mixed tokens (e.g. "C#編程" -> ["c#", "編程"] after lowercasing)
            var words = Regex.Matches(seg, @"([A-Za-z0-9#+]+)|([\u4e00-\u9fa5]+)")
                .Cast<Match>()
                .Select(m => m.Value.ToLowerInvariant());
            allWords.AddRange(words);
        }

        // Filter out stop words and single-character tokens
        var filteredWords = allWords
            .Where(word => !stopWords.Contains(word) && word.Length >= 2);

        // Count word frequencies and sort in descending order
        var keywordCounts = filteredWords
            .GroupBy(word => word)
            .OrderByDescending(g => g.Count())
            .Select(g => $"{g.Key} ({g.Count()})")
            .ToList();

        // Display the results
        resultList.DataSource = keywordCounts;
    }

    // Initialize the form controls
    private void InitializeComponent()
    {
        this.inputBox = new TextBox();
        this.analyzeButton = new Button();
        this.resultList = new ListBox();

        // Lay out the controls
        this.inputBox.Multiline = true;
        this.inputBox.Location = new System.Drawing.Point(20, 20);
        this.inputBox.Size = new System.Drawing.Size(400, 150);

        this.analyzeButton.Text = "Extract Keywords";
        this.analyzeButton.AutoSize = true; // Let the button grow to fit its caption
        this.analyzeButton.Location = new System.Drawing.Point(20, 180);
        this.analyzeButton.Click += AnalyzeButton_Click;

        this.resultList.Location = new System.Drawing.Point(20, 220);
        this.resultList.Size = new System.Drawing.Size(400, 200);

        this.ClientSize = new System.Drawing.Size(440, 440);
        this.Controls.Add(inputBox);
        this.Controls.Add(analyzeButton);
        this.Controls.Add(resultList);
    }
}

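Feature Overview

Mixed Chinese-English segmentation
- Jieba.NET handles the Chinese word segmentation.
- The regular expression ([A-Za-z0-9#+]+) extracts English words and digits (such as C# and Python3).
Stop word filtering
- A built-in Chinese and English stop word list (such as "的" and "and") filters out words that carry little meaning.
- Tokens shorter than two characters (such as single-character words) are dropped.
Word frequency counting
- Keyword occurrences are counted and sorted by frequency in descending order.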
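
The three features above can be exercised without a form. The following is a minimal console sketch, not the article's exact code: KeywordCounter and its Count method are hypothetical names introduced here, and the hard-coded segments array only stands in for the output of JiebaSegmenter.Cut so that the example runs without the NuGet package.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordCounter
{
    // Hypothetical helper: splits pre-segmented tokens into English/digit runs
    // and Chinese runs, filters them, and counts how often each one occurs.
    public static List<KeyValuePair<string, int>> Count(
        IEnumerable<string> segments, ISet<string> stopWords)
    {
        var pattern = new Regex(@"[A-Za-z0-9#+]+|[\u4e00-\u9fa5]+");
        return segments
            .SelectMany(seg => pattern.Matches(seg).Cast<Match>())
            .Select(m => m.Value.ToLowerInvariant())
            // Drop single-character tokens and stop words
            .Where(w => w.Length >= 2 && !stopWords.Contains(w))
            .GroupBy(w => w)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            // OrderByDescending is stable, so ties keep first-seen order
            .OrderByDescending(kv => kv.Value)
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumption: stand-in for the tokens returned by JiebaSegmenter.Cut
        var segments = new[] { "C#編程", "和", "Python3", "的", "C#", "the", "編程" };
        var stopWords = new HashSet<string> { "的", "和", "the" };

        foreach (var kv in KeywordCounter.Count(segments, stopWords))
        {
            Console.WriteLine($"{kv.Key} ({kv.Value})");
        }
    }
}
```

Keeping the counting logic in a static method separate from the click handler makes it straightforward to unit test and to reuse if the UI changes.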

Extension Suggestions

Load an external stop word list
Load a more comprehensive stop word list from a file (such as stopwords.txt):

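// Requires "using System.IO;". Expects one stop word per line.
private void LoadStopWordsFromFile(string path)
{
    // Trim and lowercase each entry so it matches the lowercased tokens; skip blank lines
    stopWords = new HashSet<string>(
        File.ReadAllLines(path).Select(w => w.Trim().ToLowerInvariant()).Where(w => w.Length > 0));
}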
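
As a self-contained illustration of the same idea, the sketch below loads a stop word file into a case-insensitive set. StopWordLoader is a hypothetical name, and the file format (one word per line, with lines starting with '#' treated as comments) is an assumption made for this example rather than a convention of Jieba.NET.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StopWordLoader
{
    // Hypothetical helper. Assumed format: one stop word per line,
    // blank lines ignored, lines starting with '#' treated as comments.
    public static HashSet<string> Load(string path)
    {
        var words = File.ReadLines(path)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0 && line[0] != '#');

        // Case-insensitive comparison, so "The" in the file also matches "the"
        return new HashSet<string>(words, StringComparer.OrdinalIgnoreCase);
    }
}

public static class Program
{
    public static void Main()
    {
        // Write a small sample file so the example runs on its own
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "# sample list", "The", "", " 的 " });

        var stopWords = StopWordLoader.Load(path);
        Console.WriteLine(stopWords.Contains("the"));
        Console.WriteLine(stopWords.Contains("的"));

        File.Delete(path);
    }
}
```

A case-insensitive comparer is one way to avoid having to normalize the file's casing by hand; lowercasing every entry on load works equally well.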

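Part-of-speech filtering
Use Jieba.NET's part-of-speech tagging to keep only keywords such as nouns and verbs:

// Requires "using JiebaNet.Segmenter.PosSeg;"
var posSegmenter = new PosSegmenter();
var posTags = posSegmenter.Cut(inputText);
// Flags starting with "n" mark nouns; flags starting with "v" mark verbs
var keywords = posTags.Where(tag => tag.Flag.StartsWith("n") || tag.Flag.StartsWith("v"));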

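TF-IDF algorithm
Compute more refined keyword weights with TF-IDF. The jieba.NET project also provides a TfidfExtractor class in its JiebaNet.Analyser namespace, so a separate TF-IDF library may not be needed.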
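
To make the weighting idea concrete, here is a minimal hand-rolled sketch of TF-IDF over documents that have already been segmented. It is not Jieba.NET's implementation: the TfIdf class is a hypothetical name, the sample documents are made up for illustration, and the smoothed formula idf = ln((1 + N) / (1 + df)) + 1 is just one common variant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdf
{
    // Scores every term of docs[target] as tf * idf, where
    //   tf  = occurrences in the document / document length
    //   idf = ln((1 + N) / (1 + df)) + 1   (smoothed variant, an assumption)
    public static Dictionary<string, double> Score(string[][] docs, int target)
    {
        // Document frequency: in how many documents each term appears
        var df = docs
            .SelectMany(d => d.Distinct())
            .GroupBy(t => t)
            .ToDictionary(g => g.Key, g => g.Count());

        var doc = docs[target];
        return doc
            .GroupBy(t => t)
            .ToDictionary(
                g => g.Key,
                g => (double)g.Count() / doc.Length
                     * (Math.Log((1.0 + docs.Length) / (1.0 + df[g.Key])) + 1.0));
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumption: each inner array is one document, already segmented and filtered
        var docs = new[]
        {
            new[] { "c#", "編程", "c#", "教程" },
            new[] { "python", "編程" },
            new[] { "java", "編程" },
        };

        foreach (var pair in TfIdf.Score(docs, 0).OrderByDescending(p => p.Value))
        {
            Console.WriteLine($"{pair.Key}: {pair.Value:F3}");
        }
    }
}
```

In this sample, "編程" appears in every document, so it scores lower than "教程" even though both occur once in the first document. That is the effect TF-IDF adds over plain frequency counting.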

Chinese Word Segmentation with Jieba.NET

Once the package is installed, you can use Jieba.NET for Chinese word segmentation in your .NET project. Here is a simple example:

using JiebaNet.Segmenter;
using System;
 
class Program
{
    static void Main(string[] args)
    {
        var segmenter = new JiebaSegmenter();
        string text = "我愛北京天安門";
        var words = segmenter.Cut(text);
        foreach (var word in words)
        {
            Console.WriteLine(word);
        }
    }
}

In the example above, we first create a JiebaSegmenter instance and then call its Cut method to segment the string "我愛北京天安門". The result is returned as an IEnumerable<string>, which we iterate over to print each word.

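Choosing a Segmentation Mode

Jieba.NET provides three segmentation modes: accurate mode, full mode, and search engine mode. Choose whichever fits your needs.

Accurate mode: tries to cut the sentence as precisely as possible; suitable for text analysis.
Full mode: scans out every sequence in the sentence that can form a word; very fast, but it cannot resolve ambiguity.
Search engine mode: builds on accurate mode by splitting long words again to improve recall; suitable for search engine indexing.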

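Accurate mode is the default for Cut. Full mode is selected with the cutAll parameter of Cut, and search engine mode has its own CutForSearch method, for example:

var fullWords = segmenter.Cut(text, cutAll: true); // full mode
var searchWords = segmenter.CutForSearch(text);    // search engine mode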

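Adding Custom Dictionary Entries

Jieba.NET also supports custom dictionary entries: you can add specific terms to the dictionary to make sure they are recognized as single words. For example:

segmenter.AddWord("天安門廣場"); // Add "天安門廣場" to the dictionary

Once a term has been added, Jieba.NET treats it as a single unit when segmenting text that contains it.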

That covers how to extract keywords from mixed Chinese-English text with C# and Jieba.NET.