詳解Huffman編碼算法之Java實(shí)現(xiàn)
Huffman編碼介紹
Huffman編碼處理的是字符以及字符對(duì)應(yīng)的二進(jìn)制的編碼配對(duì)問題,分為編碼和解碼,目的是壓縮字符對(duì)應(yīng)的二進(jìn)制數(shù)據(jù)長度。我們知道字符存貯和傳輸?shù)臅r(shí)候都是二進(jìn)制的(計(jì)算機(jī)只認(rèn)識(shí)0/1),那么就有字符與二進(jìn)制之間的mapping關(guān)系。字符屬于字符集(Charset), 字符需要通過編碼(encode)為二進(jìn)制進(jìn)行存貯和傳輸,顯示的時(shí)候需要解碼(decode)回字符,字符集與編碼方法是一對(duì)多關(guān)系(Unicode可以用UTF-8,UTF-16等編碼)。理解了字符集,編碼以及解碼,滿天飛的亂碼問題也就游刃而解了。以英文字母小寫a為例, ASCII編碼中,十進(jìn)制為97,二進(jìn)制為01100001。ASCII的每一個(gè)字符都用8個(gè)Bit(1Byte)編碼,假如有1000個(gè)字符要傳輸,那么就要傳輸8000個(gè)Bit。問題來了,英文中字母e的使用頻率為12.702%,而z為0.074%,前者是后者的100多倍,但是確使用相同位數(shù)的二進(jìn)制??梢宰龅酶?,方法就是可變長度編碼,指導(dǎo)原則就是頻率高的用較短的位數(shù)編碼,頻率低的用較長位數(shù)編碼。Huffman編碼算法就是處理這樣的問題。
Huffman編碼Java實(shí)現(xiàn)
Huffman編碼算法主要用到的數(shù)據(jù)結(jié)構(gòu)是完全二叉樹(full binary tree)和優(yōu)先級(jí)隊(duì)列。后者用的是Java.util.PriorityQueue,前者自己實(shí)現(xiàn)(都為內(nèi)部類),代碼如下:
static class Tree {
private Node root;
public Node getRoot() {
return root;
}
public void setRoot(Node root) {
this.root = root;
}
}
static class Node implements Comparable<Node> {
private String chars = "";
private int frequence = 0;
private Node parent;
private Node leftNode;
private Node rightNode;
@Override
public int compareTo(Node n) {
return frequence - n.frequence;
}
public boolean isLeaf() {
return chars.length() == 1;
}
public boolean isRoot() {
return parent == null;
}
public boolean isLeftChild() {
return parent != null && this == parent.leftNode;
}
public int getFrequence() {
return frequence;
}
public void setFrequence(int frequence) {
this.frequence = frequence;
}
public String getChars() {
return chars;
}
public void setChars(String chars) {
this.chars = chars;
}
public Node getParent() {
return parent;
}
public void setParent(Node parent) {
this.parent = parent;
}
public Node getLeftNode() {
return leftNode;
}
public void setLeftNode(Node leftNode) {
this.leftNode = leftNode;
}
public Node getRightNode() {
return rightNode;
}
public void setRightNode(Node rightNode) {
this.rightNode = rightNode;
}
}
統(tǒng)計(jì)數(shù)據(jù)
既然要按頻率來安排編碼表,那么首先當(dāng)然得獲得頻率的統(tǒng)計(jì)信息。我實(shí)現(xiàn)了一個(gè)方法處理這樣的問題。如果已經(jīng)有統(tǒng)計(jì)信息,那么轉(zhuǎn)為Map<Character,Integer>即可。如果你得到的信息是百分比,乘以100或1000,或10000??偸强梢赞D(zhuǎn)為整數(shù)。比如12.702%乘以1000為12702,Huffman編碼只關(guān)心大小問題。統(tǒng)計(jì)方法實(shí)現(xiàn)如下:
public static Map<Character, Integer> statistics(char[] charArray) {
Map<Character, Integer> map = new HashMap<Character, Integer>();
for (char c : charArray) {
Character character = new Character(c);
if (map.containsKey(character)) {
map.put(character, map.get(character) + 1);
} else {
map.put(character, 1);
}
}
return map;
}
構(gòu)建樹
構(gòu)建樹是Huffman編碼算法的核心步驟。思想是把所有的字符掛到一顆完全二叉樹的葉子節(jié)點(diǎn),任何一個(gè)非頁子節(jié)點(diǎn)的左節(jié)點(diǎn)出現(xiàn)頻率不大于右節(jié)點(diǎn)。算法為把統(tǒng)計(jì)信息轉(zhuǎn)為Node存放到一個(gè)優(yōu)先級(jí)隊(duì)列里面,每一次從隊(duì)列里面彈出兩個(gè)最小頻率的節(jié)點(diǎn),構(gòu)建一個(gè)新的父Node(非葉子節(jié)點(diǎn)), 字符內(nèi)容剛彈出來的兩個(gè)節(jié)點(diǎn)字符內(nèi)容之和,頻率也是它們的和,最開始的彈出來的作為左子節(jié)點(diǎn),后面一個(gè)作為右子節(jié)點(diǎn),并且把剛構(gòu)建的父節(jié)點(diǎn)放到隊(duì)列里面。重復(fù)以上的動(dòng)作N-1次,N為不同字符的個(gè)數(shù)(每一次隊(duì)列里面?zhèn)€數(shù)減1)。結(jié)束以上步驟,隊(duì)列里面剩一個(gè)節(jié)點(diǎn),彈出作為樹的根節(jié)點(diǎn)。代碼如下:
private static Tree buildTree(Map<Character, Integer> statistics,
List<Node> leafs) {
Character[] keys = statistics.keySet().toArray(new Character[0]);
PriorityQueue<Node> priorityQueue = new PriorityQueue<Node>();
for (Character character : keys) {
Node node = new Node();
node.chars = character.toString();
node.frequence = statistics.get(character);
priorityQueue.add(node);
leafs.add(node);
}
int size = priorityQueue.size();
for (int i = 1; i <= size - 1; i++) {
Node node1 = priorityQueue.poll();
Node node2 = priorityQueue.poll();
Node sumNode = new Node();
sumNode.chars = node1.chars + node2.chars;
sumNode.frequence = node1.frequence + node2.frequence;
sumNode.leftNode = node1;
sumNode.rightNode = node2;
node1.parent = sumNode;
node2.parent = sumNode;
priorityQueue.add(sumNode);
}
Tree tree = new Tree();
tree.root = priorityQueue.poll();
return tree;
}
編碼
某個(gè)字符對(duì)應(yīng)的編碼為,從該字符所在的葉子節(jié)點(diǎn)向上搜索,如果該字符節(jié)點(diǎn)是父節(jié)點(diǎn)的左節(jié)點(diǎn),編碼字符之前加0,反之如果是右節(jié)點(diǎn),加1,直到根節(jié)點(diǎn)。只要獲取了字符和二進(jìn)制碼之間的mapping關(guān)系,編碼就非常簡單。代碼如下:
public static String encode(String originalStr,
Map<Character, Integer> statistics) {
if (originalStr == null || originalStr.equals("")) {
return "";
}
char[] charArray = originalStr.toCharArray();
List<Node> leafNodes = new ArrayList<Node>();
buildTree(statistics, leafNodes);
Map<Character, String> encodInfo = buildEncodingInfo(leafNodes);
StringBuffer buffer = new StringBuffer();
for (char c : charArray) {
Character character = new Character(c);
buffer.append(encodInfo.get(character));
}
return buffer.toString();
}
private static Map<Character, String> buildEncodingInfo(List<Node> leafNodes) {
Map<Character, String> codewords = new HashMap<Character, String>();
for (Node leafNode : leafNodes) {
Character character = new Character(leafNode.getChars().charAt(0));
String codeword = "";
Node currentNode = leafNode;
do {
if (currentNode.isLeftChild()) {
codeword = "0" + codeword;
} else {
codeword = "1" + codeword;
}
currentNode = currentNode.parent;
} while (currentNode.parent != null);
codewords.put(character, codeword);
}
return codewords;
}
解碼
因?yàn)镠uffman編碼算法能夠保證任何的二進(jìn)制碼都不會(huì)是另外一個(gè)碼的前綴,解碼非常簡單,依次取出二進(jìn)制的每一位,從樹根向下搜索,1向右,0向左,到了葉子節(jié)點(diǎn)(命中),退回根節(jié)點(diǎn)繼續(xù)重復(fù)以上動(dòng)作。代碼如下:
public static String decode(String binaryStr,
Map<Character, Integer> statistics) {
if (binaryStr == null || binaryStr.equals("")) {
return "";
}
char[] binaryCharArray = binaryStr.toCharArray();
LinkedList<Character> binaryList = new LinkedList<Character>();
int size = binaryCharArray.length;
for (int i = 0; i < size; i++) {
binaryList.addLast(new Character(binaryCharArray[i]));
}
List<Node> leafNodes = new ArrayList<Node>();
Tree tree = buildTree(statistics, leafNodes);
StringBuffer buffer = new StringBuffer();
while (binaryList.size() > 0) {
Node node = tree.root;
do {
Character c = binaryList.removeFirst();
if (c.charValue() == '0') {
node = node.leftNode;
} else {
node = node.rightNode;
}
} while (!node.isLeaf());
buffer.append(node.chars);
}
return buffer.toString();
}
測(cè)試以及比較
以下測(cè)試Huffman編碼的正確性(先編碼,后解碼,包括中文),以及Huffman編碼與常見的字符編碼的二進(jìn)制字符串比較。代碼如下:
public static void main(String[] args) {
String oriStr = "Huffman codes compress data very effectively: savings of 20% to 90% are typical, "
+ "depending on the characteristics of the data being compressed. 中華崛起";
Map<Character, Integer> statistics = statistics(oriStr.toCharArray());
String encodedBinariStr = encode(oriStr, statistics);
String decodedStr = decode(encodedBinariStr, statistics);
System.out.println("Original sstring: " + oriStr);
System.out.println("Huffman encoed binary string: " + encodedBinariStr);
System.out.println("decoded string from binariy string: " + decodedStr);
System.out.println("binary string of UTF-8: "
+ getStringOfByte(oriStr, Charset.forName("UTF-8")));
System.out.println("binary string of UTF-16: "
+ getStringOfByte(oriStr, Charset.forName("UTF-16")));
System.out.println("binary string of US-ASCII: "
+ getStringOfByte(oriStr, Charset.forName("US-ASCII")));
System.out.println("binary string of GB2312: "
+ getStringOfByte(oriStr, Charset.forName("GB2312")));
}
public static String getStringOfByte(String str, Charset charset) {
if (str == null || str.equals("")) {
return "";
}
byte[] byteArray = str.getBytes(charset);
int size = byteArray.length;
StringBuffer buffer = new StringBuffer();
for (int i = 0; i < size; i++) {
byte temp = byteArray[i];
buffer.append(getStringOfByte(temp));
}
return buffer.toString();
}
public static String getStringOfByte(byte b) {
StringBuffer buffer = new StringBuffer();
for (int i = 7; i >= 0; i--) {
byte temp = (byte) ((b >> i) & 0x1);
buffer.append(String.valueOf(temp));
}
return buffer.toString();
}
以上就是本文的全部內(nèi)容,希望對(duì)大家的學(xué)習(xí)有所幫助,也希望大家多多支持腳本之家。
相關(guān)文章
jeefast和Mybatis實(shí)現(xiàn)三級(jí)聯(lián)動(dòng)的示例代碼
這篇文章主要介紹了jeefast和Mybatis實(shí)現(xiàn)三級(jí)聯(lián)動(dòng)的示例代碼,代碼簡單易懂,對(duì)大家的學(xué)習(xí)或工作具有一定的參考借鑒價(jià)值,需要的朋友可以參考下2020-10-10
SpringBoot+MyBatis-Plus實(shí)現(xiàn)分頁功能
在SpringBoot項(xiàng)目中,結(jié)合MyBatis-Plus(簡稱MP)可以非常方便地實(shí)現(xiàn)分頁功能,MP為開發(fā)者提供了分頁插件PaginationInterceptor,只需簡單配置即可使用,本文給大家介紹了SpringBoot+MyBatis-Plus實(shí)現(xiàn)分頁功能,文中通過代碼示例給大家介紹的非常詳細(xì),需要的朋友可以參考下2024-01-01
Spring Security的持久化用戶和授權(quán)實(shí)現(xiàn)方式
文章介紹了如何使用JdbcUserDetailsManager實(shí)現(xiàn)數(shù)據(jù)庫讀取用戶,并展示了如何配置SpringSecurity進(jìn)行授權(quán)管理,通過創(chuàng)建數(shù)據(jù)庫表、配置數(shù)據(jù)庫連接和修改SecurityConfig,實(shí)現(xiàn)了用戶權(quán)限的控制2025-02-02
webservice實(shí)現(xiàn)springboot項(xiàng)目間接口調(diào)用與對(duì)象傳遞示例
本文主要介紹了webservice實(shí)現(xiàn)springboot項(xiàng)目間接口調(diào)用與對(duì)象傳遞示例,文中通過示例代碼介紹的非常詳細(xì),對(duì)大家的學(xué)習(xí)或者工作具有一定的參考學(xué)習(xí)價(jià)值,需要的朋友們下面隨著小編來一起學(xué)習(xí)學(xué)習(xí)吧2022-07-07
Java集合之Set接口及其實(shí)現(xiàn)類精解
set接口是繼承自Collection的子接口,特點(diǎn)是元素不重復(fù),存儲(chǔ)無序。在set接口的實(shí)現(xiàn)類中添加重復(fù)元素是不會(huì)成功的,判斷兩個(gè)元素是否重復(fù)根據(jù)元素類重寫的2021-09-09
分析Java中ArrayList與LinkedList列表結(jié)構(gòu)的源碼
這篇文章主要介紹了Java中ArrayList與LinkedList列表結(jié)構(gòu)的源碼,文章最后對(duì)LinkedList和ArrayList以及Vector的特性有一個(gè)對(duì)比總結(jié),需要的朋友可以參考下2016-05-05
Java線程安全和鎖Synchronized知識(shí)點(diǎn)詳解
在本篇文章里小編給大家分享的是關(guān)于Java線程安全和鎖Synchronized相關(guān)知識(shí)點(diǎn),有需要的朋友們可以參考下。2019-08-08

