April 19, 2014
Hot Topics:
RSS RSS feed Download our iPhone app

Understanding the Huffman Data Compression Algorithm in Java

  • May 2, 2006
  • By Richard G. Baldwin
  • Send Email »
  • More Articles »

Java Programming Notes # 2442


Preface

This is the second lesson in a series of lessons that will teach you about data and image compression.  The series began with the lesson entitled Understanding the Lempel-Ziv Data Compression Algorithm in Java (commonly known as LZ77).

Different variations of the LZ algorithms, the Huffman algorithm, and other compression algorithms are often combined in data and image compression programs.  For example, numerous sources on the web indicate that commercially available zip programs often incorporate something called DEFLATE.  According to Wikipedia,

"DEFLATE is a lossless data compression algorithm that uses a combination of the LZ77 algorithm and Huffman coding.  It was originally defined by Phil Katz for version 2 of his PKZIP archiving tool, and was later specified in RFC 1951."

This lesson will teach you about Huffman coding.

Future lessons will cover a variety of compression schemes, including:

  • Run-length data encoding
  • GIF image compression
  • JPEG image compression

Use at your own risk

The programs that I will provide in these lessons are provided for educational purposes only.  If you use these programs for any purpose, you are using them at your own risk.  I accept no responsibility for any damages that you may incur as a result of the use of these programs.

Viewing tip

You may find it useful to open another copy of this lesson in a separate browser window.  That will make it easier for you to scroll back and forth among the different listings and figures while you are reading about them.

Supplementary material

I recommend that you also study the other lessons in my extensive collection of online Java tutorials.  You will find those lessons published at Gamelan.com.  However, as of the date of this writing, Gamelan doesn't maintain a consolidated index of my Java tutorial lessons, and sometimes they are difficult to locate there.  You will find a consolidated index at www.DickBaldwin.com.

I particularly recommend that you study my earlier lessons in the section entitled References in preparation for understanding the material in this lesson.

General Background Information

Most of us use data or image compression on a daily basis without even thinking about it.  If you use one of the popular zip programs to archive your data, you are using a program that typically implements several different data compression algorithms in combination.  If you take pictures with a digital camera, you are probably creating files that describe images using the JPEG image compression format.

What is Huffman coding?

According to Wikipedia,

"... Huffman coding is an entropy encoding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding a source symbol (such as a character in a file) where the variable-length code table has been derived in a particular way based on the estimated probability of occurrence for each possible value of the source symbol."

Other variable-length entropy coding systems

Huffman coding is not the only encoding scheme to use variable-length code based on the probability of occurrence of the characters in a message.  For example, the International Morse Code, originally created by Samuel Morse in the mid-1830s is a variable-length code, apparently based on his concept of the probability of occurrence of the letters in the English alphabet.  For example, the code for the letter E is the shortest code.  The letter E occurs very frequently in English text.  The longest code is for the number 0.  Excluding the numbers, however, the two longest codes are for the letters Q and W.  The letter Q doesn't appear very often in English Text.

(As an aside, one might surmise that similar probability tables were used to develop the QWERTY keyboard where the key for the letter E is relatively easy to strike.  On the other hand, the key for the letter Q is more difficult for many people to strike because it requires the use of the little finger on the left hand.)

Morse is not a prefix-free code

While Morse code is easy for a trained radio operator to understand (this author was a radio operator in the U.S. Air Force for several years), the code has a characteristic that makes it difficult, or at least inefficient to implement using a computer.  In particular, the code sequences for the most probable characters occur as the beginning parts of the code sequences for other characters.  (For example, the entire code sequence for the letter E occurs as the first element in the codes for many other letters.)  A trained human can easily deal with that situation but a significant amount of overhead is required to cause a computer to deal with it.

Huffman is a prefix-free code

Huffman coding solves this problem.  The methodology used for Huffman coding results in a prefix-free code.  A prefix-free code is one in which the bit coding sequence representing some particular character is never a prefix of the bit coding sequence representing any other character.  For example, here is a possible bit sequence for a Huffman code on an alphabet with four characters where D is the most probable and A is the least probable:

A 110
D 0
C 10
B 111

Code length is based on probability of occurrence

As with Morse code, the methodology used for the Huffman coding causes the bit coding sequence to be shortest for the most frequently occurring characters, and causes the coding sequence to be longest for the least frequently occurring characters.  Unlike the Morse code, however, the probabilities for Huffman are usually determined on a message-by-message basis instead of being based on some general overall expectation of the probability of occurrence of the characters.

Message-by-message probabilities

The upside of determining the probabilities on a message-by-message basis is that the encoding can be optimized for each specific message.  The downside is that the probability encoding table used to encode a specific message must also be used to decode the message.  The requirement to transport the encoding table along with the message adds some overhead to the overall process.

An optimal encoding scheme

Once again, according to Wikipedia,

"Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code."

Some caution is advised

Wikipedia goes on to caution us,

"Assertions of the optimality of Huffman coding should be phrased carefully, because its optimality can sometimes accidentally be over-stated. For example, arithmetic coding ordinarily has better compression capability, because it does not require the use of an integer number of bits for encoding each source symbol. LZW coding can also often be more efficient, particularly when the input symbols are not independently-distributed, because it does not depend on encoding each input symbol one at a time (instead, it batches up a variable number of input symbols into each encoded syntax element). The efficiency of Huffman coding also depends heavily on having a good estimate of the true probability of the value of each input symbol."

As mentioned earlier, some data compression schemes combine Huffman coding with one of the Lempel-Ziv (LZ) coding schemes to get the best of both worlds.

A binary tree

As you will soon see, Huffman coding works by creating a binary tree of nodes, with each node being either a leaf node or an internal node.

(A picture of an actual Huffman binary tree, along with a corresponding example, can be seen at the Binary Essence web site.  Additional Huffman binary trees are also shown in Figure 6, Figure 8, and Figure 10 later in this lesson.)

All nodes are initially leaf nodes, and there is one leaf node for every character in the message being compressed.  Then the leaf nodes are combined with internal nodes to form the tree.

As mentioned above, here is one leaf node for each character in the message.  A leaf node contains the character and the frequency of usage for that character.

Internal nodes contain links to two child nodes plus a frequency which is the sum of the frequencies of the two child nodes.

A variable-length bit sequence

A different, variable-length bit sequence is assigned to each character used in the message.  The specific bit sequence assigned to an individual character is determined by tracing out the path from the root of the tree to the leaf that represents that character.  By convention, bit '0' represents following the left child when tracing out the path and bit '1' represents following the right child when tracing out the path.

Path lengths are different

The tree is constructed such that the paths from the root to the most frequently used characters are short while the paths to less frequently used characters are long.  This results in short codes for frequently used characters and long codes for less frequently used characters.

Preview

The program named Huffman01

In this lesson, I will present and explain a program named Huffman01, which illustrates the encoding and subsequent decoding of a text message using the Huffman encoding algorithm.

Get and use an encoder object

Overall control of this operation of this program takes place in the main method.  I will begin by instantiating an object of the HuffmanEncoder class and by invoking the encode method on that object from the main method.

Create a frequency chart

Inside the encode method, I will invoke the createFreqData method to create a frequency table that identifies each of the individual characters in the original message and the number of times (frequency) that each character appears in the message being compressed.

Create the leaves and construct the tree

Next, I will invoke the createLeaves method to create a HuffLeaf object for each character identified in the frequency table.  I will store the HuffLeaf objects in a TreeSet object.  Each HuffLeaf object encapsulates the character that it represents as well as the number of times that the character appears in the original message (frequency).

Then I will invoke the createHuffTree method to assemble the HuffLeaf objects into a Huffman tree (a HuffTree object).  A Huffman tree is a special form of a binary tree consisting of properly linked HuffNode and HuffLeaf objects.

When the createHuffTree method returns, the HuffTree object will remain as the only object stored in the TreeSet object that previously contained all of the HuffLeaf objects.  This is because all of the HuffLeaf objects will have been combined with HuffNode objects to form the tree.  When two HuffLeaf objects are combined with a single HuffNode object, the two HuffLeaf objects are removed from the TreeSet object, and the HuffNode object is added to the TreeSet object.

Create the bit codes

Following that, I will invoke the createBitCodes method, which uses the Huffman tree in a recursive manner to create a bit code for each character in the message.

(As mentioned earlier, the bit codes are different lengths with the shorter codes corresponding to the characters with a high frequency value and the longer codes corresponding to the characters with the lower frequency values.)

The createBitCodes method populates a data structure that is used to encode the message and is also required later to decode the encoded message.

Dealing with a difficult bit-manipulation challenge

At this point in the execution of the program, I know the variable-length bit code that is required to replace each character in the original message to produce a Huffman-encoded message.

(The compression provided by Huffman encoding depends on the frequently used characters having short bit codes and the less frequently used characters having longer bit codes.)

Although I know the bit code required to replace each character in the original message, a direct transformation from characters in the message to a stream of contiguous bit codes is something of a challenge.  The computer's memory is organized on 8-bit boundaries.  I am unaware of any capability in Java that allows the memory to be viewed simply as a continuous sequence of individual bits.

(Note that it may be possible to accomplish this by using a Java BitSet object.  I may give that a try someday when I have the time.)

This program deals with this challenge in a way that is straightforward, but is probably inefficient from both a speed and memory requirements viewpoint. 

Not a production compression program

This program was specifically designed to serve its primary purpose of education.  No thought or effort was given to speed, efficiency, memory utilization, or any other factor that would be important in a program written for production data compression purposes.  In some cases, the program was purposely made less efficient (in the name of clarity) by using two or more statements to accomplish a task that could be accomplished by a single more complex statement.

The solution to the bit-manipulation challenge

The solution to the bit-manipulation challenge mentioned above was to do a simple table lookup and to create a long String object consisting of only 1 and 0 characters.  Each character in the original message is represented by a substring that matches the required bit code.

This is easy to accomplish because (unlike a long sequence of bits) there are no artificial boundaries requiring the length of the String to be some multiple of a fixed number of characters.

I will invoke the encodeToString method to encode the message into a String representation of the bits that will make up the final encoded message.  After the String containing 1 and 0 characters representing the bits in the Huffman-encoded message is created, this String will be processed to produce the Huffman-encoded message in a binary bit stream format.

Creating the binary bit stream format

Creation of the Huffman-encoded message in a binary bit stream format is accomplished using another lookup table containing 256 entries (the number of possible combinations of eight bits).

I will invoke the buildEncodingBitMap method to populate a lookup table that relates eight bits represented as a String to every possible combination of eight actual bits.  Then I will invoke the encodeStringToBits method to encode the String representation of the bits that make up the encoded message into the actual bits that make up the encoded message.

Extraneous characters at the end

The encodeStringToBits method doesn't handle the end of the message very gracefully for those cases where the number of required bits is not a multiple of 8.  The method simply adds enough "0" characters to the end of the String to cause the length to be a multiple of 8.  This will usually result in extraneous characters at the end of the decoded message later.

Some mechanism must be found to eliminate the extraneous characters when decoding the message later.  This program assumes that the length of the original message is preserved and provided to the decoding software along with the required decoding table.  Since the length of the decoded message must match the length of the original message, this value is used to eliminate extraneous characters at the end of the decoded message.

Having created the Huffman-encoded message in a binary bit stream format, I will return the encoded message from the encode method back to the main method.

Hexadecimal display

At this point in the program, the message has been Huffman encoded.  Back in the main method, I will provide the capability to display the binary encoded data in Hexadecimal format for comparison with the original message.

Decode the encoded message

The program continues the demonstration by decoding and displaying the Huffman-encoded message.

I will begin the decoding process by instantiating a HuffmanDecoder object from within the main method.  Then I will invoke the decode method on the HuffmanDecoder object to decode the message.

I will pass the encoded message along with a reference to a data structure containing encoding particulars and the length of the original message to the decode method so that extraneous characters on the end can be eliminated.

Decode from binary to String representation

Inside the decode method, I will invoke the buildDecodingBitMap method to create a decoding bit map, which is essentially the reverse of the encoding bit map that was used to encode the original message.

I will invoke the decodeToBitsAsString method to decode the encoded message from a binary bit stream representation to a String of 1 and 0 characters that represent the actual bits in the encoded message.

Decode from string representation back to the original characters

I will invoke the buildHuffDecodingTable method to create a Huffman decoding table by swapping the keys and the values from the Huffman encoding table received as an incoming parameter by the decode method.

Finally, I will invoke the decodeStringBitsToCharacters method to decode the String containing only 1 and 0 characters that represent the bits in the encoded message.  This will produce a replica of the original message that was subjected to Huffman encoding.

Remove extraneous characters and return the decoded message

I will write the resulting decoded message into a String object and return the String with any extraneous characters at the end having been removed.

Display results at critical points in the process

Numerous opportunities will be provided to enable code that will display information that is useful towards gaining an understanding of the Huffman encoding algorithm.  I will present and discuss much of that information in this lesson.

Program testing

The program was tested using J2SE 5.0 and WinXP.  The program requires J2SE 5.0 or later to support generics.

Sample messages

Three test messages are hard-coded into the program.  You can switch among those messages by enabling and disabling them using comments and by then recompiling the program.  You can also insert your own test message and recompile the program to see the result of compressing your message.

Discussion and Sample Code

The class named Huffman01

I will discuss this program in fragments.  You can view a complete listing of the program in Listing 45 near the end of the lesson.

The class definition for the class named Huffman01 along with the main method begins in Listing 1.

public class Huffman01{
  
  public static void main(String[] args){

    Hashtable <Character,String>huffEncodeTable;

Listing 1

Listing 1 declares a data structure that is used to communicate encoding particulars from the Huffman encoder to the Huffman decoder.  Because the encoding particulars are different for every message, this is necessary for the decoder to be able to decode the encoded message.

Test messages

Listing 2 creates and displays the raw test message that will be encoded.  The message is displayed 48 characters to the line.

/*
    //The following test message was copied directly from
    // an Internet news site.  It is probably
    // representative of typical English text.
    String rawData = "BAGHDAD, Iraq Violence increased "
    + "across Iraq after a lull following the Dec. 15 "
    + "parliamentary elections, with at least two dozen "
    + "people including a U.S. soldier killed Monday in "
    + "shootings and bombings mostly targeting the Shiite-"
    + "dominated security services. The Defense Ministry "
    + "director of operations, Brig. Gen. Abdul Aziz "
    + "Mohammed-Jassim, blamed increased violence in the "
    + "past two days on insurgents trying to deepen the "
    + "political turmoil following the elections. The "
    + "violence came as three Iraqi opposition groups "
    + "threatened another wave of protests and civil "
    + "disobedience if allegations of fraud are not "
    + "properly investigated.";
*/
/*
    String rawData = "Now is the time for all good men "
    + "to come to the aid of their country.";
*/

    //Use the following test message or some other 
    // similarly short test message to illustrate the
    // construction of the HuffTree object.
    String rawData = "AAAAABBBBCCCDDE";

    System.out.println("Raw Data");
    display48(rawData);

Listing 2

As mentioned earlier, you can modify the comment indicators to enable any one of the test messages in Listing 2, or you can insert a test message of your own and then recompile the program.

Listing 2 invokes the utility method named display48 to display the message 48 characters to the line.  That method is straightforward and shouldn't require an explanation.  You can view the method in its entirety in Listing 45.

Program output

The text at the top of Figure 1 was produced by the code in Listing 2 after modifying the comment indicators to enable the news story shown in Listing 2.

Raw Data
BAGHDAD, Iraq Violence increased across Iraq aft
er a lull following the Dec. 15 parliamentary el
ections, with at least two dozen people includin
g a U.S. soldier killed Monday in shootings and
bombings mostly targeting the Shiite-dominated s
ecurity services. The Defense Ministry director
of operations, Brig. Gen. Abdul Aziz Mohammed-Ja
ssim, blamed increased violence in the past two
days on insurgents trying to deepen the politica
l turmoil following the elections. The violence
came as three Iraqi opposition groups threatened
another wave of protests and civil disobedience
if allegations of fraud are not properly invest
igated.

Number raw data bits: 5048
Number binary encoded data bits: 2816
Compression factor: 1.7926136363636365

Binary Encoded Data in Hexadecimal Format
b889531b2812a0668f6b9c2eaf9759dceb2c86d996148
9a3c1ae734124336fabffe82^76Lempel77^7e9f4e50563e1a8
778c1fe684bcb0b8fcae7a9d0668dd3^81ssim82^2c93a6356b
54717de1af3e773^84Ziv85^3c4f92bf6e2186fdff
9b689a7b30bb9f27b553cf7a667b68b5c5e7bd343493f17
4b5e679efd3e9d11ef74911b543ced26d90ad50e91721b
bcb283e52f4e50602e8744f3f248badc057341aa0d5e16
9ea7419ae21d78f94cb8f84c5b2bfc263ba3b44d7ad0c26c
2235644e84668bf684db73acb21b6bbd5f2eb3b9f4fa778
c93a6356b3a9a9f73a2a179794d202f3dfa6b589fbe9f4
^104LZ105LZ106GIF107108df109Lempel110^7e9f4e3f2b9ea741f
297a75deaf9759cb684e64d3e813a3c1ae7b579e5274f53e
bc295f134fa05a4b9b667a9f4868cd74ea8378152525333d
b2faef7f5b92a29b71759dd066ffcded3d4e9aa0d032b6c
c073d4^117LZ118^21f1773dd293d7b49be

Decoded Data
BAGHDAD, Iraq Violence increased across Iraq aft
er a lull following the Dec. 15 parliamentary el
ections, with at least two dozen people includin
g a U.S. soldier killed Monday in shootings and
bombings mostly targeting the Shiite-dominated s
ecurity services. The Defense Ministry director
of operations, Brig. Gen. Abdul Aziz Mohammed-Ja
ssim, blamed increased violence in the past two
days on insurgents trying to deepen the politica
l turmoil following the elections. The violence
came as three Iraqi opposition groups threatened
another wave of protests and civil disobedience
if allegations of fraud are not properly invest
igated.
Figure 1

A compression factor of 1.793

The statistics following the test message in Figure 1 show that the compression factor achieved by the Huffman algorithm for this particular message was 1.793.  This can also be thought of as a compression ratio of 0.558.  In other words, the compressed message requires 55.8 percent of the number of bits required by the original uncompressed message.

As you will see later, different messages result in different compression factors.

The binary encoded test message

The block of text near the middle of Figure 1 shows the binary version of the encoded message displayed in Hexadecimal format.  At the surface, this may appear to be longer than the original message.  Recall, however, that in the display of the original message at the top of Figure 1, each character represents eight bits.  However, in the Hexadecimal display, each character represents only four bits.  As mentioned above, the number of actual bits in the compressed message was only 55.8 percent of the number of bits in the original message.

The decoded message

The bottom block of text in Figure 1 is a replica of the original message that was produced by decoding the binary version of the encoded message.  Hopefully, this is an exact copy of the original message at the top of Figure 1, which is a requirement for a lossless compression algorithm.

Display raw data length

Listing 3 gets and displays the length of the test message in bits, as shown in the upper half of Figure 1.

    int rawDataLen = rawData.length();
    
    System.out.println("nNumber raw data bits: " 
                                   + rawData.length() * 8);

Listing 3

Instantiate a HuffmanEncoder object

Listing 4 instantiates a new object of the HuffmanEncoder class.  The instance method named encode belonging to that object will be used to encode the test message using the Huffman compression algorithm.

    HuffmanEncoder encoder = new HuffmanEncoder();

Listing 4

Encode the Message

Listing 5 begins by instantiating a new Hashtable object that will be passed to the encode method to be populated with encoding particulars.  This object will be used to encode the message and will also be required later to decode the message.

Still in the main method, Listing 5 invokes the encode method of the HuffmanEncoder object to perform the actual encoding.  (Control is transferred from the main method to the encode method.)

The test message and the Hashtable mentioned above are passed as parameters to the encode method.

    huffEncodeTable = new Hashtable<Character,String>();

    ArrayList<Byte> binaryEncodedData = encoder.encode(
                                  rawData,huffEncodeTable);

Listing 5

The encoded message is received back later as bytes stored in an ArrayList object.

The HuffmanEncoder class

At this point, I am going to set the main method aside and explain the encode method of the HuffmanEncoder class.  I will return to the discussion of the main method later.

The HuffmanEncoder class definition begins in Listing 6.  An object of this class can be used to encode a message using the Huffman encoding algorithm.

class HuffmanEncoder{
  String rawData;  
  TreeSet <HuffTree>theTree = new TreeSet<HuffTree>();
  ArrayList <Byte>binaryEncodedData = 
                                     new ArrayList<Byte>();
  Hashtable <Character,Integer>frequencyData = 
                        new Hashtable<Character,Integer>();
  StringBuffer code = new StringBuffer();
  Hashtable <Character,String>huffEncodeTable;
  String stringEncodedData;
  Hashtable <String,Byte>encodingBitMap = 
                              new Hashtable<String,Byte>();

Listing 6

Declaration of instance variables

Listing 6 contains the declaration of several instance variables along with the initialization of some of them.  I will discuss the instance variables in conjunction with the code that uses them later.

(By the way, in case you are unfamiliar with the syntax of the boldface statement in Listing 6 that declares a reference to and instantiates a new TreeSet object, see my earlier lesson entitled Generics in J2SE 5.0, Getting Started.)

The encode method

The encode method begins in Listing 7.

  ArrayList<Byte> encode(
              String rawData,
              Hashtable <Character,String>huffEncodeTable){
    //Save the incoming parameters.
    this.rawData = rawData;
    this.huffEncodeTable = huffEncodeTable;

Listing 7

The encode method encodes an incoming String message using the Huffman encoding algorithm.  The method also receives a reference to an empty data structure of type Hashtable.  This data structures is populated with encoding particulars.  These encoding particulars are used to encode the message.  They are also required later by the decode method to decode and transform the encoded message back into the original String version.

In order to keep this method simple, pad characters may be appended onto the end of the original message when it is encoded.  This is done to cause the number of bits in the encoded message to be a multiple of eight, thus causing the length of the encoded message to be an integral number of bytes.  Additional code would be required to avoid this at this point.  However, it is easy to eliminate the extraneous characters during decoding if the length of the original message is known.

The code in Listing 7 saves the incoming parameters in a pair of local variables.

Display original message as bits

Listing 8 shows the first of several opportunities throughout this program to remove comment indicators and cause information of interest be displayed.

/*
    System.out.println("nRaw Data as Bits");
    displayRawDataAsBits();
*/

Listing 8

By removing the comment indicators to enable the two statements shown in Listing 8, you can display the original message as a stream of bits.  This can be visually compared with a similar display for the encoded message later to illustrate the amount of compression provided by the encoding process.

Display the message as a stream of bits

The bottom portion of Figure 2 shows the result of enabling these two statements and running the program using the test message shown at the top of Figure 2.  (This test message is much shorter than the test message from Figure 1.)

Raw Data
Now is the time for all good men to come to the
aid of their country.

Number raw data bits: 552

Raw Data as Bits

010011100110111101110111001000000110100101110011
001000000111010001101000011001010010000001110100
011010010110110101100101001000000110011001101111
011100100010000001100001011011000110110000100000
011001110110111101101111011001000010000001101101
011001010110111000100000011101000110111100100000
011000110110111101101101011001010010000001110100
011011110010000001110100011010000110010100100000
011000010110100101100100001000000110111101100110
001000000111010001101000011001010110100101110010
001000000110001101101111011101010110111001110100
011100100111100100101110
Figure 2

The method named displayRawDataAsBits

The utility method named displayRawDataAsBits, which is invoked in Listing 8, is relatively straightforward and shouldn't require an explanation.  You can view the method in its entirety in Listing 45.

Create a frequency chart

Listing 9 invokes the method named createFreqData to create a frequency chart that identifies each of the individual characters in the original message and the number of times (frequency) that each character appears in the message.

    createFreqData();
    //For illustration purposes only, enable the following
    // statement to display the contents of the frequency
    // chart created above.
/*
    displayFreqData();
*/

Listing 9

Listing 9 also optionally invokes the method named displayFreqData to display the results.

The createFreqData method

Listing 10 shows the createFreqData in its entirety.

  void createFreqData(){
    for(int cnt = 0;cnt < rawData.length();cnt++){
      char key = rawData.charAt(cnt);
      if(frequencyData.containsKey(key)){
        int value = frequencyData.get(key);
        value += 1;
        frequencyData.put(key,value);
      }else{
        frequencyData.put(key,1);
      }//end else
    }//end for loop
  }//end createFreqData

Listing 10

The createFreqData method creates the frequency chart described above.  The results are stored in a Hashtable with the characters being the keys and the usage frequency values of each character being the corresponding Hashtable values for those key.

The code in Listing 10 is relatively straightforward and shouldn't require a detailed explanation.

The displayFreqData method

The method named displayFreqData that is called in Listing 9 can be viewed in its entirety in Listing 45.  It is too simple to require an explanation.

The frequency data

The bottom portion of Figure 3 shows the frequency data for each of the characters contained in the test message shown at the top of Figure 3.

Raw Data
Now is the time for all good men to come to the
aid of their country.

Number raw data bits: 552

Frequency Data
. 1
  15
N 1
y 1
w 1
u 1
t 7
s 1
r 3
o 9
n 2
m 3
l 2
i 4
h 3
g 1
f 2
e 6
d 2
c 2
a 2
Figure 3

And the results are...

As you can see, for this message, the space character occurred most frequently for a total of 15 occurrences.  The character 'o' occurred next most frequently with nine occurrences followed by the 't' with seven occurrences.  The period character and the characters 'N', 'y', 'w', 'u', 's', and 'q' tied for last place with only one occurrence each.

Three special classes

At this point, I need to explain the following three special classes in order for everything that follows to make sense:

  • HuffTree
  • HuffLeaf
  • HuffNode

The HuffTree class

The HuffTree class is the abstract superclass of the other two classes in the above list.  Objects of the HuffNode and HuffLeaf classes are used to construct the binary tree mentioned earlier.

The class definition for the HuffTree class begins in Listing 11.  Note that this class implements the Comparable interface.

abstract class HuffTree implements Comparable{

  int frequency;
  
  public int getFrequency(){
    return frequency;
  }//end getFrequency

Listing 11

The Comparable interface

In case you are unfamiliar with the use and purposes of the Comparable interface, see Part 1 and Part2 my earlier lesson on that topic.  Pay particular attention to the discussion of the requirement for objects that will be stored in a TreeSet object to implement the interface and to define the compareTo method.

The frequency property

Listing 11 declares the property variable named frequency.

Listing 11 also provides the property method named getFrequency, which when called on an object of the class, will return the value of the frequency property.

(In the event that you are unfamiliar with properties in Java, see the lessons on properties in the References section.)

The compareTo method

The method named compareTo is declared by the Comparable interface and therefore must be defined by the HuffTree class.  Here is some of what Sun has to say about the method:

"Compares this object with the specified object for order. Returns a negative integer, zero, or a positive integer as this object is less than, equal to, or greater than the specified object."

Listing 12 shows the definition of the compareTo method in the HuffTree class.  Along with the property named frequency, the compareTo method is inherited by both HuffNode and HuffLeafListing 12 also signals the end of the HuffTree class definition.

  public int compareTo(Object obj){
    HuffTree theTree = (HuffTree)obj;
    if (frequency == theTree.frequency){
      //The objects are in a tie based on the frequency
      // value.  Return a tiebreaker value based on the
      // relative hashCode values of the two objects.
      return (hashCode() - theTree.hashCode());
    }else{
      //Return negative or positive as this frequency is
      // less than or greater than the frequency value of
      // the object referred to by the parameter.
      return frequency - theTree.frequency;
    }//end else
  }//end compareTo

}//end HuffTree class

Listing 12

Purpose and description of the compareTo method

The purpose of the compareTo method in this program is to make it possible for a collection object of the TreeSet class to compare two objects of the HuffTree class (HuffNode or HuffLeaf objects) to determine which is greater when sorting the objects into ascending order.

The compareTo method compares the object on which it is invoked to another object whose reference is received as an incoming parameter.  The method guarantees that sorting processes that depend on this method, such as TreeSet objects, will sort the objects into a definitive order.

If the frequency property values of the two objects are different, the sort is based on the frequency values.

If the frequency values of the two objects are equal, the objects are sorted based on their relative hashCode values.  Thus, if the same two objects with the same frequency value are compared two or more times during the execution of the program, those two objects will always be sorted into the same order.  There is no chance of an ambiguous tie as to which object should be first except for the case where an object is compared to itself using two references to the same object.

The HuffNode class

The inner class named HuffNode is used to construct a node object in the Huffman tree.  The class definition is shown in its entirety in Listing 13.

  class HuffNode extends HuffTree{
  
    private HuffTree left;
    private HuffTree right;
  
    //HuffNode constructor
    public HuffNode(
               int frequency,HuffTree left,HuffTree right){
      this.frequency = frequency;
      this.left = left;
      this.right = right;
    }//end HuffNode constructor
  
    public HuffTree getLeft(){
      return left;
    }//end getLeft
  
    public HuffTree getRight(){
      return right;
    }//end getRight
  
  }//end HuffNode class

Listing 13

HuffNode extends HuffTree

As you should expect from the previous discussion, the HuffNode class extends the HuffTree class.  The class declares two instance variables named left and right.  The purpose of these instance variables can best be explained by referring to the picture of the Huffman binary tree on the Binary Essence web site.

The HuffNode class is used to create internal node objects in the Huffman tree. (See the node labeled 38 on the Binary Essence web site for example.)  Except for the root node, each such internal node is either a left or right child of a parent node.  Also, each such node holds references to two child objects, which may be either HuffNode objects or HuffLeaf objects.  The references to those two child objects are stored in the instance variables named left and right in Listing 13.

(The two instance variables named left and right can hold references to HuffNode objects or HuffLeaf objects because they are declared as type HuffTree.   HuffTree is the abstract superclass of both HuffNode and HuffLeaf.)

Beyond the explanation given above, the HuffNode code shown in Listing 13 is pretty straightforward and shouldn't require further explanation.

The HuffLeaf class

The inner class named HuffLeaf is used to construct a leaf object in the Huffman tree.  The class definition for the HuffLeaf class is shown in its entirety in Listing 14.

  class HuffLeaf extends HuffTree{
    
    private int value;
    
    //HuffLeaf constructor
    public HuffLeaf(int value, int frequency){
      this.value = value;
      //Note that frequency is inherited from HuffTree
      this.frequency = frequency;
    }//end HuffLeaf constructor
    
    public int getValue(){
      return value;
    }//end getValue
  
  }//End HuffLeaf class

Listing 14

Once again, as you should expect from the previous discussion, the HuffLeaf class extends the HuffTree class.

Two instance variables

The class declares a single instance variable named value and inherits another instance variable named frequency from the HuffTree class.  The purpose of these instance variables can once again be explained by referring to the picture of the Huffman binary tree on the Binary Essence web site.

The HuffLeaf class is used to create leaf objects in the Huffman tree. (See the leaf nodes labeled A, B, C, D, and E on the Binary Essence web site for example.)  Each such leaf node is either a left or right child of a parent node of type HuffNode.

(Leaf node B is the left child of the parent node labeled 22 and leaf node C is the right child of the parent node labeled 22.) 

Leaf nodes do not hold references to child objects.  Rather, they hold two values.  One value is the representation of a character or symbol that appears in the message being compressed.  The other value is the frequency, or the number of times that the character appears in the message being compressed.  These two values are stored in the instance variables named frequency and value belonging to an object of the HuffLeaf class.

HuffNode objects also contain a value

I didn't mention it in the earlier discussion of the HuffNode objects because it didn't seem to be germane at that point.  However, in addition to the two references to child objects that are held by HuffNode objects, HuffNode objects also hold a frequency value.  In the case of a HuffNode object, the frequency value is the sum of the frequency values of its two child objects.

For example, in the picture of the Huffman binary tree on the Binary Essence web site, the internal node labeled 22 holds a frequency value of 22, which is the sum of the frequency values of the child nodes labeled B and C.  The frequency value held by child node B is 12, and the frequency value held by child node C is 10.  The sum of these two frequency values is 22, the value held by their parent node.

The character values held by those two child nodes are respectively B and C.

Now that we understand the HuffTree, HuffNode, and HuffLeaf classes, I can resume the discussion of the encode method of the HuffmanEncoder class.

Create a set of HuffLeaf objects

Returning now to the discussion of the encode method, Listing 15 invokes the createLeaves method to create a HuffLeaf object for every character identified in the frequency chart that was created in Listing 9.

    createLeaves();

Listing 15

The createLeaves method

The createLeaves method is shown in its entirety in Listing 16.

  void createLeaves(){
    Enumeration <Character>enumerator = 
                                      frequencyData.keys();
    while(enumerator.hasMoreElements()){
      Character nextKey = enumerator.nextElement();
      theTree.add(new HuffLeaf(
                      nextKey,frequencyData.get(nextKey)));
    }//end while
  }//end createLeaves

Listing 16

As mentioned above, the createLeaves method creates a HuffLeaf object for every character identified in the frequency chart.  The HuffLeaf objects are stored in a TreeSet object.  Each HuffLeaf object encapsulates the character as well as the number of times that the character appears in the original message that is being compressed.

(In case you are unfamiliar with the Enumeration interface, see my earlier lesson entitled Vectors, Hashtables, and Enumerations.)

Create the Huffman tree

Listing 17 invokes the createHuffTree method to assemble the collection of HuffLeaf objects stored in the TreeSet object into a Huffman tree (a HuffTree object), also stored in the same TreeSet object.

    createHuffTree();

Listing 17

A Huffman tree is a special form of a binary tree consisting of properly linked HuffNode objects and HuffLeaf objects.

When the createHuffTree method in Listing 17 returns, the HuffTree object remains as the only object stored in the TreeSet object that previously contained all of the HuffLeaf objects.  This is because all of the HuffLeaf objects have been combined with HuffNode objects to form the single HuffTree object.

The createHuffTree method

The createHuffTree method begins in Listing 18.

  void createHuffTree(){
    
    //Enable the following statements to see the original
    // contents of the TreeSet object. Do this only for
    // small trees because it generates lots of output.
/*     
    System.out.println("nnDisplay Original TreeSet");
    Iterator <HuffTree> originalIter = theTree.iterator();
    while(originalIter.hasNext()){
      System.out.println(
                      "nHuffNode, HuffLeaf, or HuffTree");
      displayHuffTree(originalIter.next(),0);
    }//end while loop
    //End code to display the TreeSet
*/

Listing 18

When enabled, the code in Listing 18 displays the contents of the TreeSet object before the effort is begun to combine the HuffLeaf objects with HuffNode objects to create the HuffTree object.

(Note the caution in Listing 18 against enabling this display code for large trees, meaning messages that contain lots of different characters.)

A short example

Figure 4 shows the output produced by the code in Listing 18 (and the code from some earlier listings) for the short message shown at the top of Figure 4.

Raw Data
AAAAABBBBCCCDDE

Number raw data bits: 120

Frequency Data
A 5
E 1
D 2
C 3
B 4


Display Original TreeSet

HuffNode, HuffLeaf, or HuffTree
Leaf:E
Back
HuffNode, HuffLeaf, or HuffTree
Leaf:D
Back
HuffNode, HuffLeaf, or HuffTree
Leaf:C
Back
HuffNode, HuffLeaf, or HuffTree
Leaf:B
Back
HuffNode, HuffLeaf, or HuffTree
Leaf:A
Back
Figure 4

Five HuffLeaf objects

The frequency chart for this message appears near the top of Figure 4.  From this frequency chart (and also from the text of the test message at the top), we can see that the TreeSet object should contain five HuffLeaf objects encapsulating the character values A, B, C, D, and E.

The bottom portion of Figure 4 shows the five HuffLeaf objects.

(I manually colored the individual elements with alternating colors of blue and red to make them visually distinguishable.  I will have more to say about the display format later.)

At this point, it important to note that the five HuffLeaf objects have been sorted into ascending order based on their frequency values as shown in the frequency chart near the top of Figure 4.  I will be referring back to that fact later.

The displayHuffTree method

Listing 18 invokes the displayHuffTree method to produce the output shown in the bottom portion of Figure 4.  I will be calling that method again later in a more significant way (insofar as the output format is concerned) so I will defer any further discussion until then.

Create the Huffman tree

The code that creates the Huffman tree begins in Listing 19.  Before getting into the details of the code, here is a general description of the behavior of the code that creates the Huffman tree

Overall, the code assembles the collection of HuffLeaf objects into a HuffTree object.  A HuffTree object is a special form of a binary tree consisting of properly linked HuffNode objects and HuffLeaf objects.

When the operation has been completed, the HuffTree object remains as the only object stored in the TreeSet object that previously contained all of the HuffLeaf objects.  This is because, at that point in the execution of the code, all of the HuffLeaf objects have been removed from the TreeSet object and combined with HuffNode objects to form the Huffman tree (as represented by the single HuffTree object).

Displays at runtime

The createHuffTree method contains two sections of code that can be enabled to display:

  1. The contents of the original TreeSet object.
  2. The contents of the TreeSet object for each iteration during which HuffLeaf objects are being combined with HuffNode objects to form the final HuffTree object.

You have already seen an example of the first display in Figure 4.  You will see an example of the second display later in Figure 5.

These displays are very useful for understanding how the Huffman tree is actually constructed.

Steps in constructing the HuffTree object

The HuffTree object is constructed by performing the following operations:

  1. Extracting pairs of HuffLeaf and/or HuffNode objects from the TreeSet object in ascending order based on their frequency values.
  2. Using the pair of extracted objects to construct a new HuffNode object where the two extracted objects become children of the new HuffNode object, and where the frequency value stored in the new HuffNode object is the sum of the frequency values in the two child objects.
  3. Removing the two original HuffLeaf and/or HuffNode objects from the TreeSet object and adding the new HuffNode object to the TreeSet object.  The position of the new HuffNode object in the sorted TreeSet object is determined by its frequency value relative to the other HuffNode and/or HuffLeaf objects in the collection.  The new HuffNode object will eventually become a child of another new HuffNode object unless it ends up as the root of the HuffTree object.
  4. Continuing this process until the TreeSet object contains a single object of type HuffTree.

Iterate on the TreeSet object

Listing 19 shows the beginning of a while loop that iterates on the TreeSet object until the number of elements contained in the object is equal to 1.  When the number of elements in the TreeSet object has been reduced to 1, all of the HuffNode and HuffLeaf elements in the collection will have been combined into a single element of type HuffTree.

    while(theTree.size() > 1){
      //Get, save, and remove the first two elements.
      HuffTree left = theTree.first();
      theTree.remove(left);
      HuffTree right = theTree.first();
      theTree.remove(right);

Listing 19

As we saw in Figure 4, the collection of HuffLeaf objects in the TreeSet object is initially sorted into ascending order based on the frequency values encapsulated in the objects.

The code in Listing 19 removes and saves the first two objects in the collection, which are the objects containing the lowest frequency values.

(Initially, these two objects are both HuffLeaf objects, but later they may be either HuffLeaf objects or HuffNode objects or both.)

Create and save a HuffNode object

Listing 20 combines the two saved objects into a new HuffNode object (forming a three-object sub tree) and adds the sub tree to the TreeSet object.

      HuffNode tempNode = new HuffNode(left.getFrequency() 
                        + right.getFrequency(),left,right);
      theTree.add(tempNode);

Listing 20

Thus, each pass through the while loop removes the first two elements from the TreeSet collection, uses those two elements to create a sub tree, and adds the sub tree back to the collection.  This process continues until the sub tree is the final tree, at which point the number of elements in the collection has been reduced to one and the while loop terminates.

Display the contents of the TreeSet object

Still in the while loop, Listing 21 contains code that can be enabled to display the contents of the TreeSet object at the end of each iteration.

      //Enable the following statements to see the HuffTree
      // being created from HuffNode and HuffLeaf objects.
      // Do this only for small trees because it will
      // generate a lot of output.
/*      
      System.out.println("nnDisplay Working TreeSet");
      Iterator <HuffTree> workingIter = theTree.iterator();
      while(workingIter.hasNext()){
        System.out.println(
                      "nHuffNode, HuffLeaf, or HuffTree");
        displayHuffTree(workingIter.next(),0);
      }//end while loop
      //End code to display the TreeSet
*/    
    }//end while
  }//end createHuffTree

Listing 21

Listing 21 also signals the end of the while loop and the end of the createHuffTree method.

The program output

The bottom portion of Figure 5 shows the output produced during the first iteration of the while loop by enabling the display code in Listing 21.

Raw Data
AAAAABBBBCCCDDE

Number raw data bits: 120


Display Working TreeSet

HuffNode, HuffLeaf, or HuffTree
Leaf:C
Back
HuffNode, HuffLeaf, or HuffTree
Left to 1 Leaf:E
Back Right to 1 Leaf:D
Back Back
HuffNode, HuffLeaf, or HuffTree
Leaf:B
Back
HuffNode, HuffLeaf, or HuffTree
Leaf:A
Back
Figure 5

At the end of the first iteration...

At the end of the first iteration of the while loop, the collection contains four different elements.  (Recall from Figure 4 that the collection originally contained five elements.)

I colored the four elements in Figure 5 with alternating colors of blue and red to make them visually distinguishable.

In Figure 4, the first two elements in the collection were the HuffLeaf objects encapsulating the characters E and D.  In Figure 5, those two elements have been removed and combined with a HuffNode object to form a sub tree, which is shown as the first red element in Figure 5.

The sub tree

Figure 6 is an attempt to show a picture of the sub tree, which is the second (red) element in the collection shown in Figure 5(Hopefully this picture will survive the various publishing programs to which this lesson will be subjected)


           Root  
Level 0     O  
           /   
Level 1   E   D  
      
Figure 6

The sub tree consists of three parts.  The root of the sub tree is a HuffNode object (shown by the O in Figure 6).  The left child of the node is a HuffLeaf object (shown by the E in Figure 6).  The right child of the node is also a HuffLeaf object (shown by the D in Figure 6).

The format of the printed description

Conceptually, we can think of the root node as existing at Level 0 in the tree and the two child nodes existing at Level 1 in the tree.  That brings us to the format used to describe the elements in Figure 5.

Consider the first line in the description of the sub tree in Figure 5, which reads HuffNode, HuffLeaf, or HuffTree.  This line simply serves as a separator between the elements.

Beginning with the second line in the description of the sub tree, the text is a verbal description of how to start at the root and to traverse the tree.  This text describes the sub tree by the traversal path.  It says:

  • Starting at the root (Level 0 implied), go down and to the left to Level 1.  There you will find a Leaf that encapsulates the character E.
  • Then go back up one level.  This will return you to Level 0.  Go down and to the right to Level 1.  There you will find a Leaf that encapsulates the character D.
  • Then go back up two levels, which will pop you out of the top of the tree.

Adding the sub tree to the TreeSet collection

After the E and D leaves were removed from the collection and combined with a node to form a sub tree, that sub tree was added back to the collection as shown in Listing 20.

The node that forms the root of the sub tree was assigned the frequency value 3, which is the sum of the frequency values for D and E as shown in the frequency chart in Figure 4.  As a result, the frequency value for the root of the sub tree has the same frequency value as the leaf for the character C.

Tied for position in the collection

This causes the new sub tree to be tied for position in the sorted collection with the leaf for the character C.  It also causes the frequency value for the sub tree to be less than the frequency value for the B leaf which has a frequency value of 4.

Given these values, the sub tree could have ended up as either the first or the second element in the collection shown in Figure 5, because it tied for the first position with the C leaf.  However, it had to be closer to the beginning than the B leaf.  As it turns out, the tie-breaker methodology used in the compareTo method in Listing 12 placed it after the C leaf in Figure 5

Output following the second iteration

Figure 7 shows the state of the TreeSet object following the completion of the second iteration of the while loop that began in Listing 19.

HuffNode, HuffLeaf, or HuffTree
Leaf:B
Back
HuffNode, HuffLeaf, or HuffTree
Leaf:A
Back
HuffNode, HuffLeaf, or HuffTree
Left to 1 Leaf:C
Back Right to 1 Left to 2 Leaf:E
Back Right to 2 Leaf:D
Back Back Back
Figure 7

At this point, the number of elements in the collection has been reduced from the four elements shown in Figure 5 to the three shown in Figure 7.  The reduction was accomplished by removing the C leaf in the first element and the sub tree from the second element in Figure 5 and using them to construct a larger sub tree.  That larger sub tree is shown as the last element in Figure 7.

(See if you can create a sketch of the new sub tree on the basis of the traversal description in Figure 7.)

The new sub tree

The pictorial representation of this larger sub tree, shown as the last element in the collection in Figure 7, is shown in Figure 8.

                Root
Level 0          O
                / 
Level 1        C   O
                  / 
Level 2          E   D
Figure 8

The Huffman tree

Skipping ahead to the last iteration, Figure 9 shows the state of the TreeSet object following the final iteration of the while loop.  At this point, the collection contains only one element, and it is the Huffman tree produced by combining all of the leaf elements shown in Figure 4 with nodes to produce the tree.

HuffNode, HuffLeaf, or HuffTree
Left to 1 Left to 2 Leaf:C
Back Right to 2 Left to 3 Leaf:E
Back Right to 3 Leaf:D
Back Back Back Right to 1 Left to 2 Leaf:B
Back Right to 2 Leaf:A
Back Back Back
Figure 9

Pictorial representation of the Huffman tree

The pictorial representation of the Huffman tree described in Figure 9 is shown in Figure 10

                     Root
Level 0               X
                     / 
                   0/   1
                   /     
                  /       
Level 1          X         X
               0/ 1     0/ 1
Level 2        C   X     B   A
                 0/ 1
Level 3          E   D
Figure 10

In Figure 10, I used the character X to indicate an internal node in place of the character O that was used to depict the internal nodes in the earlier pictures of the trees.  This is because I wanted to decorate the tree in Figure 10 with 0 and 1 characters and I wanted to avoid confusion between the character for an internal node and the zero character.

(You will see the reason for the decoration of the tree using 0 and 1 characters later.)

Hopefully you now understand how the TreeSet object containing a collection of HuffLeaf objects is processed to produce a single HuffTree object, which is the required Huffman tree.

The displayHuffTree method

The output shown in Figure 4, Figure 5, Figure 7, and Figure 9  was produced by the invocation of the displayHuffTree method in Listing 18 and Listing 21.  You can view the displayHuffTree method in its entirety in Listing 45.  This is a recursive method, which is very similar to another recursive method named createBitCodes  that I will discuss later, so I will leave it up to you to understand the inner workings of the displayHuffTree method on your own at this point.

Recap

At this point, I have developed the frequency information for the characters in the short test message at the top of Figure 4.  I displayed the frequency information in Figure 4.

I used the frequency information to create a HuffLeaf object for every character in the message where the HuffLeaf object encapsulates the character and the frequency information for that character.  I displayed the HuffLeaf objects in Figure 4 also.

Then I combined those HuffLeaf objects with HuffNode objects to create a HuffTree object.  I displayed intermediate pictorial versions of the HuffTree object in Figure 6 and Figure 8, and displayed the final version in Figure 10Figure 10 shows the Huffman binary tree for the short test message at the top of Figure 4.

If we compare the Huffman tree in Figure 10 with the frequency information in Figure 4, we see that the leaf objects representing the characters with the highest frequency of usage in the message (C, B, and A) occur closest to the root of the tree (Level 2) and those with the lowest frequency of usage (E and D) occur further from the root of the tree (Level 3).

Traversing the Huffman tree

The next task is to use the Huffman tree to create a unique binary bit code for every character used in the original test message.

(That is the reason that I decorated the tree in Figure 10 with 0 and 1 characters.)

The way that I will create the binary codes is to traverse the path from the root of the binary tree to each leaf of the tree, keeping track of the decorations that I encounter along the way.  Each time I encounter a 0, I will append a 0 to the bit code for that character.  Each time I encounter a 1, I will append 1 to the bit code for that character.

Manually generated bit codes

Figure 11 shows the result of manually performing this operation for each of the leaves in Figure 10 in left-to-right order for the leaves.

C 00
E 010
D 011
B 10
A 11
Figure 11

Now that you know how the task is accomplished, we will take a look at the code that automates this process.

Automate the task of creating bit codes

Back in the encode method, Listing 22 invokes the method named createBitCodes to automate the task of traversing the Huffman tree for the purpose of creating a unique bit code for every character described by a leaf of the tree.  A reference to the single HuffTree object stored in the TreeSet object is passed to the createBitCodes method.

    createBitCodes(theTree.first());

Listing 22

A recursive method

The createBitCodes method traverses the Huffman tree in a recursive manner to create a bit code for each character in the message.  The bit codes are different lengths with the shorter codes corresponding to the characters with a high frequency value and the longer codes corresponding to the characters with the lower frequency values.

Note that the method call extracts the reference to the Huffman tree from the TreeSet object and passes that reference to the method.  This is necessary because the createBitCodes method is recursive and can't conveniently work directly with the TreeSet object.

The createBitCodes method populates the data structure that is used to encode the message and required later to decode the encoded message.

Recursion in Java

In order to understand the method named createBitCodes, you must understand recursion.  Unfortunately, none of the several hundred Java programming tutorials that I have published in recent years, (several of which use recursion), are dedicated to an understanding of recursion in Java.

(The writing of a tutorial on recursion has been on my list of things to do for several years now.)

Fortunately, if you Google the keywords Java and recursion, you will find a variety of references, some good, and some not so good.  One that seems to be very good can be found on the web site of the City University, London.

The createBitCodes method

The createBitCodes method begins in Listing 23.  This method traverses the Huffman tree in a recursive manner to create a bit code for each character in the message.  The bit codes are different lengths with the shorter bit codes corresponding to the characters with a high usage frequency value and the longer bit codes corresponding to the characters with the lower frequency values.

This method receives a reference to the Huffman tree that was earlier contained as the only object in the TreeSet object.

A Huffman encoding table

The createBitCodes method creates a Huffman encoding table as a Hashtable object that relates the variable-length bit codes to the characters in the original message.  The bit codes are constructed as objects of type StringBuffer consisting of sequences of the characters 1 and 0 and converted to type String for storage in the Hashtable.

The traversal path

Each bit code describes the traversal path from the root of the Huffman tree to a leaf on that tree.  Each time the path turns to the left, a 0 character is appended onto the StringBuffer object and becomes a part of the resulting bit code.  Each time the path turns to the right, a 1 character is appended onto the StringBuffer object.

When a leaf is reached at the end of the traversal path, the character stored in that leaf is retrieved and put into the Hashtable object as a key.  A String representation of the StringBuffer object is used as the value for that key in the Hashtable.

The final result

At completion, the Hashtable object contains a series of keys consisting of the original characters in the message and a series of corresponding values as String objects (consisting only of 1 and 0 characters) representing the bit codes that will eventually be used to encode the original message.

The Hashtable object that is populated by this method is the data structure that is used to encode the message and is required later to decode the encoded message.

Test for the type of node

If you understand recursion in Java, you should find the createBitCodes method to be straightforward.  If not, you will probably find it to be very difficult to understand.

Listing 23 begins by testing the node to determine if it is an internal node or a leaf node.

  void createBitCodes(HuffTree tree){
    if(tree instanceof HuffNode){
      // This is a node, not a leaf.  Process it as a node.

      //Cast to type HuffNode.
      HuffNode node = (HuffNode)tree;
      // Get and save the left and right branches
      HuffTree left = node.getLeft();
      HuffTree right = node.getRight();
      
      //Append a 0 onto the StringBuffer object.  Then make
      // a recursive call to this method passing a
      // reference to the left child as a parameter.  This
      // recursive call will work its way all the way down
      // to a leaf before returning.  Then it will be time
      // to process the right path.
      code.append("0");
      createBitCodes(left);

Listing 23

If you understand recursion, the comments in Listing 23 should cause the code in Listing 23 to be self explanatory.

Process the right leg

Listing 24 picks up at the point where the recursive process has finally returned from its traversal down the left leg of the binary tree.  The code in Listing 24 processes the right leg of the binary tree.

      
      //Return to here from recursive call on left child.

      //Delete the 0 from the end of the StringBuffer
      // object to restore the contents of that object to
      // the same state that it had before appending the 0
      // and making the recursive call on the left branch.
      //Now we will make a right turn.  Append a 1 to the
      // StringBuffer object and make a recursive call to
      // this method passing a reference to the right child
      // as a parameter.  Once again, this recursive call
      // will work its way all the  way down to a leaf
      // before returning.
      code.deleteCharAt(code.length() - 1);//Delete the 0.
      code.append("1");
      createBitCodes(right);
      
      //Return to here from recursive call on right child.

      //Delete the character most recently appended to the
      // StringBuffer object and return from this recursive
      // call to the method.  The character is deleted
      // because control is being transferred back one
      // level in the recursive process and the
      // StringBuffer object must be restored to the same
      // state that it had when this recursive call was
      // made.
      code.deleteCharAt(code.length() - 1);

Listing 24

Process a leaf node

Listing 23 and Listing 24 were both concerned with the processing of internal nodes in the binary tree.  Listing 25 picks up at the point where it has been determined that the node is a leaf node instead of an internal node.  Listing 25 processes the leaf node.

    }else{
      //This is a leaf.  Process it as such.
      //Cast the object to type HuffLeaf.
      HuffLeaf leaf = (HuffLeaf)tree;
      
      //Put an entry into the Hashtable.  The Hashtable
      // key consists of the character value stored in the
      // leaf. The value in the Hashtable consists of the
      // contents of the StringBuffer object representing
      // the path from the root of the tree to the leaf.
      // This is the bitcode and is stored in the Hashtable
      // as a String consisting of only 1 and 0 characters.
      huffEncodeTable.put((char)(
                         leaf.getValue()),code.toString());
    }//end else

  }//end createBitCodes

Listing 25

Understanding the inner workings

Hopefully, you have been able to understand the inner workings of the method named createBitCodes.  If not, just take my word for it that the behavior of the method is to automate the process that I described earlier in the section entitled Traversing the Huffman tree.  Even if you don't understand exactly how the method named createBitCodes does its job, you should understand the outcome of invoking that method.

Display the bit codes

The code in Listing 26 can be enabled to display the bit codes that were created above and which are used to populate the Huffman encoding table.

    //For purposes of illustration only, enable the 
    // following two statements to display a table showing
    // the relationship between the characters in the
    // original message and the bitcodes that will replace
    // those characters to produce the Huffman-encoded
    // message.
/*
    System.out.println();
    displayBitCodes();
*/

Listing 26

The displayBitCodes method

You can view the displayBitCodes method in its entirety in Listing 45.  This method is very straightforward and should not require further explanation.

Sample bit code display

The bottom portion of Figure 12 shows the bit codes for the characters in the short message at the top of Figure 12.  Compare these bit codes with the bit codes that I produced manually in Figure 11.  Except for the display order, you should find that they match exactly.

Raw Data
AAAAABBBBCCCDDE

Number raw data bits: 120


Message Characters versus Huffman BitCodes
A 11
E 010
D 011
C 00
B 10
Figure 12

(If you investigate deeply enough, you will find that the method named createBitCodes develops the bit codes and populates the Hashtable in the same order as shown in Figure 11.  However, when an Enumeration object is used to retrieve and display the codes in the displayBitCodes method, the enumerator retrieves them in a different order.)

Bit codes for a longer message

The bottom portion of Figure 13 shows the variable-length bit codes for the somewhat longer test message shown at the top of Figure 13.

Raw Data
Now is the time for all good men to come to the
aid of their country.

Number raw data bits: 552


Message Characters versus Huffman BitCodes
. 100111
  00
N 111000
y 011101
w 110010
u 100110
t 010
s 011100
r 11101
o 101
n 10000
m 11110
l 10001
i 0110
h 11111
g 110011
f 01111
e 1101
d 111001
c 11000
a 10010
Figure 13

Whereas the bit codes in Figure 12 were all either two bits or three bits in length, the bit codes in Figure 13 range from two bits (00) at the shortest to six bits (111000 for example) at the longest.

Bit codes for a much longer message

The bottom portion of Figure 14 shows the bit codes for the characters in the even longer message shown at the top of Figure 14.

Raw Data
BAGHDAD, Iraq Violence increased across Iraq aft
er a lull following the Dec. 15 parliamentary el
ections, with at least two dozen people includin
g a U.S. soldier killed Monday in shootings and
bombings mostly targeting the Shiite-dominated s
ecurity services. The Defense Ministry director
of operations, Brig. Gen. Abdul Aziz Mohammed-Ja
ssim, blamed increased violence in the past two
days on insurgents trying to deepen the politica
l turmoil following the elections. The violence
came as three Iraqi opposition groups threatened
another wave of protests and civil disobedience
if allegations of fraud are not properly invest
igated.

Number raw data bits: 5048

Message Characters versus Huffman BitCodes
V 000100000    U 101110000    T 01010010     S 10001000
M 10001001     J 000110101    I 10001111     H 000110110
G 01010011     D 0101000      B 101110001    A 0001001
5 000100001    1 000110100    . 000111       - 00010001
, 0001100        110          z 10001110     y 000101
w 1000110      v 1011101      u 010101       t 1001
s 0100         r 0000         q 10111001     p 111100
o 1010         n 0111         m 100001       l 11111
k 000110111    i 1110         h 111101       g 101111
f 100000       e 001          d 10110        c 01011
b 1000101      a 0110
Figure 14

The bit codes for the message in Figure 14 range from three bits for the 'e' and the space character to nine bits for several characters including 'U', 'V', and 'k'.

(Note that I manually rearranged the bit codes in Figure 14 so that they would fit on a single screen for easier viewing.)

Encode message into a String of 1 and 0 characters

Back in the encode method, Listing 27 invokes the encodeToString method to encode the message into a String representation of the bits that will make up the final encoded message.

    encodeToString();

Listing 27

The encodeToString method also provides the optional capability to display the String showing the bit values that will appear in the final Huffman-encoded message.  This can be useful for comparing back against the bits in the original message for purposes of evaluating the amount of compression provided by encoding the message.

The encodeToString method

The encodeToString method is shown in its entirety in Listing 28.

  void encodeToString(){
    StringBuffer tempEncoding = new StringBuffer();
    for(int cnt = 0;cnt < rawData.length();cnt++){
      //Do a table lookup to get the substring that
      // represents the bitcode for each message character.
      // Append those substrings to the string that
      // represents the Huffman-encoded message.
      tempEncoding.append(huffEncodeTable.get(
                                     rawData.charAt(cnt)));
    }//end for loop
    
    //Convert the StringBuffer object to a String object.
    stringEncodedData = tempEncoding.toString();

    //For illustration purposes, enable the following two
    // statements to display the String showing the bit
    // values that will appear in the Huffman-encoded
    // message.  Display 48 bits to the line except for
    // the last line, which may be shorter, and which may
    // not be a multiple of 8 bits.
/*
    System.out.println("nString Encoded Data");
    display48(stringEncodedData);
*/
  }//end encodeToString

Listing 28

The encodeToString method encodes the message into a String representation of the bits that will make up the final encoded message.  The String consists of only 1 and 0 characters where each character represents the state of one of the bits in the Huffman-encoded message.  Also for illustration purposes, this method optionally displays the String showing the bit values that will appear in the Huffman-encoded message.

The encodeToString method is completely straightforward and shouldn't require further explanation.

Build an encoding bit map

Back in the encode method, Listing 29 invokes the buildEncodingBitMap method to populate a lookup table that relates eight bits represented as a String to every possible combination of eight actual bits.

    buildEncodingBitMap();

Listing 29

You can view the buildEncodingBitMap method in Listing 45.  It is so straightforward as to not warrant an explanation here.

Encode the String representation into actual bits

Listing 30 invokes the method named encodeStringToBits to encode the String representation of the bits that make up the encoded message to the actual bits that make up the encoded message.

    encodeStringToBits();

Listing 30

The encodeStringToBits method

The encodeStringToBits method is shown in its entirety in Listing 31.

  void encodeStringToBits(){
    //Extend the length of the stringEncodedData to cause
    // it to be a multiple of 8.
    int remainder = stringEncodedData.length()%8;
    for(int cnt = 0;cnt < (8 - remainder);cnt++){
      stringEncodedData += "0";
    }//end for loop
    
    //For illustration purposes only, enable the following
    // two statements to display the extended 
    // stringEncodedData in the same format as the 
    // original stringEncodedData.
/*
    System.out.println("nExtended String Encoded Data");
    display48(stringEncodedData);
*/
    //Extract the String representations of the required
    // eight bits.  Generate eight actual matching bits by
    // looking the bit combination up in a table.
    for(int cnt = 0;cnt < stringEncodedData.length();
                                                 cnt += 8){
      String strBits  = stringEncodedData.substring(
                                                cnt,cnt+8);
      byte realBits = encodingBitMap.get(strBits);
      binaryEncodedData.add(realBits);
    }//end for loop
  }//end encodeStringToBits

Listing 31

The purpose of the encodeStringToBits method is to create actual bit data that matches the 1 and 0 characters in the stringEncodedData that represents bits with the 1 and 0 characters.

Pad characters at the end

Note that this method doesn't handle the end of the data very gracefully for those cases where the number of required bits is not a multiple of 8.  The method simply adds enough "0" characters to the end to cause the length to be a multiple of 8.  This may result in extraneous characters at the end of the decoded message later.  However, it isn't difficult to remove the extraneous characters at decode time as long as the length of the original message is known.

For illustration purposes, this method may optionally display the extended version of the stringEncodedData for comparison with the non-extended version.

Output format

The binary Huffman-encoded data produced by this method is stored in a data structure of type ArrayList <Byte>.

Once you understand the methodology, the code in Listing 31 is straightforward and should not require further explanation.

Return from the encode method

Back in the encode method, the code in Listing 32 returns the binary Huffman-encoded data, terminates the encode method, and returns control to the main method where it was called in Listing 5.

    return binaryEncodedData;
  }//end encode method

Listing 32

Listing 32 also signals the end of the class definition for the class named HuffmanEncoder.

Meanwhile, back in the main method...

Listing 33 displays the number of bits in the encoded message and then computes and displays the compression factor.

    System.out.println("Number binary encoded data bits: " 
                           + binaryEncodedData.size() * 8);
    System.out.println("Compression factor: " 
      + (double)rawData.length()/binaryEncodedData.size());

Listing 33

An example of compression factor

Listing 33 produces the compression-factor output shown in the bottom portion of Figure 15 for the test message shown in the top of Figure 15.

Raw Data
BAGHDAD, Iraq Violence increased across Iraq aft
er a lull following the Dec. 15 parliamentary el
ections, with at least two dozen people includin
g a U.S. soldier killed Monday in shootings and
bombings mostly targeting the Shiite-dominated s
ecurity services. The Defense Ministry director
of operations, Brig. Gen. Abdul Aziz Mohammed-Ja
ssim, blamed increased violence in the past two
days on insurgents trying to deepen the politica
l turmoil following the elections. The violence
came as three Iraqi opposition groups threatened
another wave of protests and civil disobedience
if allegations of fraud are not properly invest
igated.

Number raw data bits: 5048
Number binary encoded data bits: 2816
Compression factor: 1.7926136363636365
Figure 15

Another example of compression factor

Similarly, the bottom portion of Figure 16 shows the compression factor achieved for the test message shown in the top of Figure 16.

Raw Data
Now is the time for all good men to come to the
aid of their country.

Number raw data bits: 552
Number binary encoded data bits: 272
Compression factor: 2.0294117647058822
Figure 16

Comparing the results in Figure 15 with the results in Figure 16, you can see that the compression results are highly dependent on the content of the message, even for the case where the message is composed entirely of English text.

Display the encoded message in hexadecimal format

The test message has now been Huffman encoded.  Continuing with the main method, Listing 34 invokes the method named hexDisplay48 to display the binaryEncodedData in hexadecimal format, 48 characters per line.

    System.out.println(
            "nBinary Encoded Data in Hexadecimal Format");
    hexDisplay48(binaryEncodedData);
    System.out.println();

Listing 34

The method named hexDisplay48

The method named hexDisplay48 can be viewed in its entirety in Listing 45.  The code in this method is completely straightforward and shouldn't require further explanation.

Sample hexadecimal display of Huffman-encoded data

The bottom portion of Figure 17 shows a hexadecimal display of the binary Huffman-encoded version of the message shown at the top of Figure 17

Raw Data
Now is the time for all good men to come to the
aid of their country.

Number raw data bits: 552
Number binary encoded data bits: 272
Compression factor: 2.0294117647058822

Binary Encoded Data in Hexadecimal Format
e2e43382^733efd734LZ735^49462676f27b602a62fb4545fd24dc
95e2feb74c59a0baece0
Figure 17

Comparing hexadecimal and raw data displays

When comparing the hexadecimal display with the display of the raw data, remember that each raw-data character represents eight bits while each hexadecimal character represents only four bits.  Therefore, even though the hexadecimal display appears to be about as long as the raw-data display, the number of bits required to represent the Huffman-encoded version of the message in Figure 17 is only 49.2 percent of the number of bits required to represent the raw unencoded version.  This results in a compression factor of 2.029 for this particular message.

Having encoded the message, the time has come to decode it to produce an exact (lossless) replica of the original message.





Page 1 of 3



Comment and Contribute

 


(Maximum characters: 1200). You have characters left.

 

 


Sitemap | Contact Us

Rocket Fuel