22C:030/115 Computer Science III

Fall 2001

Program 3: Huffman Codes

DUE DATE:  12/3/01

This program requires you to write a program to decode strings encoded with Huffman codes.  A Huffman code represents characters in variable length bit strings.  Frequently occurring characters are represented with short bit strings; characters that occur rarely are represented with long bit strings.  You are required to develop a program with appropriate class definitions that converts  a coded string into it's uncoded representation.

The input to your program will be two files.  One file containing the code and one file containing a coded string to be decoded.

The assignment will give you experience in the use of binary trees, exposure to data compression techniques,  and experience in solving a problem from a specification.
 



 
 
 
What to submit:
You should submit a directory that includes:
  1. README: A file giving a overview of the contents of the directory including:
  2. Class definition and implementation files.
  3. A program that decodes a Huffman coded bit string.
  4. Printed copies of the README,class definition and implementation files, and the program to be submitted in lecture on the due date.

Huffman Coding

A Huffman code maps characters into bit sequences.  Codes are used to generate compact binary representations of character strings. Each character in the alphabet of allowable charcters is represented by a specific bit sequence. To create compact encodings, the Huffman coding scheme uses variable length encodings. Characters that occur frequently are coded with short bit sequences; characters that occur in infrequently are coded with long bit sequences. For example, consider the following table specifying a Huffman code for the alphabet A,T,C,G
 
Character Bit Code
A 0
T 100
C 101
G 11

Using this code, the string ATACG would be represented by the coded bit string 0100010111, where each character is replaced by its bit code.

In this assignment you are to write a program that decodes Huffman coded bit strings.  Your program should take as input a file that specifies a Huffman code and a file that contains a bit string encoded with the given Huffman code.  You can assume the file containing the Huffman  contains one line for each character in the alphabet.  Each line will be begin with the character followed by a single space and then the character's bit code.  The file containing the coded string will consist of one line containing a sequence of 0's and 1's. Example files are available in the directory /group/class/c030/data.

To solve the decoding problem, you must first create a Huffman tree from the table of Huffman codes.  A Huffman tree is an extended binary tree in which the leaf nodes contain the characters of the alphabet.  The tree is structured such that the path to a leaf node is  determined by the bit code, where 0 is interpreted as left and 1 is interpreted as right.  Thus, the code above defines the following Huffman tree:

Note that interior nodes contain no data.  You must devise a method to create a Huffman tree from a Huffman table. You should not assume that the aphabet contains any particular symbols or is of any particular size.

The second part of your assignment is to decode a bit string. To do this you should read bits from the bits sequence and descend the tree in the appropriate direction until a leaf node is found.  Each successive character visited should be printed to an output file.  This process must be repeated until the bit sequence is exhausted.

Huffman codes are optimal in that sense that they produce minimal length encodings given the frequency with which characters occur.  A more detailed explanation of Huffman Coding with illustrations is available here along with descriptions of some other coding methods.

Input/Output

Your program should take two files as input: a table of Huffman codes and a coded string.  Your program should produce as output a file containing the decoded string.   You can fix the names of the input and output files in your program.  For consistency, use the following file names: If you are more ambitious, you can organize your program to read the input file names from the Unix command line.

Additional Assistance

The text contains an appendix on file input and output. See page 726. A simple program demonstrating file input/output is provided here (courtesy of Nate Brixius).