College of Engineering Aeronautics and Astronautics Agricultural and Biological Engineering Biomedical Engineering Chemical Engineering Civil Engineering Construction Engineering and Management Electrical and Computer Engineering Engineering Education Engineering Professional Education Environmental and Ecological Engineering Industrial Engineering Materials Engineering Mechanical Engineering Nuclear Engineering

EPICS (Engineering Projects In Community Service) First-Year Engineering Program First-Year Engineering Honors Program Global Engineering Program Minority Engineering Program Professional Practice (Co-Op) Program Women in Engineering Program

College Administration Schools Programs All Groups All People ECN Webmail


	College of Engineering \| ECN
Home Tools Demos

Purdue > Engineering > Object-Oriented Software Development > F2012 > IPA

ECE264 IPA 2-3

Huffman Coding: Decompressing a Huffman encoded file

Due December 6, 2012 @ 6:00pm

The description is mainly taken from Professor Vijay Raghunathan. In this assignment, you will utilize your knowledge about priority queues, stacks, and trees to design a file compression program and file decompression program (similar to zip and unzip). You will base your utilities on the widely used algorithmic technique of Huffman coding, which is used in JPEG compression as well as in MP3 audio compression.

ASCII coding

Many programming languages use ASCII (which stands for American Standard Code for Information Interchange) coding to represent characters. In ASCII coding, every character is encoded (represented) with the same number of bits (8-bits) per character. Since there are 256 different values that can be represented with 8-bits, there are potentially 256 different characters in the ASCII character set, as shown in the ASCII character table available at http://www.asciitable.com/.

Let us now look at a simple example of ASCII encoding of characters. Using ASCII encoding (8 bits per character) the 13-character string "go go gophers" requires 13 * 8 = 104 bits. The table below shows how the coding works.

Character	ASCII code	8-bit binary value
Space	32	00100000
e	101	01100101
g	103	01100111
h	104	01101000
o	111	01101111
p	112	01110000
r	114	01110010
s	115	01110011

The given string would be written as the following stream of bits (the spaces would not be written, just the 0's and 1's)

01100111 01101111 00100000 01100111 01101111 00100000 01100111 01101111 01110000 01101000 01100101 01110010 01110011

From ASCII coding towards Huffman coding

Next, let us see how we might use fewer bits using a simpler coding scheme. Since there are only 8 different characters in "go go gophers", it is possible to use only 3-bits to encode the 8 different characters. We might, for example, use the coding shown in the table below (keep in mind that other 3-bit encodings are also possible).

Character	Code Value	3-bit binary value
g	0	000
o	1	001
p	2	010
h	3	011
e	4	100
r	5	101
s	6	110
Space	7	111

Now the string "go go gophers" would be encoded as: 000 001 111 000 001 111 000 001 010 011 100 101 110. As you can see, by using three bits per character instead of eight bits per character that ASCII uses, the string "go go gophers" uses a total of 39 bits instead of 104 bits.

However, even in this improved coding scheme, we used the same number of bits to represent each character, irrespective of how often the character appears in our string. Even more bits can be saved if we use fewer than three bits to encode characters like g, o, and space that occur frequently and more than three bits to encode characters like e, h, p, r, and s that occur less frequently in "go go gophers". This is the basic idea behind Huffman coding: to use fewer bits for characters that occur more frequently. We'll see how this is done using a tree data structure that stores the characters as its leaf nodes, and whose root-to-leaf paths provide the bit sequence used to encode the characters.

Towards a Coding Tree

Using a binary tree for coding, all characters are stored at the leaves of a tree. A left-edge is numbered 0 and a right-edge is numbered 1. The code for any character/leaf node is obtained by following the root-to-leaf path and concatenating the 0's and 1's. The specific structure of the tree determines the coding of any leaf node using the 0/1 edge convention described. As an example, the tree below on the right yields the coding shown on the left.

Character	Binary code
' '	101
'e'	1100
'g'	00
'h'	1101
'o'	01
'p'	1110
'r'	1111
's'	100

Using this coding, "go go gophers" is encoded (again, spaces would not appear in the bit-stream) as: 00 01 101 00 01 101 00 01 1110 1101 1100 1111 100. This is a total of 37 bits, two bits fewer than the improved encoding in which each of the 8 characters has a 3-bit encoding! The bits are saved by coding frequently occurring characters like 'g' and 'o' with fewer bits (here two bits) than characters that occur less frequently like 'p', 'h', 'e', and 'r'.

To decode a given stream that has been coded by the given tree, start at the root of the tree, and follow a left-branch if the next bit in the stream is a 0, and a right branch if the next bit in the stream is a 1. When you reach a leaf, write the character stored at the leaf, and start again at the top of the tree. The bit stream 10011101101110011111100 yields right-left-left to the letter 's', followed (starting again at the root) with right-right-right-left to the letter 'p', followed by right-right-left-right to the letter 'h'. Continuing thus yields a decoded string "sphere."

Prefix codes

When all characters are stored in leaves, and every interior (non-leaf) node has two children, the coding induced by the 0/1 convention outlined above satisfies a very important property called the prefix property which states that no bit-sequence encoding of a character is the prefix of the bit-sequence encoding of any other character. This makes it possible to decode a bitstream using the coding tree by following root-to-leaf paths. The tree shown above for "go go gophers" satisfies this prefix property and is an optimal tree. There are other trees that use 37 bits; for example you can simply swap any sibling nodes and get a different encoding that uses the same number of bits. Next, we look at an algorithm for constructing such an optimal tree. This algorithm is called Huffman coding, and was invented by David A. Huffman in 1952 when he was a Ph.D. student at MIT.

We now show another tree to compress the string "streets are stone stars are not" optimally. To encode "streets are" we would have the following bits: 1110001111011000111101010011110.

Another Huffman Tree/Table Example

Character	Binary value
' '	101
'a'	010
'e'	110
'n'	1000
'o'	1001
'r'	011
's'	111
't'	00

It is important to note that you cannot use the tree built for the string "go go gophers" to decode the bitstreams obtained from the encoding of "streets are stone stars are not" as the encoding is performed using a different tree.

Implementing/Programming Huffman Coding

In this section we'll see the basic programming steps in implementing Huffman coding. There are two parts to an implementation: a compression program and a decompression program. We'll assume these are separate programs, but they actually have many common functions.

Header Information

You must store some initial information in the compressed file that will be used by the decompression/unhuffing program. Basically, you must store the tree that is used to compress the original file. This is because the decompression program needs this exact same tree in order to decode the data. The header information contains:

The topology of the Huffman coding tree. To store the tree at the beginning of the file, we use a post-order traversal, writing each node visited. When you encounter a leaf node, you write a 1 followed by the ASCII character of the leaf node. When you encounter a non-leaf node, you write a 0. To indicate the end of the Huffman coding tree, we write another 0.

The total number of characters in the input file, followed by a newline character.

Consider the string "go go gophers", the header information is "1g1o01s1 01e1h01p1r0000013\n", where "\n" is the newline character. The post-order traversal of the Huffman coding tree gives us "1g1o01s1 01e1h01p1r0000". Another "0" separates the topology from "13", which is the number of characters in the input file.

For the string "streets are stone stars are not", the header information is "1t1a1r001n1o01 01e1s000031\n".

The header information can be made more economical if we use bits 0 and 1 to distinguish between non-leaf and leaf nodes, and also to indicate the end of a topology. In these two examples, there will be a total of 10 bytes (8 bytes for the leaf nodes and 2 bytes for all the 0's and 1's). The challenge here is that both the compression and decompression programs would have to handle bits instead of characters. For example, in the bit-based approach, the first 16 bits (or the first 2 bytes) of the header information for encoding the string "streets are stone stars are not" are 10111010 01011000 (note that the space in the bit stream is introduced for better clarity). The ASCII coding for 't' straddles the two bytes, 7 bits in the first byte and 1 bit in the second byte. The second most significant bit in the second byte is a 1, indicating that the next 8 bits is an ASCII character, of which the the most significant 6 bits of the character 'a' is contained in the least significant 6 bits of the second byte.

In the bit-based representation of the Huffman coding tree, the last byte may not contain 8 bits. In this case, we again pad it with 0 bits. Consider the case where the input file uses nine distinct ASCII characters. The number of bits required to represent the Huffman coding tree is 9×8 + 9×2 = 90 bits, which can represented by 12 bytes. In other words, the last byte should contain only two useful bits. The 12 bytes are followed by "31\n", the number of characters in the string and the newline character.

The Decompression or Dehuffing Program

Given a compressed file, which contains the header information, followed by a bit stream corresponding to the encoding of the original file, the decompression program should perform the following tasks:

1. Build a Huffman coding tree based on the header information.

2. Read the number of characters stored in the compressed file from the header information.

3. Read bit-by-bit from the compressed file (starting at the location after the header information). Use the read bit to traverse the Huffman coding tree (0 for left and 1 for right), starting from the root node. When the program reaches a leaf node, print the corresponding ASCII character to the output file. Starts from the root node of the Huffman coding tree when the next bit is read. The program terminates when the number of characters decoded matches the number of characters stored in the header information.

To construct a Huffman coding tree from the header information, we make use of a stack. When the bit 1 is read, we read the next byte and push the corresponding ASCII character onto the stack. When the bit 0 is read, if the stack contains only one element, we have constructed the entire Huffman coding tree. Otherwise, there must be more than one element in the stack. We create a new node, and pop the top two elements off the stack. We pop the first element from the stack to be the right child of the new node, and the second element of the stack to be the left child of the new node. After that, we push the newly created node onto the stack.

A challenge is in the reading of a bit from the compressed file. Again, reading from a file occurs at the byte-level. In the decompression program, you have to read an 8-bit ASCII character that straddle two bytes in the compressed file when you want to construct a Huffman coding tree. You also have to read one bit at a time to determine whether you are dealing with a leaf node or a non-leaf node. Of course, you also have to read one bit at a time when you want to decode the compressed file.

IPA 2-3 (5 points) November 29, 2012 @ 6:00 pm

In this assignment, you will write the program that takes in a compressed file (the first argument) containing bit-based header information, decompresses it and prints the results into an output file (the second argument). You can download the testcases and solutions(expected outputs) here.

Requirements

The program takes two input arguments, the first is the compressed file, and the second is the decompressed output file. If there's not enough file names provided or the files cannot be opened, the program terminates with return value EXIT_FAILURE. This constant is defined in stdlib.h and your program needs to include that header file.

You must include a Makefile in your submission. Your program will be compiled and built using your Makefile. The name of the executable file must be 'ipa2_3' (case-sensitive, no extension, not ipa2_3.exe).

We will not provide the interactive computer grader in this IPA. Please test the program before submission.

Your code needs to follow the coding standards specified here. We will manually check this and you can lose points if standards are not followed. You can find examples of how to follow the coding standards here.

Your program will read the compressed file and keep track of all necessary information in memory.

Hints

A strategy is to always read a byte from the compressed file. You also maintain an index to indicate where you are in that byte.

For example, when you read in a character, the index should be pointing to the most significant bit (i.e., the 7th bit, assuming that the least significant bit is the 0th bit). After you have processed one bit, the index should be decremented by 1.

You need to use structures for this assignment. A structure about the Huffman coding should include the relevant information

We will check whether your program initializes all pointers to NULL. This can be accomlished in the following ways
- int * ptr = NULL; /* initialized to NULL, required */
- int * ptr;

ptr = NULL; /* immediately after declaring ptr */

int * ptr = malloc(numberElement * sizeof(int));

If you write the following code without initialization, you will lose 0.2 point.

int * ptr; /* uninitialized pointer. NOT ALLOWED */

Why do we check this? Uninitializing pointers make a program's behavior unpredictable.

(You are also allowed to declare and allocate memory for a pointer immediately using malloc() or calloc().)

WHAT YOU NEED TO SUBMIT?

1. All the .c and .h files needed to execute your assignment. All submission for this IPA should have at least 2 .c files and 1 .h file. You can have more .c and .h files as you see fit. Any submissions with one .c file will have points deducted.

2. Makefile to compile your assignment. For each submission, the executable should be names ipa2_3.

3. A README file that describes the algorithm used in your assignment. The README should say how the functions accomplish their goals, not just the goals themselves.

HOW TO SUBMIT?

1. You can submit as many submission on the checking website as you would need. For this assignment, if you pass the testcases on the server, it will not guarantee a perfect score during grading.

2. Your submission for grading should be submitted through Blackboard. You can submit as many times on Blackboard as you need before the deadline. We will grade the final submission, so please make sure that it is your FINAL version with everything included.

HOW WILL WE GRADE YOUR SUBMISSION?

We will grade your assignments on algorithm correctness through DDD, memory allocation and deletions through valgrind , proper coding standards, content of README, a proper Makefile, and commenting.