Huffman #3 (A)
This exercise will familiarize you with linked lists, which you will need for a subsequent programming assignment.
Start by getting the files. Type
264get hw14 and then
cd hw14 from bash.
You will get the following files:
- huffman.h: An empty header file, you have to define your own functions in this homework.
- huffman.c: An empty c file, you have to define your own functions in this homework.
- huffman_main.c: You will use self defined functions to implement the huffman coding in this file. See requirements for detail
- examples: A folder with examples of input and output file.
This assignment deals with the representation of a Huffman coding tree using a series of characters, or more economically using a series of bits. For more details, see the section "Header Information" in huffman coding. That section describes how a compressed file can be represented. Our focus is on how the topology of the Huffman coding tree can be represented in the compressed file.
The following description is adapted from that section: To represent a Huffman coding tree, we use a post-order traversal, writing each node visited. Post-order means that the writing of a node occurs after visiting the child nodes. When you encounter a leaf node, you write a 1 followed by the ASCII character of the leaf node. When you encounter a non-leaf node, you write a 0. To indicate the end of the Huffman coding tree, we write another 0.
Consider the following two examples, the header information of the string "go go gophers" is "1g1o01s1 01e1h01p1r00000". The post-order traversal of the Huffman coding tree gives us "1g1o01s1 01e1h01p1r0000". Another "0" indicates the end of the topology (see huffman coding for the topology).
|go go gophers||1g1o01s1 01e1h01p1r00000|
|streets are stone stars are not||1t1a1r001n1o01 01e1s0000|
In these two examples, we use characters 0 and 1 to distinguish between non-leaf and leaf nodes (and 0 to indicate the end of a topology). As there are eight leaf nodes in each of the two examples, there are eight 1's, seven 0's for non-leaf nodes, and another 0 to indicate that we have reached the end of a topology. This approached used a total of 24 bytes.
Using bits to represent the header information
The header information can be made more economical if we use bits 0 and 1 to distinguish between non-leaf and leaf nodes, and also to indicate the end of a topology. In these two examples, there will be a total of 10 bytes (8 bytes for the leaf nodes(i.e., 'g', 'o', 'p' ... etc in 'go go gophers') and 2 bytes for all the bits 0's and 1's (i.e., 0's and 1's used in header information to distinguish non-leaf and leaf nodes. )).
|Header information||1t1a1r001n1o01 01e1s0000|
In the bit-based representation of the Huffman coding tree, the last byte may not contain 8 meaningful bits. In this case, we pad it with 0 bits. Consider the case where the input file uses nine distinct ASCII characters. The number of bits required to represent the Huffman coding tree is 9×8 + 9×2 = 90 bits, which can be represented by 12 bytes. In other words, the last byte should contain only two useful bits. The least significant 6 bits should be padded with 0's.
Expected arguments for executableWe expect the huffman executable to accept 4 arguments:
./huffman input_file output_file_1 output_file_2 output_file_3
The executable should build a Huffman coding tree based on the input file (input_file) according to the description in HW13, and produce three output files. The examples folder provide some sample input files: gophers, lorum, stone, woods.
The output files for the example gophers were generated with the following command:
./huffman examples/gophers examples/gophers.huffman examples/gophers.char examples/gophers.bit
You would have to use bit-wise operations in one or more functions to complete this assignment. In particular, these are operations that would be useful:
|>>||shift right operator|
|<<||shift left operator|
|&||bit-wise and operator|
||||bit-wise or operator|
Note that when a shift right operator is applied to signed representation (char, short, int, or long), the signed bit (bit in the most significant position of a 8-bit, 16-bit, 32-bit, or 64-bit number) is preserved. In other words, after shifting, the polarity of the number, i.e., whether the number is positive or negative, is preserved. For an unsigned representation, 0 is shifted into the most significant bit.
Suppose we are using an int to store a char (i.e., only the least significant byte of an int is used to store the useful information), and we are interested in splitting the char into two parts: 3 most significant bits and 5 least significant bits, and then exchange the parts such that the 3 bits now occupy the least significant positions and the 5 bits now occupy the most signficant positions:
int old_ch; // the original char int mask = 0xFF; // a mask of all ones, 0x means hexadecimal, F is 1111 int MS_mask = (mask >> 5) << 5; // create a mask of 3 bits of 1 at most // significant positions, 0xE0 int LS_mask = mask >> 3; // create a mask of 5 bits of 1 at least // significant positions, 0x1F int LS_3_bits = (MS_mask & old_ch) >> 5; // get the least significant 3 bits // of the new byte int MS_5_bits = (LS_mask & old_ch) << 3; // get the most significant 5 bits // of the new byte int new_ch = LS_3_bits | MS_5_bits; // get the new char
Note that by using an int to store a char, we do not have to worry how the polarity may be affected by shifting.
In this code fragment, we use shifting to create masks (MS_mask and LS_mask) of 1 bits from the variable mask. These masks are used to extract 3 bits and 5 bits from old_ch using the bit-wise and operator. Moreover, the extracted bits are moved to correct positions using the shift operators. The variable new_ch is obtained by combining the extracted (and shifted) bits using the bit-wise or operator.
Note that there are many ways to accomplish the above. We chose a (laborious) way that, we think, demonstrates the operators and the use of masks.
How to check your output
You can use "diff" command to check your output against those provided in the examples folder.
A useful utilities is the
hexdump command. It shows
the contents of a file in the hexadecimal representation. For example,
xxd gophers.char gives you…
$ xxd gophers.char0000000: 3167 316f 3031 7331 2030 3165 3168 3031 1g1o01s1 01e1h01 0000010: 7031 7230 3030 3030 p1r00000
The left most column is the starting byte count (in hexadecimal). The far right side of the output shows a rough interpretation of the bytes as ASCII characters (instead of the numeric values of the bytes). Unprintable characters—in the range 0-31 and 127-255, including the newline and tab characters, for example—are displayed as a single period (“.”). In between, the contents of the file are listed in 2-byte groups, in same order they are stored in the file. For example, the first byte of gophers.char is 0x31; the second byte is 0x67.
If that grouping is confusing, you can add
to tell xxd to display the bytes separately.
$ xxd -g1 gophers.char0000000: 31 67 31 6f 30 31 73 31 20 30 31 65 31 68 30 31 1g1o01s1 01e1h01 0000010: 70 31 72 30 30 30 30 30 p1r00000
You can do the same for gophers.bit:
$ xxd gophers.bit0000000: b3db d739 02cb 685c 2e40 ...9..h\.@
… or displaying one byte at a time…
$ xxd -g1 gophers.bit0000000: b3 db d7 39 02 cb 68 5c 2e 40 ...9..h\.@
You can also view the bytes in binary notation by adding
$ xxd -b gophers.bit0000000: 10110011 11011011 11010111 00111001 00000010 11001011 ...9.. 0000006: 01101000 01011100 00101110 01000000 h\.@
From that, you can directly see the bytes that your code will be writing to the file.
If it helps, you can add
-g10 -c10 to omit the spaces.
$ xxd -b -g10 -c10 gophers.bit0000000: 10110011110110111101011100111001000000101100101101101000010111000010111001000000 ...9..h\.@
From there, you can imagine the boundaries between the marker bits (0 or 1) and the bits comprise characters.
- Your submission must contain each of the following files, as specified:
file contents huffman.h functions You are given flexibility to define your own functions. huffman.c functions You are given flexibility to define your own functions. huffman_main.c functions
main(int argc, char✶ argv)→ return type: int
- Use self defined functions to implement huffman coding in this file.
The executable of this file takes one input_file and produces three output files. We expect 4 arguments, these arguments should be entered in the following order.
- The content of output_file_1, output_file_2, output_file_3 should follow the format described above.
- If executable is not supplied with exactly 4 arguments, the function returns EXIT_FAILURE.
- ⚠ You will not receive any credit if we fail to obtain "huffman".
- ⚠ You will not receive any credit if we fail to obtain output_file_1, output_file_2, and output_file_3.
- ⚠ You will not receive any credit if the content of output_file_1, output_file_2, and output_file3 do not follow the format as specified in homework description.
⚠ Do not print error messages or anything else on
printf(…)), except as directed by the instructions. Your
main(…)can (and should) print some output to
- ⚠ Do not include a
main(…)in huffman.c. (→ zero credit, since we won't be able to compile.)
- Submissions must meet the code quality standards and the course policies on homework and academic integrity.
To submit HW14, type
264submit HW14 huffman.c huffman.h huffman_main.c
from inside your hw14 directory.
The pre-tester for HW14 has been released and is ready to use.