Advanced C Programming

Fall 2017 :: ECE 264 :: Purdue University

This is for Fall 2017 (7 years ago) only.
Due 11/29

Huffman #3 (A)

This exercise will familiarize you with linked lists, which you will need for a subsequent programming assignment.

Getting Started

Start by getting the files. Type 264get hw14 and then cd hw14 from bash.
You will get the following files:

  1. huffman.h: An empty header file, you have to define your own functions in this homework.
  2. huffman.c: An empty c file, you have to define your own functions in this homework.
  3. huffman_main.c: You will use self defined functions to implement the huffman coding in this file. See requirements for detail
  4. examples: A folder with examples of input and output file.
Same as HW13, you have the flexibility to design your functions in this homework.

Overview

We assume that you have read the huffman coding and successfully completed HW12 and HW13. At this point, you should be able to construct a Huffman coding tree from a non-empty input file.

This assignment deals with the representation of a Huffman coding tree using a series of characters, or more economically using a series of bits. For more details, see the section "Header Information" in huffman coding. That section describes how a compressed file can be represented. Our focus is on how the topology of the Huffman coding tree can be represented in the compressed file.

The following description is adapted from that section: To represent a Huffman coding tree, we use a post-order traversal, writing each node visited. Post-order means that the writing of a node occurs after visiting the child nodes. When you encounter a leaf node, you write a 1 followed by the ASCII character of the leaf node. When you encounter a non-leaf node, you write a 0. To indicate the end of the Huffman coding tree, we write another 0.

Consider the following two examples, the header information of the string "go go gophers" is "1g1o01s1 01e1h01p1r00000". The post-order traversal of the Huffman coding tree gives us "1g1o01s1 01e1h01p1r0000". Another "0" indicates the end of the topology (see huffman coding for the topology).

String Header information
go go gophers 1g1o01s1 01e1h01p1r00000
streets are stone stars are not 1t1a1r001n1o01 01e1s0000

In these two examples, we use characters 0 and 1 to distinguish between non-leaf and leaf nodes (and 0 to indicate the end of a topology). As there are eight leaf nodes in each of the two examples, there are eight 1's, seven 0's for non-leaf nodes, and another 0 to indicate that we have reached the end of a topology. This approached used a total of 24 bytes.

Using bits to represent the header information

The header information can be made more economical if we use bits 0 and 1 to distinguish between non-leaf and leaf nodes, and also to indicate the end of a topology. In these two examples, there will be a total of 10 bytes (8 bytes for the leaf nodes(i.e., 'g', 'o', 'p' ... etc in 'go go gophers') and 2 bytes for all the bits 0's and 1's (i.e., 0's and 1's used in header information to distinguish non-leaf and leaf nodes. )).

Header information 1t1a1r001n1o01 01e1s0000
Header information(bit) 101110100101100001...
For example, in the bit-based approach, the first 16 bits (or the first 2 bytes) of the header information for encoding the string "streets are stone stars are not" are 10111010 01011000 (note that the space in the bit stream is introduced for better clarity). The ASCII coding for 't' straddles the two bytes, 7 bits in the first byte and 1 bit in the second byte. The second most significant bit in the second byte is a 1, indicating that the next 8 bits is an ASCII character, of which the the most significant 6 bits of the character 'a' is contained in the least significant 6 bits of the second byte.

In the bit-based representation of the Huffman coding tree, the last byte may not contain 8 meaningful bits. In this case, we pad it with 0 bits. Consider the case where the input file uses nine distinct ASCII characters. The number of bits required to represent the Huffman coding tree is 9×8 + 9×2 = 90 bits, which can be represented by 12 bytes. In other words, the last byte should contain only two useful bits. The least significant 6 bits should be padded with 0's.

Expected arguments for executable

We expect the huffman executable to accept 4 arguments: ./huffman input_file output_file_1 output_file_2 output_file_3
The executable should build a Huffman coding tree based on the input file (input_file) according to the description in HW13, and produce three output files. The examples folder provide some sample input files: gophers, lorum, stone, woods.
file contents
output_file_1
  1. The file contains as many lines as the number of distinct characters that appear in the input file.
  2. Each distinct character should have a line in this output file with the format of
    character:stream_of_0_and_1
  3. where the stream of 0 and 1 (as character '0' and character '1') being the binary pattern corresponding to the Huffman code of the character.
  4. The order in which you print the characters is based on a tree traversal of the Huffman coding tree that visits the left branch first, followed by the right branch.
  5. The corresponding output files for the input files in the examples folder are gophers.huffman, lorum.huffman, stone.huffman, and woods.huffman.
  6. Note that this is the last output file expected of huffman, the executable produced in HW13.
output_file_2
  1. The file contains a representation of the Huffman coding tree using a stream of characters obtained using a post-order traversal of the Huffman coding tree as described above.
  2. The corresponding output files for the input files in the examples folder are gophers.char, lorum.char, stone.char, and woods.char.
output_file_3
  1. The file contains a representation of the Huffman coding tree using a stream of bits obtained using a post-order traversal of the Huffman coding tree as described above.
  2. The bit stream would have to be represented as characters in the output file.
  3. The last character/byte in the output file may contain padding bits of 0 in the less significant positions.
  4. The corresponding output files for the input files in the examples folder are gophers.bit, lorum.bit, stone.bit, and woods.bit.

The output files for the example gophers were generated with the following command: ./huffman examples/gophers examples/gophers.huffman examples/gophers.char examples/gophers.bit

Bit-wise operations

You would have to use bit-wise operations in one or more functions to complete this assignment. In particular, these are operations that would be useful:

Operations Descriptions
>> shift right operator
<< shift left operator
& bit-wise and operator
| bit-wise or operator

Note that when a shift right operator is applied to signed representation (char, short, int, or long), the signed bit (bit in the most significant position of a 8-bit, 16-bit, 32-bit, or 64-bit number) is preserved. In other words, after shifting, the polarity of the number, i.e., whether the number is positive or negative, is preserved. For an unsigned representation, 0 is shifted into the most significant bit.

Suppose we are using an int to store a char (i.e., only the least significant byte of an int is used to store the useful information), and we are interested in splitting the char into two parts: 3 most significant bits and 5 least significant bits, and then exchange the parts such that the 3 bits now occupy the least significant positions and the 5 bits now occupy the most signficant positions:

  int old_ch;  // the original char
  int mask = 0xFF;  // a mask of all ones, 0x means hexadecimal, F is 1111
  int MS_mask = (mask >> 5) << 5;  // create a mask of 3 bits of 1 at most
                                   // significant positions, 0xE0
  int LS_mask = mask >> 3;         // create a mask of 5 bits of 1 at least
                                   // significant positions, 0x1F
  int LS_3_bits = (MS_mask & old_ch) >> 5;  // get the least significant 3 bits
                                            // of the new byte
  int MS_5_bits = (LS_mask & old_ch) << 3;  // get the most significant 5 bits
                                            // of the new byte
  int new_ch = LS_3_bits | MS_5_bits;  // get the new char 

Note that by using an int to store a char, we do not have to worry how the polarity may be affected by shifting.

In this code fragment, we use shifting to create masks (MS_mask and LS_mask) of 1 bits from the variable mask. These masks are used to extract 3 bits and 5 bits from old_ch using the bit-wise and operator. Moreover, the extracted bits are moved to correct positions using the shift operators. The variable new_ch is obtained by combining the extracted (and shifted) bits using the bit-wise or operator.

Note that there are many ways to accomplish the above. We chose a (laborious) way that, we think, demonstrates the operators and the use of masks.

How to check your output

You can use "diff" command to check your output against those provided in the examples folder.

A useful utilities is the xxd hexdump command. It shows the contents of a file in the hexadecimal representation. For example, xxd gophers.char gives you…

$ xxd gophers.char
0000000: 3167 316f 3031 7331 2030 3165 3168 3031  1g1o01s1 01e1h01
0000010: 7031 7230 3030 3030                      p1r00000

The left most column is the starting byte count (in hexadecimal). The far right side of the output shows a rough interpretation of the bytes as ASCII characters (instead of the numeric values of the bytes). Unprintable characters—in the range 0-31 and 127-255, including the newline and tab characters, for example—are displayed as a single period (“.”). In between, the contents of the file are listed in 2-byte groups, in same order they are stored in the file. For example, the first byte of gophers.char is 0x31; the second byte is 0x67.

If that grouping is confusing, you can add -g1 to tell xxd to display the bytes separately.

$ xxd -g1 gophers.char
0000000: 31 67 31 6f 30 31 73 31 20 30 31 65 31 68 30 31  1g1o01s1 01e1h01
0000010: 70 31 72 30 30 30 30 30                          p1r00000

You can do the same for gophers.bit:

$ xxd gophers.bit
0000000: b3db d739 02cb 685c 2e40                 ...9..h\.@

… or displaying one byte at a time…

$ xxd -g1 gophers.bit
0000000: b3 db d7 39 02 cb 68 5c 2e 40                    ...9..h\.@

You can also view the bytes in binary notation by adding -b.

$ xxd -b gophers.bit
0000000: 10110011 11011011 11010111 00111001 00000010 11001011  ...9..
0000006: 01101000 01011100 00101110 01000000                    h\.@

From that, you can directly see the bytes that your code will be writing to the file. If it helps, you can add -g10 -c10 to omit the spaces.

$ xxd -b -g10 -c10 gophers.bit
0000000: 10110011110110111101011100111001000000101100101101101000010111000010111001000000  ...9..h\.@

From there, you can imagine the boundaries between the marker bits (0 or 1) and the bits comprise characters.

1M01100111char:'g'1M01101111char:'o'0M1M01110011char:'s'1M00100000char:' '0M1M01100101char:'e'1M01101000char:'h'0M1M01110000char:'p'1M01110010char:'r'0M0M0M0M0M

Requirements

  1. Your submission must contain each of the following files, as specified:
    file contents
    huffman.h functions You are given flexibility to define your own functions.
    huffman.c functions You are given flexibility to define your own functions.
    huffman_main.c functions
    main(int argc, char✶ argv[])
    return type: int
    1. Use self defined functions to implement huffman coding in this file.
    2. The executable of this file takes one input_file and produces three output files. We expect 4 arguments, these arguments should be entered in the following order.
      1. input_file
      2. output_file_1
      3. output_file_2
      4. output_file_3
    3. The content of output_file_1, output_file_2, output_file_3 should follow the format described above.
    4. If executable is not supplied with exactly 4 arguments, the function returns EXIT_FAILURE.
  2. You will not receive any credit if we fail to obtain "huffman".
  3. You will not receive any credit if we fail to obtain output_file_1, output_file_2, and output_file_3.
  4. You will not receive any credit if the content of output_file_1, output_file_2, and output_file3 do not follow the format as specified in homework description.
  5. Do not print error messages or anything else on stdout (i.e., with printf(…)), except as directed by the instructions. Your main(…) can (and should) print some output to stdout.
  6. Do not include a main(…) in huffman.c. (→ zero credit, since we won't be able to compile.)
  7. Submissions must meet the code quality standards and the course policies on homework and academic integrity.

Submit

To submit HW14, type 264submit HW14 huffman.c huffman.h huffman_main.c from inside your hw14 directory.

Pre-tester

The pre-tester for HW14 has been released and is ready to use.

Q&A

Updates

11/09/2017 draft