Huffman 2: files, frequencies
Learning goals
You should learn how to:
- Open, read, and close files.
- Pass/return statically declared arrays to/from functions.
- Use fixed-size integer types.
- Implement a step toward the Huffman coding algorithms
Overview
In this exercise, you will create another part of the Huffman coding (data compression) code that you are developing. For this part, you will create a function that reads a file and calculates the frequencies of every character in the file.
Read the explanation of huffman coding. You will not need to understand the whole process to complete this homework, but it will help you understand how this homework and the last one fit into the larger algorithm.
Getting Started
Get the starter file (frequencies.h). There isn't much in the file. You will be filling in the rest.
Add two type aliases to the header file (frequencies.h).
typedef unsigned char uchar; typedef uint64_t Frequencies[256];
Although we could have put these in the header file for you, we wanted you to type them out to make sure you give them some thought.
In general, the format for typedef
is:
typedef
something that looks like a variable declaration
The name of the variable in that
something that looks like a variable declaration
becomes a new type that you can use
to declare new variables or parameters.
typedef unsigned char uchar
makes a type called uchar
that is an unsigned char. It just saves you from having to type unsigned char
a lot.
typedef uint64_t Frequencies
makes a type called Frequencies
that is an array of 256 integers. Each integer is of type uint64_t
which is a fixed-size integer type. Unlike int
and unsigned int
, which can vary in size between platforms, uint64_t
is guaranteed to be 64 bits (8 bytes) on any platform where it is used. Using uint64_t
requires #include <stdint.h>
so be sure to put that at the top of your frequencies.h. When you use the Frequencies
type, it is the same as an array of uint64_t[256]
. It saves you a bit of typing. More importantly, it expresses exactly what you are using the data for.
You have been using typedef
for the past few weeks in linked lists and BSTs. Hopefully, none of this will be too surprising.
Now, for the main part of this homework, you will be writing a single function: calc_frequencies(…)
. It will read all of the characters in a file and calculate the frequency of every character.
The parameter Frequencies freqs
will be an array of 256 integer values. Upon success, freques[ch]
will be set to the number of occurrences of character ch
in the file path
.
Keep in mind that a file is just a sequence of bytes. A given byte can have 256 possible values, from 0 to 255. If the file is plain ASCII text, then most of the values will be in the range of 32 (space) to 126 ('~'
). A few will be 10 (\n
) or 9 (\t
). Other kinds of files may contain any of the values from 0 to 255. For this homework, we make no assumptions. We must use unsigned char
because a char
assumes a range of -128 to 127, which cannot accommodate values ≥128.
Your calc_frequencies(…)
will open the file and read every character.
In case the file does not exist, or some other error occurs when opening the file, it will return false
and pass an error message back to the caller through a parameter.
Otherwise, it will return true
and pass the Frequencies
back via a parameter.
When fopen(…)
fails for any reason, it will return NULL
. To get a human-readable error message, call strerror(errno)
. The errno
global variable is an integer value that indicates the type of error most recently encountered.
To access errno
, you must #include <errno.h>
.
The strerror(…)
function returns a string message for a given errno
value. You can assume the string is on the data segment. You do not need to free it.
To access strerror(…)
, you must #include <string.h>
.
Tip: Remember that when creating a text file in most text editors, a newline ('\n'
) will be added to the last line of any non-empty file.
Tip: To see a more precise representation of the contents of any file, call xxd filename
from bash.
Tip: To initialize an array to all 0, use something like this. (This is an illustration. Do not copy this verbatim. It won't help you.)
int array_of_zeros[500] = {0};
You can also initialize an array of integers with all zeros, except for a few select values.
int array_of_mostly_zeros[10] = { [2]=100, [6]=100, [4]=100 };
The above example is equivalent to int array_of_mostly_zeros[10] = { 0, 0, 100, 0, 100, 0, 100, 0, 0, 0 };
Tip: Consider making a helper function like _print_frequiencies(Frequencies freqs)
in your test_frequencies.c that prints the frequencies in a nicely formatted manner. This may help you troubleshoot in case of problems.
Tip: C functions cannot return a static array. For example, a function cannot return uint64_t[256]
. Likewise, no function can return Frequencies
because that type is equivalent to uint64_t[256].
How much work is this?
In the instructor's solution, the body of the calc_frequencies(…)
comprises only 11 sloc (source lines of code).
Requirements
- Your submission must contain each of the following files, as specified:
file contents frequencies.h type definitions typedef unsigned char uchar;typedef uint64 t Frequencies[256];function declaration calc frequenciesFunction declaration. This is provided for you.frequencies.c functions calc frequencies(Frequencies freqs, const char✶ path, const char✶✶ a error)
→ return type: boolOpen a file atpath
and either store the character frequencies infreq
or set*a_error
tostrerror(errno)
.- This is the main part of this homework.
- Caller is responsible for initializing
freqs[ch]
to 0 for all ch in 0…255. Thus,calc_frequencies(…)
may assume that all freqs[▒]==0. - If the file is opened correctly, then set freqs[ch] to n, where n is the number of occurrences of character
ch
in the file atpath
. Then, returntrue
. Do not modifya_error
. - If the file could not be opened (i.e.,
fopen(…)
returnedNULL
, set*a_error
tostrerror(errno)
and returnfalse
. Do not modifyfreqs
. - You only need to check for errors related to failure to open the file (
fopen(…)
). - This should not print anything under any circumstances.
test_frequencies.c functions main(int argc, char✶ argv[])
→ return type: int- 100% code coverage is required.
- Use your miniunit.h.
- You must test at least one error condition (e.g., attempting to open a file that does not exist).
- You do not need to check for other kinds of errors.
- ⚠ Do not call
malloc(…)
orfree(…)
anywhere in this homework. You don't need them. - Submissions must meet the code quality standards and the course policies on homework and academic integrity.
Submit
To submit HW16 from within your hw16 directory, type
make submit
That should result in the following command:
264submit HW16 frequencies.h frequencies.c test_frequencies.c Makefile miniunit.h clog.h
You may include other files, but they must end with ".txt" to avoid conflicting with our files.
Pre-tester ●
The pre-tester for HW16 has been released and is ready to use.
Q&A
Updates
Updates may be added here.