Data Structures

Fall 2023 ECE 36800 :: Purdue University

Due 12/9

Hashing

Learning goals

You should learn how to:

  1. Design a hash function suitable for use in hash tables (data structures).
  2. Efficiently discover successive prime numbers.

This assignment does not involve creating a hash table. The focus is on the hashing function.

Overview

Hashing is the process of generating a fixed-size number of byte-string (called a hash or hash value) from variable-sized input data. This process is performed by a hash function.

An ideal hash function will generate hashes that are uniformly distributed over the set of possible (or probable) inputs. Any change to the input—even a single byte or bit—should result in a seemingly random change to every bit in the hash.

If you have two objects (i.e., strings or other inputs) that result in the same hash, there should be very high probability that the two objects are identical, even without comparing them directly.

Conversely, with an ideal hash function, the probability that two files (or other inputs) result in the same hash, despite being different, should be very low. Of course, it is limited by the number of bits in the hash. For example, if your hash function generates 128-bit hashes, and you have two files that, when hashed, both result in the same 128 bits, then the probability that those two files are actually different should be $\frac{1}{2^{128}}$.

Generating a hash, given some input, is easy. However, the reverse is not true; given a hash, it is difficult to find an input that results in that hash. For that reason, hashing is used extensively in cryptography, including asymetric encryption, signing, protocols for secure network connections.

Hashing is used in a variety of areas of computing:

This assignment

For this assignment, you will focus on the hash function itself. You do not need to construct a hash table.

You will code two components that are important to any application of hashing:

In the case of hash tables, the capacity of a hash table (i.e., the number of slots) is normally a prime number. To generate the hash, we represent the input as a very large number (conceputally), and then mod that number with the capacity of the hash table.

Example: Suppose your input data is just "ABC" (3 characters). The ASCII values of those characters are 65, 66, and 67. In hexadecimal, that would be written 0x41, 0x42, and 0x43. If we concatenate those hexadecimal numbers, we get 0x414243. That can be interpreted as a single integer (4,276,803). Numerically, you can think of that number as $$65 * 256^2 + 66 * 256^1 + 67 * 256^0 = 4,276,803$$.

We then mod that number with the capacity of the hash table, which is normally a prime number. For example, if the hash table had 31 slots, we would calculate the hash as $$4,276,803 \% 31 = 12$$.

Prime numbers give us a special property that when even a small change in the input will usually result in a change to every bit of the hash. Obviously, it depends on the specific inputs and outputs, but using prime numbers avoids some problems that would otherwise make the hash values more clustered and/or predictable.

Although we are not doing a hash table for this assignment, those properties are useful for many applications of hashing.

Since we need to use prime numbers, this assignment also asks you to write a function that generates the next prime number greater than some parameter n.

Getting Started

Get the starter code

Change to your directory for this course.

you@ecegrid-thin1 ~/ $ cd 368

Get the starter code.

you@ecegrid-thin1 ~/368/ $ 368get ec01
This will write the following files: … 1. ec01/000.A.txt 2. ec01/001.B.txt 3. ec01/002.C.txt 4. ec01/003.AA.txt 5. ec01/004.AA.txt 6. ec01/005.AB.txt 7. ec01/006.B.txt 8. ec01/007.AC.txt 9. ec01/008.BC.txt 10. ec01/009.YY.txt 11. ec01/010.YZ.txt 12. ec01/011.B.txt 13. ec01/012.ZZ.txt 14. ec01/013.ABCD.txt 15. ec01/014.ABCDEFGHIJKLMNOPQRSTUVWXYZ.txt 16. ec01/015.ABCD.txt 17. ec01/016.PU.txt 18. ec01/017.ECE.txt 19. ec01/018.ECE.txt 20. ec01/019.SO.txt 21. ec01/020.ABCDEFGHIJKLMNOPQRSTUVWXYZ.txt 22. ec01/Makefile 23. ec01/credit.txt 24. ec01/file_utils.c 25. ec01/file_utils.h 26. ec01/hashing.c 27. ec01/hashing.h 28. ec01/test_calc_hash_of_file_content.c 29. ec01/test_file_explanations.txt 30. ec01/test_get_next_prime.c 31. ec01/test_hashing.c Ok? (y/n)y 31 files were written
you@ecegrid-thin1 ~/368/ $ cd ec01
you@ecegrid-thin1 ~/368/ec01 $

For your convenience, we are providing a Makefile to make it more convenient to compile, test, and submit your code. To see the commands that are available, open it up.

you@ecegrid-thin1 ~/368/ec01 $ vim Makefile

In summary:

  • make … compiles your code.
  • make test … compiles your code and runs ./test_hashing.
  • make submit … submits the required files. (Double-check!!!)
  • make valgrind … compiles your code and checks for memory faults using Valgrind.

Requirement: incremental submissions

You are required to submit at least 6 times, at least 10 minutes apart with significant differences in the code—and the amount of working functionality—between these submissions. Submitting even more often is strongly encouraged as it saves backups of your work, in case something goes wrong.

Requirements

  1. Your submission must contain all files included in the starter code. We will only check the following file:
    file contents
    hashing.c functions
    calc hash of file content(char const✶ path, uint64 t✶ a hash value, size t capacity)
    return type: bool
    Calculate the hash for the bytes in the file at path.
    • path is the filename (or path) to the file.
    • Assume the file might be a binary file, i.e., bytes may have values from 0 to 255.
    • capacity is the size of the hash space (i.e., number of distinct hash values possible). If we were using this for a hash table, this would be the capacity of the hash table.
    • If you were able to successfully open and read the contents of the file, set *a_hash_value to the resulting hash value and return true.
    • If you were unable to open the file, return false, and do not modify *a_hash_value.
    • You may follow the outline given in the notes for our lecture (from Prof. Koh's slides), but it will not work as is. You need to understand what it is doing.
    get next prime(unsigned long n)
    return type: unsigned long
    Return the next prime number greater than n.
    • n may be any integer ≥0.
    • You will need to use trial division to test for primality, but be efficient.
      • Avoid testing numbers that are divisible by 2 or 3 (except when n ≤ 4).
      • Note that ever prime number can be expressed as $6k + i$ where k is an integer and i ∈ {1, 5}. You can use that to structure your code.
      • Do not use any helper functions.
      • We may or may not test the efficiency of your code, but it isn't too hard to do and it is good to know. It doesn't add much difficulty to this.
    • This will be worth half the credit for this assignment.
  2. Do not modify the function signatures of the required functions.
  3. You may use the functions in file_utils.c, if you find them useful. Some of those functions relate to a broader version of this assignment. (We had originally planned to have you write a program to find duplicate files. We reduced the assignment but left some of those support functions in there. Those who are interested may complete the rest for fun (no additional credit). Ask and we can give you the details.
  4. We have provided tests for you in test_hashing.c.c.
  5. In case of any corrections announced by email, apply them and ensure that your code works.
  6. Running time should be reasonable for each operation. Follow the spirit of the assignment.
  7. Only the following external header files, functions, and symbols are allowed in hashing.c.
    header functions/symbols allowed in…
    stdbool.h bool, true, false *
    stdio.h * *
    stdint.h * *
    assert.h assert *
    stdlib.h * *
    All others are prohibited unless approved by the instructor. Feel free to ask if there is something you would like to use.
  8. Use create_graph_with_no_edges(…) to create a graph structure to start with. Caller is responsible for freeing.
  9. Follow the policies on homework (including the base requirements) and academic integrity. For example:
    • Do not copy code from generated by Chat-GPT (or other GenAI tools).
    • Do not copy code found on the web.

Submit

To submit EC01 from within your ec01 directory, type

make submit

That should result in the following command: 368submit EC01 hashing.c test_hashing.c hashing.h file_utils.h file_utils.c credit.txt Makefile

Q&A

Updates

There are no updates to EC01 so far.