Entries in hash table (1)

Tuesday
Jan272015

Just for Fun: Hash Table Implementation in C

TLDR: Take a look at the code on GitHub!

Hash tables are cool. Like many developers, I used them for years without giving much thought to how they actually work. Just for fun, I decided to look into it - and while I was at it, brush up on C, which I hadn't spent much time with since my first internship years ago.

Example: Hash Tables in Go

Hash tables (also known as "hash maps", or "maps") are so useful that they're a built-in type in the arguably minimalist Go programming language. They allow you to create a simple lookup with a key/value pairing.

For example, if you forget which Wiggle wears which color, you could create a map of color (key) to Wiggle name (value), then refer to them by color through the map (try it in the Go Playground):

 

What are Hash Tables Doing?

Conceptually, hash tables are pretty simple. Under the hood, a hash table has a bunch of buckets for its values. The hash table knows in which bucket to find a value by its key, so it can quickly find anything it's looking for. It's much like how you'd keep track of your old baseball cards in boxes. You have a system - in this case, a box per year. You have 50 boxes, but when you're looking for a specific card, you jump right to the correct box. Once you open that box, you rummage through it until you find the right card. As the number of buckets increases, hash table lookups approach constant time, O(1).

But... How Do They Work?

A hash table references a fixed number of buckets which can be represented as an array of simple linked lists. As we give the hash table a value to store, the key is fed through a a hashing algorithm which outputs a long integer called that key's "hash code". We mod that hash code by the number of buckets to figure out which bucket to use, then add the key and value to the end of the linked list in that bucket, creating the list if it doesn't yet exist.

For example... Say you have 10 buckets with indexes 0 through 9, and a key with hash code equal to 223. 223 mod 10 is 3, so we'll append the key/value to the linked list in bucket with index 3. If you had 1,000,000 buckets, then you'd append the key/value to the linked list in bucket with index 223, since 223 mod 1000000 is 223.

When you search for a value in a hash table by key, the same hashing algorithm is used on the key to figure out which bucket to look in. That bucket's linked list is traversed until the key is found, using an equality algorithm.

Java: hashCode() and equals()

If you've worked with Java, you've probably implemented hashCode()) a bunch of times, or more likely, copied and pasted from somewhere. You've no doubt heard that it's important to implement it carefully, but your program worked just fine with whatever you implemented.

How are these methods used in hash tables, and why are they important? Great questions! If you're using one of your custom types for a hash table's key, then your hashCode() method is used to determine which bucket to look in. Once the bucket is found, the match is determined with your equals() method, given the input key and keys in the bucket's list.

A poor implementation of hashCode() returns the same value for every input, reducing a hash table to a single bucket: a linked list. We'd lose all the performance benefits we were looking for with this data structure. Lookups go from constant time (O(1)) to linear time (O(n)). Implementing either improperly returns invalid results. You could store something in a hash table with a key, then get the wrong value, or no value at all when you try to retrieve it with the same key.

Hash Table Implementation in C

I suppose it's a little late in the post to finally get to my implementation, but without an understanding of how a hash table works, the code is just a bunch of marks on your screen. In my implementation, both the keys and values are strings, but it wouldn't be too much work to make it more flexible.

My two custom types are the hashTable and the linked list that represents each bucket:

Don't let struct hLinkedList **lists; scare ya - it's just an array that's sized and allocated later, when we determine how many buckets we need.

Here's my hashing function:

This uses the traditional multiplication by 31 scheme, which, by being prime, helps protect against losing information if the multiplication overflows. Since we're dealing with string functions, we use strcmp for the equality algorithm.

With my comments, and an understanding of C, the rest of the code should be fairly straightforward.

Demo

As described in README.md:

Compile and execute the binary to run the demo use case in main.c:

gcc *.c
./a.out

You should see:

Adding 'hello'=>'world', 'color'=>'blue' and printing:
-----
hello -> world
color -> blue


Changing 'hello' value to 'goodbye', then printing:
-----
hello -> goodbye
color -> blue


Removing 'hello' and printing:
-----
color -> blue


Removing 'color' and printing:
-----

Read My Codez!

Take a look at the source code on Github! Try modifying it to use a different type for either the keys or values.

Have your own fun!