Tuesday
Jan272015

Just for Fun: Hash Table Implementation in C

TLDR: Take a look at the code on GitHub!

Hash tables are cool. Like many developers, I used them for years without giving much thought to how they actually work. Just for fun, I decided to look into it - and while I was at it, brush up on C, which I hadn't spent much time with since my first internship years ago.

Example: Hash Tables in Go

Hash tables (also known as "hash maps", or "maps") are so useful that they're a built-in type in the arguably minimalist Go programming language. They allow you to create a simple lookup with a key/value pairing.

For example, if you forget which Wiggle wears which color, you could create a map of color (key) to Wiggle name (value), then refer to them by color through the map (try it in the Go Playground):

 

What are Hash Tables Doing?

Conceptually, hash tables are pretty simple. Under the hood, a hash table has a bunch of buckets for its values. The hash table knows in which bucket to find a value by its key, so it can quickly find anything it's looking for. It's much like how you'd keep track of your old baseball cards in boxes. You have a system - in this case, a box per year. You have 50 boxes, but when you're looking for a specific card, you jump right to the correct box. Once you open that box, you rummage through it until you find the right card. As the number of buckets increases, hash table lookups approach constant time, O(1).

But... How Do They Work?

A hash table references a fixed number of buckets which can be represented as an array of simple linked lists. As we give the hash table a value to store, the key is fed through a a hashing algorithm which outputs a long integer called that key's "hash code". We mod that hash code by the number of buckets to figure out which bucket to use, then add the key and value to the end of the linked list in that bucket, creating the list if it doesn't yet exist.

For example... Say you have 10 buckets with indexes 0 through 9, and a key with hash code equal to 223. 223 mod 10 is 3, so we'll append the key/value to the linked list in bucket with index 3. If you had 1,000,000 buckets, then you'd append the key/value to the linked list in bucket with index 223, since 223 mod 1000000 is 223.

When you search for a value in a hash table by key, the same hashing algorithm is used on the key to figure out which bucket to look in. That bucket's linked list is traversed until the key is found, using an equality algorithm.

Java: hashCode() and equals()

If you've worked with Java, you've probably implemented hashCode()) a bunch of times, or more likely, copied and pasted from somewhere. You've no doubt heard that it's important to implement it carefully, but your program worked just fine with whatever you implemented.

How are these methods used in hash tables, and why are they important? Great questions! If you're using one of your custom types for a hash table's key, then your hashCode() method is used to determine which bucket to look in. Once the bucket is found, the match is determined with your equals() method, given the input key and keys in the bucket's list.

A poor implementation of hashCode() returns the same value for every input, reducing a hash table to a single bucket: a linked list. We'd lose all the performance benefits we were looking for with this data structure. Lookups go from constant time (O(1)) to linear time (O(n)). Implementing either improperly returns invalid results. You could store something in a hash table with a key, then get the wrong value, or no value at all when you try to retrieve it with the same key.

Hash Table Implementation in C

I suppose it's a little late in the post to finally get to my implementation, but without an understanding of how a hash table works, the code is just a bunch of marks on your screen. In my implementation, both the keys and values are strings, but it wouldn't be too much work to make it more flexible.

My two custom types are the hashTable and the linked list that represents each bucket:

Don't let struct hLinkedList **lists; scare ya - it's just an array that's sized and allocated later, when we determine how many buckets we need.

Here's my hashing function:

This uses the traditional multiplication by 31 scheme, which, by being prime, helps protect against losing information if the multiplication overflows. Since we're dealing with string functions, we use strcmp for the equality algorithm.

With my comments, and an understanding of C, the rest of the code should be fairly straightforward.

Demo

As described in README.md:

Compile and execute the binary to run the demo use case in main.c:

gcc *.c
./a.out

You should see:

Adding 'hello'=>'world', 'color'=>'blue' and printing:
-----
hello -> world
color -> blue


Changing 'hello' value to 'goodbye', then printing:
-----
hello -> goodbye
color -> blue


Removing 'hello' and printing:
-----
color -> blue


Removing 'color' and printing:
-----

Read My Codez!

Take a look at the source code on Github! Try modifying it to use a different type for either the keys or values.

Have your own fun!

Tuesday
Jan272015

Introduction to Docker: Tech Talk and Demo

There's no shortage of passion for technology at Fog Creek Software. Each Friday, we take turns sharing our interests and hobbies with the rest of the company through tech talks. I've spent a bunch of time tinkering with Docker over the past few months, so recently, I took a turn.


Watch/read on blog.fogcreek.com!

Tuesday
Jan202015

JSON Encoding in Go: Dealing with Sensitive Fields

JSON marshalling in Go is pretty straight-forward, but sometimes your structs contain sensitive information that you'd like to keep out of your public JSON API. Fortunately, there are a few good solutions that don't require you to manually copy every field from one struct type to another.

Let's start with our basic Person struct:

The simplest way to prevent exposing AuthToken via JSON is with the `json:"-"` key, which tells the JSON package to always ignore this field. Let's call this "SafePerson", because you never have to worry about exposing AuthToken via JSON:

This works really well, until you need to include the field some of the time, like when showing someone their own record. In that case, you could try the `json:",omitempty"` key, which only includes the field if it has a value. It'd be up to you to clear it out when it doesn't belong in the JSON. This is "PrettySafePerson", because it's safe if you're careful:

This is great, but... here's the thing. That one time you forget to clear an AuthToken for someone other than the Person that it represents might give you quite the headache. I'd rather err on the side of safety, where forgetting something means not showing data that I meant to. 

Here's where we can take advantage of Go's type embedding. By embedding a Person inside another struct, we borrow all of the Person's fields, and can override some without affecting the Person. We'll build our API with this pattern, which always omit sensitive fields unless told otherwise:

PublicPerson's AuthToken field overrides the embedded Person.AuthToken, and because of the `json:",omitempty"` key, it'll be hidden from JSON unless you set it. Person.AuthToken still has the original value, you can include it in the JSON by calling IncludeAuthToken().

Try it out in the Go Playground

View Source on Github

Wednesday
Apr232014

Solving a 2014 Google I/O Secret Invite Puzzle

I didn't win the Google I/O ticket lottery this year, but thought I'd try to find and solve one of the secret puzzles Google was hiding, to get a chance to buy a reserved ticket.

I noticed that the Google Developers page header contained the text "I/O", with different colors for each of the letters. Refreshing the page changed their colors, and occasionally removed them from the page. Looking at the source code, I found this:

<span style="color: #386834">I</span>/<span style="color: #317368">O</span>
<!-- Still need a hint? http://goo.gl/Eypmp -->
</li><!-- 110101 110010 110001 110011 110100 110000  -->

You might as well check out the hint url before reading on. Yep, it got me too. But... it turns out it *is* a hint.

The blocks of numbers in the comments on the last line were clearly binary, so I converted them to integers: 5,2,1,3,4,0. I refreshed the page several times, printing out lots of values for the hex color of the "I" and "O", and with the resulting binary->integer conversions. I didn't see any real pattern in the hex color values, but did notice that each of the binary->integer conversions always had exactly one of each 0,1,2,3,4,5 - this looks like a series of indexes.

Going back to our hint, I realized that the goal is to come up with the 6-character code to use as a Google shortened URL. Rather than 6 characters, it appears that we have 12 and a series of 6 indexes, but really, those color codes each have 3 hex components:

0x38, 0x68, 0x34, 0x31, 0x73, 0x68

Convert these to decimal:

56, 104, 52, 49, 115, 104

And now, convert them from decimal to ASCII characters:

8, h, 4, 1, s, h

Let's apply the indexes (5,2,1,3,4,0) by rearranging these letters so that the 5th 0-based index character is first, then the 2nd, 1st, 3rd, 4th, and 0th:

h4h1s8

Prefix that code with the Google URL-Shortening domain, and we have: http://goo.gl/h4h1s8

Success! Well, sort of. By the time I solved this, all of the codes seemed to be taken. Of hundreds of codes generated, only a few worked. I'm not sure if some random ones were thrown in, or if I'm missing another piece to the puzzle. In any case, the shortened link redirects you to a fun console-based game, reminiscent of the old terminal-based Zork and Hitchhiker's Guide to the Galaxy games, but with color and simple animations. 

At the time of this writing, this game link still works, so have a go at it.  Take a look at my Python code that generates URLs by refreshing the Google Developers page and calculating the secret shortened URL.

TLDR: Game Link, Python Code

Here's a screenshot with all of the different options selected.

Sunday
Nov102013

Performance Testing Hibernate Query Approaches

I've known entities are expensive, but wanted to see for myself, so I built this test project to run some benchmarks. The tests aren't complete - just a simple query with no joins, but with a lot of data. The point was to see how much overhead is introduced after we have the query results.

About the Project

This is a simple Maven Spring project that creates an in-memory HSQLDB database, populates it 500,000 records, and then uses several Hibernate query strategies to fetch every one, and report on their average execution times.

Approaches Tested

  1. Using a JpaRepository interface's findAll method to return a list of attached Hibernate entities.

  2. Using Hibernate's StatelessSession interface to return a list of detached Hibernate entities.

  3. Selecting the specific fields of the entity, using Hibernate to return a simple List, and then manually converting that list to a list of detached entities (as DTOs, basically).

  4. Selecting the specific fields of the entity, then using Hibernate's AliasToBeanResultTransformer to build a list of detached entities (as DTOs, basically).

Changing Execution Parameters

By default, the database is loaded with 500,000 records, and each test is repeated in its own transaction 10 times. You can change both of these values in src/main/resources/application.properties.

Running Tests

This big of a database does take up over 256MB of memory, so you might have to increase your heap space. If you run the tests from Maven, you should be fine, since I increase it in the plugin's settings.

Download the sample project and run the following command from inside its directory:

mvn clean test

The test might take several minutes to run. At the end, the test will output the results.

My Results

The results are listed slowest to fastest:

  ---------------------------
  Testing JpaRepository query
  Total # of runs: 10
  JpaRepository avg time: 1073.5ms

  ---------------------------
  Testing stateless session query
  Total # of runs: 10
  Stateless Session avg time: 818.5ms

  ---------------------------
  Testing RowData query
  Total # of runs: 10
  RowData avg time: 317.7ms

  ---------------------------
  Testing ResultTransformer query
  Total # of runs: 10
  ResultTransformer avg time: 311.9ms

The individual times will vary on different systems - the relative performance is what's important.

The StatelessSession query was a little more efficient than returning attached entities, but still pretty slow for this big query. The AliasToBeanResultTransformer and my custom List -> DTO approaches tied as the best performers. I was hoping for this result, but worried that the reflective nature of AliasToBeanResultTransformer might have introduced some overhead. It did not.

Problems? Let me know!

I tried being as careful as I could with these tests:

  • took the average of several test runs
  • turned off Hibernate's second-level cache
  • turned off Hibernate's query cache
  • cleared the entity manager before each run
  • ran each test in its own transaction
  • ignored the first query in the transaction as to avoid any initial performance hit from opening it

I encourage you to download the project, take a look at the code, and try it out for yourself. If you see any issues with my methodology, please let me know, and I'll correct for it.