Hash tables are cool. Like many developers, I used them for years without giving much thought to how they actually work. Just for fun, I decided to look into it - and while I was at it, brush up on C, which I hadn’t spent much time with since my first internship years ago.
Hash tables (also known as “hash maps” or “maps”) are so useful that they’re a built-in type in the arguably minimalist Go programming language. They allow you to create a simple lookup with a key/value pairing.
For example, if you forget which Wiggle wears which color, you could create a map of color (key) to Wiggle name (value), then refer to them by color through the map (try it in the Go Playground):
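A minimal sketch of that map (the color-to-Wiggle pairings here are from memory, so treat them as illustrative):

```go
package main

import "fmt"

func main() {
	// Map of color (key) to Wiggle name (value).
	wiggles := map[string]string{
		"yellow": "Greg",
		"red":    "Murray",
		"purple": "Jeff",
		"blue":   "Anthony",
	}

	// Look up a Wiggle by color.
	fmt.Println("The red Wiggle is", wiggles["red"])
}
```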
What are Hash Tables Doing?
Conceptually, hash tables are pretty simple. Under the hood, a hash table has a bunch of buckets for its values. The hash table knows in which bucket to find a value by its key, so it can quickly find anything it’s looking for. It’s much like how you’d keep track of your old baseball cards in boxes. You have a system - in this case, a box per year. You have 50 boxes, but when you’re looking for a specific card, you jump right to the correct box. Once you open that box, you rummage through it until you find the right card. As the number of buckets increases, hash table lookups approach constant time, O(1).
But… How Do They Work?
A hash table references a fixed number of buckets, which can be represented as an array of simple linked lists. When we give the hash table a value to store, the key is fed through a hashing algorithm which outputs a long integer called that key’s “hash code”. We mod that hash code by the number of buckets to figure out which bucket to use, then add the key and value to the end of the linked list in that bucket, creating the list if it doesn’t yet exist.
For example… Say you have 10 buckets with indexes 0 through 9, and a key with hash code equal to 223. 223 mod 10 is 3, so we’ll append the key/value to the linked list in the bucket with index 3. If you had 1,000,000 buckets, then you’d append the key/value to the linked list in the bucket with index 223, since 223 mod 1000000 is 223.
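In code, picking the bucket is a one-liner. A tiny C sketch (the variable names are mine):

```c
#include <stdio.h>

int main(void) {
    unsigned long hashCode = 223; /* output of the hashing algorithm */
    size_t numBuckets = 10;       /* total buckets in the table */

    /* Mod the hash code by the bucket count to pick the bucket. */
    size_t bucket = hashCode % numBuckets;

    printf("hash code %lu -> bucket %zu\n", hashCode, bucket); /* bucket 3 */
    return 0;
}
```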
When you search for a value in a hash table by key, the same hashing algorithm is used on the key to figure out which bucket to look in. That bucket’s linked list is then traversed, using an equality check on each key, until the matching key is found.
Java: hashCode() and equals()
If you’ve worked with Java, you’ve probably implemented hashCode() a bunch of times, or more likely, copied and pasted it from somewhere. You’ve no doubt heard that it’s important to implement it carefully, but your program worked just fine with whatever you implemented.
How are these methods used in hash tables, and why are they important? Great questions! If you’re using one of your custom types for a hash table’s key, then your hashCode() method is used to determine which bucket to look in. Once the bucket is found, the match is determined with your equals() method, given the input key and keys in the bucket’s list.
A poor implementation of hashCode() returns the same value for every input, reducing the hash table to a single bucket: one linked list. We’d lose all the performance benefits we were looking for in this data structure - lookups go from constant time (O(1)) to linear time (O(n)). Implementing either method incorrectly can also return invalid results: you could store something in a hash table with a key, then get the wrong value, or no value at all, when you try to retrieve it with the same key.
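If you haven’t written one in a while, a typical correct pair looks something like this sketch, for a hypothetical Point class:

```java
import java.util.Objects;

public class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point other = (Point) o;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Equal objects must produce equal hash codes.
        return Objects.hash(x, y);
    }
}
```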
Hash Table Implementation in C
I suppose it’s a little late in the post to finally get to my implementation, but without an understanding of how a hash table works, the code is just a bunch of marks on your screen. In my implementation, both the keys and values are strings, but it wouldn’t be too much work to make it more flexible.
My two custom types are the hashTable and the linked list that represents each bucket:
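They look roughly like this (the fields besides lists are my best reconstruction):

```c
#include <stddef.h>

/* One node in a bucket's list: a key/value pair. */
struct hLinkedList {
    char *key;
    char *value;
    struct hLinkedList *next;
};

/* The hash table: a bucket count, plus an array of bucket lists. */
struct hashTable {
    size_t numBuckets;
    struct hLinkedList **lists; /* allocated once the bucket count is known */
};
```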
Don’t let struct hLinkedList **lists; scare ya - it’s just an array that’s sized and allocated later, when we determine how many buckets we need.
JSON marshalling in Go is pretty straightforward, but sometimes your structs contain sensitive information that you’d like to keep out of your public JSON API. Fortunately, there are a few good solutions that don’t require you to manually copy every field from one struct type to another.
Let’s start with our basic Person struct:
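A minimal version (Name stands in for whatever real fields you’d have):

```go
type Person struct {
	Name      string
	AuthToken string
}
```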
The simplest way to prevent exposing AuthToken via JSON is with the json:"-" tag, which tells the json package to always ignore this field. Let’s call this SafePerson, because you never have to worry about exposing AuthToken via JSON:
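Mirroring the sketch above:

```go
type SafePerson struct {
	Name      string
	AuthToken string `json:"-"` // the json package always ignores this field
}
```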
This works really well, until you need to include the field some of the time, like when showing someone their own record. In that case, you could try the json:",omitempty" tag, which only includes the field if it has a value. It’d be up to you to clear it out when it doesn’t belong in the JSON. This is PrettySafePerson, because it’s safe if you’re careful:
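Again following the sketch above:

```go
type PrettySafePerson struct {
	Name      string
	AuthToken string `json:",omitempty"` // included only when non-empty
}
```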
This is great, but… here’s the thing. That one time you forget to clear an AuthToken for someone other than the Person that it represents might give you quite the headache. I’d rather err on the side of safety, where forgetting something means not showing data that I meant to.
Here’s where we can take advantage of Go’s type embedding. By embedding a Person inside another struct, we borrow all of the Person’s fields, and can override some without affecting the Person. We’ll build our API with this pattern, which always omits sensitive fields unless told otherwise.
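A sketch of that pattern, continuing with the hypothetical Person from above:

```go
type PublicPerson struct {
	Person
	AuthToken string `json:",omitempty"` // shadows Person.AuthToken in the JSON
}

// IncludeAuthToken copies the embedded Person's AuthToken into the
// shadowing field so that it appears in the marshalled JSON.
func (p *PublicPerson) IncludeAuthToken() {
	p.AuthToken = p.Person.AuthToken
}
```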
PublicPerson’s AuthToken field overrides the embedded Person.AuthToken, and because of the json:",omitempty" tag, it’ll be hidden from JSON unless you set it. Person.AuthToken still has the original value, so you can include it in the JSON by calling IncludeAuthToken().
I didn’t win the Google I/O ticket lottery this year, but thought I’d try to find and solve one of the secret puzzles Google was hiding, to get a chance to buy a reserved ticket.
I noticed that the Google Developers page header contained the text “I/O”, with different colors for each of the letters. Refreshing the page changed their colors, and occasionally removed them from the page. Looking at the source code, I found this:
You might as well check out the hint URL before reading on. Yep, it got me too. But… it turns out it is a hint.
The blocks of numbers in the comments on the last line were clearly binary, so I converted them to integers: 5,2,1,3,4,0. I refreshed the page several times, printing out the hex colors of the “I” and “O” along with the resulting binary-to-integer conversions. I didn’t see any real pattern in the hex color values, but I did notice that the binary-to-integer conversions always contained exactly one each of 0, 1, 2, 3, 4, and 5 - that looks like a series of indexes.
Going back to our hint, I realized that the goal is to come up with the 6-character code to use as a Google shortened URL. Rather than 6 characters, it appears that we have 12 hex digits and a series of 6 indexes - but really, those two color codes each have 3 two-digit hex components:
And now, convert them from decimal to ASCII characters:
Let’s apply the indexes (5,2,1,3,4,0) by rearranging these characters: the one at 0-based index 5 comes first, then the ones at indexes 2, 1, 3, 4, and 0:
Prefix that code with the Google URL-Shortening domain, and we have:
Success! Well, sort of. By the time I solved this, all of the codes seemed to be taken. Of hundreds of codes generated, only a few worked. I’m not sure if some random ones were thrown in, or if I’m missing another piece to the puzzle. In any case, the shortened link redirects you to a fun console-based game, reminiscent of the old terminal-based Zork and Hitchhiker’s Guide to the Galaxy games, but with color and simple animations.
At the time of this writing, this game link still works, so have a go at it. Take a look at my Python code that generates URLs by refreshing the Google Developers page and calculating the secret shortened URL.
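The heart of that script is just the decoding walked through above. A minimal sketch, with made-up colors (the real ones changed on every refresh) and goo.gl as the shortener domain:

```python
# Hypothetical colors for the "I" and the "O"; only the index order
# (5,2,1,3,4,0) comes from the actual puzzle.
i_color = "656f6f"
o_color = "676c67"
indexes = [5, 2, 1, 3, 4, 0]

# Split the two colors into six 2-digit hex components,
components = [i_color[j:j + 2] for j in range(0, 6, 2)] + \
             [o_color[j:j + 2] for j in range(0, 6, 2)]

# convert each component to its ASCII character,
chars = [chr(int(c, 16)) for c in components]

# and rearrange the characters using the indexes.
code = "".join(chars[i] for i in indexes)

print("http://goo.gl/" + code)
```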
I’ve known entities are expensive, but wanted to see for myself, so I built this test project to run some benchmarks. The tests aren’t complete - just a simple query with no joins, but with a lot of data. The point was to see how much overhead is introduced after we have the query results.
About the Project
This is a simple Maven Spring project that creates an in-memory HSQLDB database, populates it with 500,000 records, then uses several Hibernate query strategies to fetch every record and reports on their average execution times.
Approaches Tested
Using a JpaRepository interface’s findAll method to return a list of attached Hibernate entities.
Using Hibernate’s StatelessSession interface to return a list of detached Hibernate entities.
Selecting the specific fields of the entity, using Hibernate to return a simple List<Object[]>, and then manually converting that list to a list of detached entities (as DTOs, basically).
Selecting the specific fields of the entity, then using Hibernate’s AliasToBeanResultTransformer to build a list of detached entities (as DTOs, basically).
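To make the last two approaches concrete, here’s a rough sketch; the Widget entity and WidgetDto are stand-ins for the project’s real types:

```java
import java.util.ArrayList;
import java.util.List;

import org.hibernate.Session;
import org.hibernate.transform.AliasToBeanResultTransformer;

public class QuerySketches {

    // A hypothetical DTO. AliasToBeanResultTransformer populates it via
    // reflection, so it needs setters matching the query aliases.
    public static class WidgetDto {
        private Long id;
        private String name;

        public WidgetDto() {
        }

        public WidgetDto(Long id, String name) {
            this.id = id;
            this.name = name;
        }

        public void setId(Long id) {
            this.id = id;
        }

        public void setName(String name) {
            this.name = name;
        }
    }

    // Approach 3: select specific fields into Object[] rows, then convert
    // each row to a detached DTO by hand.
    @SuppressWarnings("unchecked")
    public static List<WidgetDto> manualDtos(Session session) {
        List<Object[]> rows = session
                .createQuery("select w.id, w.name from Widget w")
                .list();
        List<WidgetDto> dtos = new ArrayList<>();
        for (Object[] row : rows) {
            dtos.add(new WidgetDto((Long) row[0], (String) row[1]));
        }
        return dtos;
    }

    // Approach 4: let Hibernate build the DTOs from the aliased columns.
    @SuppressWarnings("unchecked")
    public static List<WidgetDto> transformedDtos(Session session) {
        return session
                .createQuery("select w.id as id, w.name as name from Widget w")
                .setResultTransformer(new AliasToBeanResultTransformer(WidgetDto.class))
                .list();
    }
}
```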
Changing Execution Parameters
By default, the database is loaded with 500,000 records, and each test is repeated in its own transaction 10 times. You can change both of these values in src/main/resources/application.properties.
Running Tests
A database this big takes up over 256MB of memory, so you might have to increase your heap space. If you run the tests from Maven, you should be fine, since I increase it in the plugin’s settings.
Download the sample project and run the following command from inside its directory:
mvn clean test
The tests might take several minutes to run, and the results are printed at the end.
My Results
The results are listed slowest to fastest:
The individual times will vary on different systems - the relative performance is what’s important.
The StatelessSession query was a little more efficient than returning attached entities, but still pretty slow for this big query. The AliasToBeanResultTransformer and my custom List<Object[]> -> DTO approaches tied as the best performers. I was hoping for this result, but worried that the reflective nature of AliasToBeanResultTransformer might have introduced some overhead. It did not.
Problems? Let me know!
I tried being as careful as I could with these tests:
took the average of several test runs
turned off Hibernate’s second-level cache
turned off Hibernate’s query cache
cleared the entity manager before each run
ran each test in its own transaction
ignored the first query in each transaction, so as to avoid any initial performance hit from opening it
I encourage you to download the project, take a look at the code, and try it out for yourself. If you see any issues with my methodology, please let me know, and I’ll correct for it.
I rely heavily on the Hibernate framework in all of my database-driven Java projects. It’s a full-featured, cross-database ORM framework capable of managing your database schema, persisting your entities to the database, populating entities from the database, caching queries, caching entities, registering entities for lifecycle events, indexing them in a search index, and just about everything else you could ask of an ORM.
When you fetch an entity from Hibernate, it remains attached to your EntityManager for the duration of that transaction. The EntityManager is responsible for all of the magic - it watches the entity for changes, persisting it and calling lifecycle methods when appropriate, makes sure to return the same instance of an entity for multiple queries to the same record, builds and executes additional queries if you access properties that point to associations in the database, and of course much more.
Take a look at my hibernate-perf-test sample project to see the relative performance of different Hibernate query strategies.
Entities are Expensive!
Entities are convenient, but if you don’t understand what they’re doing, you can get yourself into trouble. It all boils down to the fact that an entity’s getter method may run several more queries, so long as the entity is still attached to the EntityManager. A developer can be excused at first, because we’re used to getters and setters containing little to no logic beyond accessing a private member variable. Hibernate is an amazing framework, but its greatest strength, being easy to use, becomes a liability: it’s too easy to use it to thrash your database.
The most common and expensive errors that I see are due to:
Accidental “eager” wiring of associated entities
When you load a Customer that has a list of Orders that are wired “eagerly”, Hibernate will build and execute a query to fetch all of those Orders, even if the code never accesses that list. The Orders can have their own eager associations, which compounds this problem. This becomes performance-crippling when your User table is eagerly self-referencing, and just about every query loads up your entire database. This is the sort of issue you might not notice during development, with 10 records in your database, but believe me, it shows itself in production.
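For illustration, the eager wiring looks something like this (Customer and Order are hypothetical entities):

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import javax.persistence.OneToMany;

@Entity
public class Customer {

    @Id
    @GeneratedValue
    private Long id;

    // EAGER tells Hibernate to fetch every Order whenever a Customer is
    // loaded, whether or not the code ever reads this list.
    @OneToMany(mappedBy = "customer", fetch = FetchType.EAGER)
    private List<Order> orders;
}
```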
Too many queries by accessing an association while looping over entities
When you’re displaying a table of 20 User records, and each one needs the name of its Organization, the easiest path for a developer is to loop over the list of entities and access the “organization” property on each User. This is intuitive, and makes sense with how we think about objects - you have a User, the User has an Organization, and the Organization has a name. With Hibernate, though, you need to understand that the Organization isn’t loaded until you access it. By fetching each User’s Organization this way, you’re producing at least 20 more queries. Now, imagine if the user chooses a table size of 50 rows, or if Organization has some eagerly-loading associations to surprise you with (like another User record!). You’ve just killed your action.
This comes up a lot when a developer tries to do the right thing and adds a lot of logging. Even if we don’t need the Organization for the data table we’re building, the developer might figure that someone reading the logs would want to see each User’s organization. Those logs just became very expensive, and nobody will notice until your users are complaining about page load times in production.
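Here’s the innocent-looking version of that loop, again with hypothetical entities:

```java
import java.util.List;
import javax.persistence.EntityManager;

public class UserTable {

    // Looks harmless, but each user.getOrganization().getName() call on a
    // lazy association can fire its own SELECT: 20 users, 20+ extra queries.
    public static void printUsers(EntityManager entityManager) {
        List<User> users = entityManager
                .createQuery("select u from User u", User.class)
                .setMaxResults(20)
                .getResultList();

        for (User user : users) {
            System.out.println(user.getName() + " belongs to "
                    + user.getOrganization().getName());
        }
    }
}
```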
Yes, there are better ways to use entities
If you’re familiar with the framework, you’re probably shaking your head, mumbling something about fetch joins, closing the transaction before building your view, and other strategies for more efficiently fetching entities. My point is that as much as I love entities, they’re performance time bombs. It takes one careless moment, or one developer who doesn’t know Hibernate as well as you do, to touch the wrong property and cause a big issue to ripple through your system.
Querying Carefully
Rather than maintain constant vigilance and make sure that every member of your team has read all of the documentation, I prefer to tread lightly with Hibernate. I fully embrace it for helping me generate my schema, for updating the database in an infrequent-write system, and for the query language that bridges different databases.
There are several ways you can use Hibernate to query your database, from full-on magic entities to getting your hands dirty for better efficiency. In a system where reads are common and writes rare, my approach is typically to use as little “magic” as I can for my read-only queries. I don’t fetch attached entities from the database, but rather, select the specific fields that I need, and use the results to build detached data transfer objects (DTOs).
Aside from avoiding the big issues above, this also forces a developer to think more in terms of SQL, and where their data is coming from - it’s more explicit, and thus, more understandable. If you have a “dumb” User DTO, and want the name of that User’s Organization, you’re going to have to either run a query on each User, query them in bulk and then join the two sets of data, or add the Organization name to the original User query. It’s immediately obvious that the first option is ridiculous, and that the second one is annoying to write. The last one takes the least effort, and makes sense when you’re thinking in terms of SQL and database access.
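With JPQL’s constructor expressions, that last option looks something like this sketch (the package, DTO, and entity names are all mine):

```java
package com.example;

import java.util.List;
import javax.persistence.EntityManager;

// A "dumb" DTO: just the fields the view needs, detached from Hibernate.
public class UserRow {
    public final String userName;
    public final String organizationName;

    public UserRow(String userName, String organizationName) {
        this.userName = userName;
        this.organizationName = organizationName;
    }

    // One explicit query joins the organization name in up front: no
    // attached entities, no surprise queries later.
    public static List<UserRow> fetchAll(EntityManager entityManager) {
        return entityManager.createQuery(
                "select new com.example.UserRow(u.name, o.name) "
                        + "from User u join u.organization o", UserRow.class)
                .getResultList();
    }
}
```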
Here’s another point to consider: since it’s typically considered poor form to return your entities to the client or view, why not skip the entity-to-DTO conversion and just generate the DTOs in the first place?
Coming Up: Performance Testing Different Query Approaches
In my next post, I’ll walk you through a sample project that I wrote to test out different query strategies. If you’re interested, take a look for yourself.