Hash table

What is a hash?

A hash is used across technologies and domains. As a result the word has a few meanings. To give a simple definition, as it relates to a hash table, a hash is a value (usually a number) you get back from a hash function that represents what was hashed. The thing could be an object or a primitive type.

So you call a hash function, you get a value back, simple right? The important bit that becomes very powerful in computing is the fact that the next time you supply the hash function an identical object, it creates an identical hash. I’ll explain in a bit, first we need more definitions.

The way is which a hash function generates a hash value is called a hashing algorithm. There’s a few requirements for a good hashing algorithm that I won’t get into here. Just know that there are rules and without following them hash tables don’t work.

What is a table?

In the context of this post, a table is a place you can put your data, like an array.

What is a Hash table?

At a high level a hash table is a key-value map. The key in a hash table is the output from the hash function and your value is the item, object, etc… you’d like to retrieve/store. When retrieving/storing from a hash table you only interface with the hash function. The hash function takes your key and uses it to generate a hashcode that indicates the index where your item is or will be stored.

The thing that makes hash tables powerful is their speed. Getting things out of a hash table is very fast. The speed of hash tables is relative to classic data structures like arrays and lists. Storing and item in most data structures is pretty quick and easy. Retrieving items from a classic data structure like an array is painfully slow because it requires you to check each position of the array until you find what you’re looking for. The larger your array the slower the retrieval.

Since the speed of data structures is highly variable, the speed of a data structure is generalized using Big O notation. Iteratively checking each position of an array is O(n) or linear time. Which means that as the array grows so does the time it takes to search it. There are search algorithms that can find a result faster than linear time and a clever combination of a sorting algorithm and a search algorithm might find a result even faster still. However the combination would slow down as the size of its input grows.

The neat thing about hash tables is that they can be searched in O(1) or constant time. Constant time means that no matter how much data is in the hash table, it will find a result in the same number of steps. So it can search one million table entries at the same speed as it searches one thousand table entries. The speed of the lookup steps depends on the hash function.

If you’ve read this far and are wondering why you’ve never heard of this incredibly powerful data structure, it’s because most languages do not refer to hash tables by their formal name. Languages instead have names for their individual implementations of hash tables. For example, Java has the HashMap and HashTable classes. HashMap being the more common one. I don’t think I’ve seen HashTable used before.

What is a collision?

A collision is what happens when your hash function generates a new hash/key and the corresponding spot in the table is already occupied. In case that wasn’t clear, here’s a step-by-step: 1. Your hash function is given a piece of data to store. 2. The hash function calculates a hash/key from the data that corresponds to a position in the table. 3. Your hash function checks the position it calculated and finds that a previously calculated value is already present. 4. ???

Step 3 is called a collision and step 4 is how we deal with it. Depending on how collisions are handled a given hash table implementation might become slower or even useless over time.

How to handle collisions?

Collisions can be handled in a number of ways. The general idea of collision resolution is “Oh crap, that spot is taken, I need to find somewhere to put this data”. A very simple naive way of handling it is to simply move one space over from the position given by the hash function and put the data there. This strategy is called “linear probing”. You can find similar styles of collision resolution that involve probing by looking up “open addressing”.

The problem with collision resolution strategies that involve probing is that they tend to create sequential positions in the table that are already filled. This seems fine because you created the table to fill it, right? Well the problem is that the hash function will continue to generate positions for new values and it could select the position that the last collision just filled. As a result, it’ll probe again and put a value in the next available position. See the problem? Put simply, each collision increases the likelihood of future collisions and increases the lookup time for entries in the table. This is called “clustering”. Clustering will ruin the performance of a good hash table.

Another collision resolution method is to simply create a list in the spot where the collision occurred and add the new data to the end of the list. This strategy is called chaining and is usually done with linked lists.

What’s a “load factor”?

Another important concept related to hash tables is “load factor”. Load factor is the percentage of the hash tables slots/positions that are filled. The load factor is a simple metric that a hash table’s hashing algorithm can use to determine what actions to take to preserve performance. If your load factor is low then it’s unlikely that your lookup or storage times will slow down because if there’s a decent amount of spaces left in the table then your hash function still has a decent chance of selecting an open position. If your load factor is high then your hash tables lookup and storage times may slow to a crawl. One way to preserve performance would be to resize the table once the load factor is over fifty percent and rehash/redistribute all the values in relation to the new table size.

When would you use it?

When would you use a hash table? Anytime you possibly can. I bet you’re already using hash tables in whatever language you normally use. Basically, you’d use hash tables whenever you need to be able to store and lookup values during a programs execution.

Why is it valuable?

Hash tables are valuable because they’re fast data structures. Fast data structures makes fast code. Fast code makes $$$. Fast code also makes the world a better place. No one likes to wait.