Hash Tables

Overview

Hash Tables allow you to do very quick lookup (random access) for unordered datasets. They go by several other names:

Maps / Hash Maps (C/C++)
Dictionaries (Python)
Associative Arrays (Bash)

Hash Functions

Hash functions are mathematical operations that take strings as input, and provide numbers as output.

Insert a unique string and get a unique number
Output is consistently linked to input (the function gives the same output every time)
Different inputs shouldnt map to the same output (collision)

A hash table simply uses a hash function to map strings into positions of an array.

Usecases

Need fast lookup for unordered (non-sequential) data
Preventing duplicate entries
Use as a cache (remember recently used data)
Model relationships between one object and another

Hash Collisions

We said that hash functions always map different keys to different slots. In reality, it's really hard to make a hash function where every input gives a different output.

Dealing with Collisions

The simplest way to deal with a collision is to replace the entry at a location with a linked list. Each overlapping entry would be contained in the list.

Doing this too much can severly decrease the performance of the table, since you end up with a linked list, right?

Hash Table Performance

	Average	Worst
Search	`O(1)`	`O(n)`
Insert	`O(1)`	`O(n)`
Delete	`O(1)`	`O(n)`

To improve performance, we need to limit collisions:

Keep a low Load Factor
Use a good Hash Function (it should spread items around very evenly)

Load Factor

Load factor: (number_of_filled_slots / total_number_of_slots)

It's generally a good idea to keep the load factor under 0.7

When you have to increase the size of the array, you should double the size. This will ammortize the resizing complexity