HASH TABLES



What is a Hash Table?

A hash table allows us to perform constant time insertion and search operation on large amounts of data.
A typical implementation would be used in the symbol table of a compiler. The compiler stores literals in hash-tables for performing fast lookup on them during the parsing process.

Suppose we have this data set:
23, 45, 154, 1223, 46, 77, 97, 99, 100, 0, 11

continued below....

What is a bucket?

A bucket is an information store where values having the same 'hash' value are stored.
So, 2 or more values having the same hash value are stored in the same bucket in sorted or unsorted order within the same bucket.

An example...

Suppose the hash-table consists of 10 buckets, then the hash table would be given by:

Bucket Number
Values
0
100
0
1
11
-
2
-
-
3
23
1223
4
154
-
5
45
-
6
46
-
7
77
97
8
-
-
9
99
-


The HASH-function chosen here is NUMBER%10, so the value obtained after applying the above formula is the bucket number into which we insert the element.


Searching for specified values in the hash table.

Searching becomes fairly trivial. We obtain the bucket number for the element to be searched by applying the formula, and then apply a linear search algorithm on that bucket to see if that value exists. An optimization would be to maintain the data in the buckets in sorted order, so that some efficient searching algorithm like Binary Search may be used. However, maintaining the data in sorted order is costly, and also if the hash function has been chosen properly, then we would expect collisions to occur less frequently, so sorting the data within a bucket is usually not necessary unless the application demands that ranges of values be easily accessible, and repearted values be allowed.






Multithreaded Approach-1



In the first approach, we assume:
1. During searching we try to find more or less all items.
2. The number of buckets can be fixed before we begin inserting the values into the hash-table.

If (2) can not be fixed, then we may adopt one out of 2 methods:
A. Re-hash when the hash-table is resized.
B. Provide a new hash-function when the table is resized. Then the search function will try both the hash functions while searching for elements. This severely limits the re-sizing capabilities of the hash-table. If the hash-table is resized say more than 2 times, then the overhead for searching will be quite large, so if we use this approach, we should at max by the 1st resize operation have fixed the size of the hash-table.

What is the 1st approach?

In this method, we maintain during the insertion phase the same number of parallel hash-tables as there are going to be threads. This can be done automatically by the hash-table dynamically(at run-time). As a new thread is introduced, it can create dynamically one more parallel hash-table. Each thread will then insert only into it's private hash-table. This is again handled by the hash-table implementation automatically.

We decide a threshold value for each running thread, and hence each parallel hash-table. If the number of elements in that parallel hash-table exceeds the threshold value, then the elements are dumped into the global hash-table. This process of dumping values to the global hash-table is slightly complicated.

We first divide the global hash-table into 'n' equal parts. Then, we have have locks for each of these 'n' parts of the global hash-table. If a thread wishes to write values into any of the 'n' parts of the global hash-table, then it must first acquire the lock for that part. This will ensure that only 1 thread is writing data into some part. However, the order in which each of these threads acquires is different for each thread. This ensures that if 2 or more threads reach the threshold value at the same time, or there abouts, then the concurrent locking time can be reduced. A possible approach is discussed below:

Suppose we number of threads say 5, then they are numbered according to the time at which they first inserted values into their hash-table. ie. The time of their birth. So, the thread that inserted the value first is numbered 1, the next one 2, and so on until 5.

We then use a formula to determine the starting block that that particular thread will touch on the global hash-table. The formula is:

Index = Thread_number * Number of parts(n) / Total number of threads.

So, if the 3rd thread wishes to update the global hash-table, then, it would start at the:
3 * n / 5 the position. For convenience, we assume that n = 10.

Thus, Index = 3*10/5 = 3*2 = 6.

Thus, the 3rd thread will begin synchronizing the 6th part of the global hash-table with itself, and will begin inserting values there. Then, it moves onto the 7th, 8th, 9th, 10, and wraps around to the 1st part until it reaches back to the 6th part. At this point in time, the private(local) hash-table is empty, and the global hash-table is updated with the value that this one contained before the synchronization.

After all insertions are over, all the threads die, and the parallel hash-tables dump their values into the global hash-table automatically. This operation can also be performed manually.

Searching...

The searching process can proceed as described above for the general case.






Multithreaded Approach-2



This method is inspired somewhat by Tarjan's splay trees.

The assumptions made are quite different from theat in the first approach. They are:
1. Not ALL values in the hash-table will be searched. This means that some/most values will never be searched, and values searched will be searched for again.
2. The number of buckets can be fixed initially.

We begin in a similar approach as we did in the first method, except, that we never synchronize the parallely maintained hash-tables with the global hash-table. Instead, when the insertion phase is completed, the global hash-table is initialized to all empty buckets. Each bucket has a valid bit associated with it. Initially all the valid bits are reset(set to 0 to indicate invalid state for that bucket).

Searching...

The searching differs greatly here.

Now, when a search request for a value is received, then the hash value is first computed, and the valid bit-for that bucket is checked. If the bucket is in a valid state, then the normal linear search algorithm/binary search algorithm is applied on that bucket, and the value is returned accordingly. However, if the valid bit indicates that the bucket is invalid, then it must be brought into a valid state.

A possible way of doing this would be:

First, divide the global hash-table into 'n' parts as in the previous approach, and associate a mutex/lock with each part. When ever the searching algorithm tries to update a bucket, it must first lock the mutex that is associated with that part of the global hash-table which that bucket is associated with. Then, it dumps the values from a corresponding bucket in all the parallel hash tables that exist. The valid bit for this bucket in the global hash-table is set to indicate a valid bucket. Then, the lock is given back, and the linear/binary search on that bucket on the global hash-table is performed as in the case when the valid bit is indicating a valid bucket state.

We must note that the checking of the valid bit must be an atomic operation, because it may be called concurrently. However, practially, we may not implement it that way, because the mutex lock after will take care of that in case of a conflict. At max. (worst case), what will happen is that when the bucket is actually valid, and the bit is being set by another thread, the next thread might query the bit, and find it to be invalid or in an inconsistent state. This might generate machine exceptions on some architectures, but on x86, there is no such problem. In such a case, the next thread might try to bring the bucket into a consistent state by synchronizing that global bucket with those of the parallel hash-tables. Since they are already empty(after the first thread has released the mutex), then the synchronization operation will evaluate to a NOP logically speaking. The valid bit is made set, and the search operation continues normally.





A Final Note...



Both the above mentioned approaches seem very promising given that they are used in the proper context and environment, and used in a manner in which they were ment to be used.

However, the 2nd approach does suffer from some noteceable defects, namely:

1. The parallel hash tables need to be maintained after the inserton phase has completed. That's because the update of the global hash-table happens only during the searching phase. So, the extra memory overhead is involved.

2. There is a check required at every stage to determing the valid state of a bucket. This overhead being very negligible, is an overhead none the less compared with the 1st approach.


Collision resolution techniques such as using a second HASH function seem very promising, and could be used in conjunction with the 2 approaches discussed above.



Hosted by www.Geocities.ws

1