HASH TABLES
What
is a Hash Table?
A hash table allows us to perform constant time insertion and search
operation on large amounts of data.
A typical implementation would be used in the symbol table of a
compiler. The compiler stores literals in hash-tables for performing
fast lookup on them during the parsing process.
Suppose we have this data set:
23, 45, 154, 1223, 46, 77, 97, 99, 100, 0, 11
continued below....
What is a bucket?
A bucket is an information store where values having the same 'hash'
value are stored.
So, 2 or more values having the same hash value are stored in the same
bucket in sorted or unsorted order within the same bucket.
An example...
Suppose the hash-table consists of 10 buckets, then the hash table
would be given by:
Bucket Number
|
Values
|
0
|
100
|
0
|
1
|
11
|
-
|
2
|
-
|
-
|
3
|
23
|
1223
|
4
|
154
|
-
|
5
|
45
|
-
|
6
|
46
|
-
|
7
|
77
|
97
|
8
|
-
|
-
|
9
|
99
|
-
|
The HASH-function chosen here is NUMBER%10, so the value obtained after
applying the above formula is the bucket number into which we insert the
element.
Searching for specified values in the
hash table.
Searching becomes fairly trivial. We obtain the bucket number for the
element to be searched by applying the formula, and then apply a linear
search algorithm on that bucket to see if that value exists. An
optimization would be to maintain the data in the buckets in sorted
order, so that some efficient searching algorithm like Binary Search may
be used. However, maintaining the data in sorted order is costly, and
also if the hash function has been chosen properly, then we would expect
collisions to occur less frequently, so sorting the data within a
bucket is usually not necessary unless the application demands that
ranges of values be easily accessible, and repearted values be allowed.
Multithreaded Approach-1
In the first approach, we assume:
1. During searching we try to find more or less all items.
2. The number of buckets can be fixed before we begin inserting the
values into the hash-table.
If (2) can not be fixed, then we may adopt one out of 2 methods:
A. Re-hash when the hash-table is resized.
B. Provide a new hash-function when the table is resized. Then the
search function will try both the hash functions while searching for
elements. This severely limits the re-sizing capabilities of the
hash-table. If the hash-table is resized say more than 2 times, then the
overhead for searching will be quite large, so if we use this approach,
we should at max by the 1st resize operation have fixed the size of the
hash-table.
What is the 1st approach?
In this method, we maintain during the insertion phase the same number
of parallel hash-tables as there are going to be threads. This can be
done automatically by the hash-table dynamically(at run-time). As a new
thread is introduced, it can create dynamically one more parallel
hash-table. Each thread will then insert only into it's private
hash-table. This is again handled by the hash-table implementation
automatically.
We decide a threshold value for each running thread, and hence each
parallel hash-table. If the number of elements in that parallel
hash-table exceeds the threshold value, then the elements are dumped
into the global hash-table. This process of dumping values to the global
hash-table is slightly complicated.
We first divide the global hash-table into 'n' equal parts. Then, we
have have locks for each of these 'n' parts of the global hash-table. If
a thread wishes to write values into any of the 'n' parts of the global
hash-table, then it must first acquire the lock for that part. This
will ensure that only 1 thread is writing data into some part. However,
the order in which each of these threads acquires is different for each
thread. This ensures that if 2 or more threads reach the threshold
value at the same time, or there abouts, then the concurrent locking
time can be reduced. A possible approach is discussed below:
Suppose we number of threads say 5, then they are numbered according to
the time at which they first inserted values into their hash-table. ie.
The time of their birth. So, the thread that inserted the value first is
numbered 1, the next one 2, and so on until 5.
We then use a formula to determine the starting block that that
particular thread will touch on the global hash-table. The formula is:
Index = Thread_number * Number of parts(n) / Total number of threads.
So, if the 3rd thread wishes to update the global hash-table, then, it
would start at the:
3 * n / 5 the position. For convenience, we assume that n = 10.
Thus, Index = 3*10/5 = 3*2 = 6.
Thus, the 3rd thread will begin synchronizing the 6th part of the
global hash-table with itself, and will begin inserting values there.
Then, it moves onto the 7th, 8th, 9th, 10, and wraps around to the 1st
part until it reaches back to the 6th part. At this point in time, the
private(local) hash-table is empty, and the global hash-table is updated
with the value that this one contained before the synchronization.
After all insertions are over, all the threads die, and the parallel
hash-tables dump their values into the global hash-table automatically.
This operation can also be performed manually.
Searching...
The searching process can proceed as described above for the general
case.
Multithreaded Approach-2
This method is inspired somewhat
by Tarjan's splay trees.
The assumptions made are quite different from theat in the first
approach. They are:
1. Not ALL values in the hash-table will be searched. This means that
some/most values will never be searched, and values searched will be
searched for again.
2. The number of buckets can be fixed initially.
We begin in a similar approach as we did in the first method, except,
that we never synchronize the parallely maintained hash-tables with the
global hash-table. Instead, when the insertion phase is completed, the
global hash-table is initialized to all empty buckets. Each bucket has a
valid bit associated with it. Initially all the valid bits are
reset(set to 0 to indicate invalid state for that bucket).
Searching...
The searching differs greatly here.
Now, when a search request for a value is received, then the hash value
is first computed, and the valid bit-for that bucket is checked. If the
bucket is in a valid state, then the normal linear search
algorithm/binary search algorithm is applied on that bucket, and the
value is returned accordingly. However, if the valid bit indicates that
the bucket is invalid, then it must be brought into a valid state.
A possible way of doing this would be:
First, divide the global hash-table into 'n' parts as in the previous
approach, and associate a mutex/lock with each part. When ever the
searching algorithm tries to update a bucket, it must first lock the
mutex that is associated with that part of the global hash-table which
that bucket is associated with. Then, it dumps the values from a
corresponding bucket in all the parallel hash tables that exist. The
valid bit for this bucket in the global hash-table is set to indicate a
valid bucket. Then, the lock is given back, and the linear/binary search
on that bucket on the global hash-table is performed as in the case
when the valid bit is indicating a valid bucket state.
We must note that the checking of the valid bit must be an atomic
operation, because it may be called concurrently. However, practially,
we may not implement it that way, because the mutex lock after will take
care of that in case of a conflict. At max. (worst case), what will
happen is that when the bucket is actually valid, and the bit is being
set by another thread, the next thread might query the bit, and find it
to be invalid or in an inconsistent state. This might generate machine
exceptions on some architectures, but on x86, there is no such problem.
In such a case, the next thread might try to bring the bucket into a
consistent state by synchronizing that global bucket with those of the
parallel hash-tables. Since they are already empty(after the first
thread has released the mutex), then the synchronization operation will
evaluate to a NOP logically speaking. The valid bit is made set, and
the search operation continues normally.
A Final Note...
Both the above mentioned
approaches seem very promising given that they are used in the proper
context and environment, and used in a manner in which they were ment to
be used.
However, the 2nd approach does suffer from some noteceable defects,
namely:
1. The parallel hash tables need to be maintained after the inserton
phase has completed. That's because the update of the global hash-table
happens only during the searching phase. So, the extra memory overhead
is involved.
2. There is a check required at every stage to determing the valid
state of a bucket. This overhead being very negligible, is an overhead
none the less compared with the 1st approach.
Collision resolution techniques such as using a second HASH function
seem very promising, and could be used in conjunction with the 2
approaches discussed above.