-
Book Overview & Buying
-
Table Of Contents
Advanced C++ [Instructor Edition]
By :
In this activity, we'll be developing a simple spell check demonstration and try to make it faster incrementally. You can use the skeleton file, Speller.cpp, as a starting point. Perform the following steps to implement this activity:
set<string> setDict(vecDict.begin(), vecDict.end());
vector<int> ret;
for(int i = 0; i < vecText.size(); ++i)
{
const string &s = vecText[i];
if(!setDict.count(s))
{
ret.push_back(i);
}
};
$ g++ -O3 Speller1.cpp Timer.cpp
$ ./a.out
The following output will be generated:

#include <unordered_set>
unordered_set<string> setDict(vecDict.begin(), vecDict.end());
$ g++ -O3 Speller2.cpp Timer.cpp
$ ./a.out
The following output will be generated:

const size_t SIZE = 16777215;
template<size_t SEED> size_t hasher(const string &s)
{
size_t h = 0;
size_t len = s.size();
for(size_t i = 0; i < len; i++)
{
h = h * SEED + s[i];
}
return h & SIZE;
}
Here, we used an integer template parameter so that we can create any number of different hash functions with the same code. Notice the use of the 16777215 constant, which is equal to 2^24 – 1. This lets us use the fast bitwise-and operator instead of the modulus operator to keep the hashed integer less than SIZE. If you want to change the size, keep it as one less than a power of two.
vector<bool> m_Bloom;
m_Bloom.resize(SIZE);
for(auto i = vecDict.begin(); i != vecDict.end(); ++i)
{
m_Bloom[hasher<131>(*i)] = true;
m_Bloom[hasher<3131>(*i)] = true;
m_Bloom[hasher<31313>(*i)] = true;
}
for(int i = 0; i < vecText.size(); ++i)
{
const string &s = vecText[i];
bool hasNoBloom =
!m_Bloom[hasher<131>(s)]
&& !m_Bloom[hasher<3131>(s)]
&& !m_Bloom[hasher<31313>(s)];
if(hasNoBloom)
{
ret.push_back(i);
}
else if(!setDict.count(s))
{
ret.push_back(i);
}
}
The bloom filter is checked first and if it finds the word in the dictionary, we have to verify it, like we did previously.
$ g++ -O3 Speller3.cpp Timer.cpp
$ ./a.out
The following output will be generated:
In the preceding activity, we attempted to solve a real-world problem and make it more efficient. Let's consider some points for each of the implementations in the three steps, as follows:
From an implementation perspective, the following guidelines apply:
There are some questions worth probing, given the results we received:
Here are the answers to these questions:
std::unordered_set performs one hash operation and perhaps a couple of memory accesses before reaching the value that's stored. The Bloom filter we use performs three hash operations and three memory accesses. So, in essence, the work that's done by the bloom filter is more than the hash table. Since there are only 31,870 words in our dictionary, the cache benefits of the Bloom filter are lost. This is another case where the traditional analysis of data structures does not correspond to real-life results because of caching.
When a larger capacity is used, the number of hash collisions reduce, along with false positives, but the caching behavior worsens. Conversely, when a smaller capacity is used, the hash collisions and the false positives increase, but the caching behavior improves.
The more hash functions are used, the fewer the false positives, and vice versa.
Bloom filters work best when the cost of testing a few bits is less than the cost of accessing the value in the hash table. This only becomes true when the Bloom filter bits fit completely within the cache and the dictionary does not.
Change the font size
Change margin width
Change background colour