Book Image

C++ Data Structures and Algorithm Design Principles

By : John Carey, Anil Achary, Shreyans Doshi, Payas Rajan
Book Image

C++ Data Structures and Algorithm Design Principles

By: John Carey, Anil Achary, Shreyans Doshi, Payas Rajan

Overview of this book

C++ is a mature multi-paradigm programming language that enables you to write high-level code with a high degree of control over the hardware. Today, significant parts of software infrastructure, including databases, browsers, multimedia frameworks, and GUI toolkits, are written in C++. This book starts by introducing C++ data structures and how to store data using linked lists, arrays, stacks, and queues. In later chapters, the book explains the basic algorithm design paradigms, such as the greedy approach and the divide-and-conquer approach, which are used to solve a large variety of computational problems. Finally, you will learn the advanced technique of dynamic programming to develop optimized implementations of several algorithms discussed in the book. By the end of this book, you will have learned how to implement standard data structures and algorithms in efficient and scalable C++ 14 code.
Table of Contents (11 chapters)

Chapter 3: Hash Tables and Bloom Filters

Activity 6: Mapping Long URLs to Short URLs

In this activity, we will create a program to map shorter URLs to corresponding longer URLs. Follow these steps to complete the activity:

  1. Let's include the required headers:

    #include <iostream>

    #include <unordered_map>

  2. Let's write a struct called URLService that will provide the interface for the required services:

    struct URLService

    {

        using ActualURL = std::string;

        using TinyURL = std::string;

    private:

        std::unordered_map<TinyURL, ActualURL> data;

    As we can see, we've created a map from the small URL to the original URL. This is because we use the small URL for the lookup. We want to convert it into the original URL. As we saw earlier, a map can do fast lookups based on a key. So, we have kept the smaller URL as the key of the map and the original URL as the value of the map. We have created aliases to avoid confusion regarding which string we are talking about.

  3. Let's add a lookup function:

    public:

        std::pair<bool, ActualURL> lookup(const TinyURL& url) const

        {

            auto it = data.find(url);

            if(it == data.end())  // If small URL is not registered.

            {

                return std::make_pair(false, std::string());

            }

            else

            {

                return std::make_pair(true, it->second);

            }

        }

  4. Now, let's write a function to register the smaller URL for the given actual URL:

    bool registerURL(const ActualURL& actualURL, const TinyURL& tinyURL)

    {

        auto found = lookup(tinyURL).first;

        if(found)

        {

            return false;

        }

        data[tinyURL] = actualURL;

        return true;

    }

    The registerURL function returns if there is already an existing entry in the data. If so, it will not touch the entry. Otherwise, it will register the entry and return true to indicate that.

  5. Now, let's write a function to delete the entry:

    bool deregisterURL(const TinyURL& tinyURL)

    {

        auto found = lookup(tinyURL).first;

        if(found)

        {

            data.erase(tinyURL);

            return true;

        }

        return false;

    }

    As we can see, we are using the lookup function instead of rewriting the find logic again. This function is much more readable now.

  6. Now, let's write a function to print all the mappings for logging:

    void printURLs() const

    {

        for(const auto& entry: data)

        {

            std::cout << entry.first << " -> " << entry.second << std::endl;

        }

        std::cout << std::endl;

    }

    };

  7. Now, write the main function so that we can use this service:

    int main()

    {

        URLService service;

        if(service.registerURL("https://www.packtpub.com/eu/big-data-and-business-intelligence/machine-learning-r-third-edition", "https://ml-r-v3"))

        {

            std::cout << "Registered https://ml-r-v3" << std::endl;

        }

        else

        {

            std::cout << "Couldn't register https://ml-r-v3" << std::endl;

        }

        if(service.registerURL("https://www.packtpub.com/eu/virtualization-and-cloud/hands-aws-penetration-testing-kali-linux", "https://aws-test-kali"))

        {

            std::cout << "Registered https://aws-test-kali" << std::endl;

        }

        else

        {

            std::cout << "Couldn't register https://aws-test-kali" << std::endl;

        }

        if(service.registerURL("https://www.packtpub.com/eu/application-development/hands-qt-python-developers", "https://qt-python"))

        {

            std::cout << "Registered https://qt-python" << std::endl;

        }

        else

        {

            std::cout << "Couldn't register https://qt-python" << std::endl;

        }

        

        auto findMLBook = service.lookup("https://ml-r-v3");

        if(findMLBook.first)

        {

            std::cout << "Actual URL: " << findMLBook.second << std::endl;

        }

        else

        {

            std::cout << "Couldn't find URL for book for ML." << std::endl;

        }

        auto findReactBook = service.lookup("https://react-cookbook");

        if(findReactBook.first)

        {

            std::cout << "Actual URL: " << findReactBook.second << std::endl;

        }

        else

        {

            std::cout << "Couldn't find URL for book for React." << std::endl;

        }

        if(service.deregisterURL("https://qt-python"))

        {

            std::cout << "Deregistered qt python link" << std::endl;

        }

        else

        {

            std::cout << "Couldn't deregister qt python link" << std::endl;

        }

        auto findQtBook = service.lookup("https://qt-python");

        if(findQtBook.first)

        {

            std::cout << "Actual URL: " << findQtBook.second << std::endl;

        }

        else

        {

            std::cout << "Couldn't find Qt Python book" << std::endl;

        }

        std::cout << "List of registered URLs: " << std::endl;

        service.printURLs();

    }

  8. Let's look at the output of the preceding code:

    Registered https://ml-r-v3

    Registered https://aws-test-kali

    Registered https://qt-python

    Actual URL: https://www.packtpub.com/eu/big-data-and-business-intelligence/machine-learning-r-third-edition

    Couldn't find URL for book for React.

    Deregistered qt python link

    Couldn't find Qt Python book

    List of registered URLs:

    https://ml-r-v3 -> https://www.packtpub.com/eu/big-data-and-business-intelligence/machine-learning-r-third-edition

    https://aws-test-kali -> https://www.packtpub.com/eu/virtualization-and-cloud/hands-aws-penetration-testing-kali-linux

As we can see, we are getting both the valid URLs at the end, and not the one we deregistered successfully.

Activity 7: Email Address Validator

In this activity, we will create a validator to check if an email address requested by a user is already taken. Complete the activity using these steps:

  1. Let's include the required headers:

    #include <iostream>

    #include <vector>

    #include <openssl/md5.h>

  2. Let's add a class for the Bloom filter:

    class BloomFilter

    {

        int nHashes;

        std::vector<bool> bits;

        static constexpr int hashSize = 128/8;

        unsigned char hashValue[hashSize];

  3. Let's add a constructor for this:

    BloomFilter(int size, int hashes) : bits(size), nHashes(hashes)

    {

        if(nHashes > hashSize)

        {

            throw ("Number of hash functions too high");

        }

        if(size > 255)

        {

            throw ("Size of bloom filter can't be >255");

        }

    }

    Since we're going to use each byte in the hash value buffer as a different hash function value, and the size of the hash value buffer is 16 bytes (128 bits), we can't have more hash functions than that. Since each hash value is just 1 byte, its possible values are 0 to 255. So, the size of the Bloom filter can't exceed 255. Hence, we're throwing an error in the constructor itself.

  4. Now, let's write a hash function. It simply uses the MD5 function to calculate the hash:

    void hash(const std::string& key)

    {

        MD5(reinterpret_cast<const unsigned char*>(key.data()), key.length(), hashValue);

    }

  5. Let's add the function so that we can insert an email:

    void add(const std::string& key)

    {

        hash(key);

        for(auto it = &hashValue[0]; it < &hashValue[nHashes]; it++)

        {

            bits[*it] = true;

        }

        std::cout << key << " added in bloom filter." << std::endl;

    }

    As we can see, we are iterating from the the bytes 0 to nHashes in the hash value buffer and setting each bit to 1.

  6. Similarly, let's add a function to find an email address:

    bool mayContain(const std::string &key)

        {

            hash(key);

            for (auto it = &hashValue[0]; it < &hashValue[nHashes]; it++)

            {

                if (!bits[*it])

                {

                    std::cout << key << " email can by used." << std::endl;

                    return false;

                }

            }

            std::cout << key << " email is used by someone else." << std::endl;

            return true;

        }

    };

  7. Let's add the main function:

    int main()

    {

        BloomFilter bloom(10, 15);

        bloom.add("[email protected]");

        bloom.add("[email protected]");

        bloom.mayContain("abc");

        bloom.mayContain("[email protected]");

        bloom.mayContain("xyz");

        bloom.add("[email protected]");

        bloom.add("[email protected]");

        bloom.mayContain("abc");

        bloom.mayContain("[email protected]");

    }

    The following is one of the possible outputs of the preceding code:

    [email protected] added in bloom filter.

    [email protected] added in bloom filter.

    abc email can by used.

    [email protected] email is used by someone else.

    xyz email can by used.

    [email protected] added in bloom filter.

    [email protected] added in bloom filter.

    abcd email can by used.

    [email protected] email is used by someone else.

This is one of the possible outputs because MD5 is a randomized algorithm. If we choose the number of functions and the size of the Bloom filter in a thoughtful way, we should get really good accuracy with the MD5 algorithm.