Haskell Data Analysis cookbook

By : Nishant Shukla

Overview of this book

Step-by-step recipes filled with practical code samples and engaging examples demonstrate Haskell in practice, and then the concepts behind the code. This book shows functional developers and analysts how to leverage their existing knowledge of Haskell specifically for high-quality data analysis. A good understanding of data sets and functional programming is assumed.

Traversing online directories for data

A directory search typically returns names and contact information for each query. By brute forcing many such queries, we can obtain all the data stored in the directory's database. This recipe runs thousands of search queries to obtain as much data as possible from a directory search. It is provided only as a learning tool to show the power and simplicity of data gathering in Haskell.

Getting ready

Make sure to have a strong Internet connection.

Install the hxt and HandsomeSoup packages using Cabal:

$ cabal install hxt
$ cabal install HandsomeSoup

How to do it...

  1. Set up the following dependencies:
    import Network.HTTP
    import Network.URI
    import Text.XML.HXT.Core
    import Text.HandsomeSoup
  2. Define a SearchResult type, which holds either an error or, on success, a list of result strings, as presented in the following code:
    type SearchResult = Either SearchResultErr [String]
    data SearchResultErr = NoResultsErr 
                         | TooManyResultsErr 
                         | UnknownErr     
                         deriving (Show, Eq)
  3. Define the POST request specified by the directory search website. Depending on the server, the POST request will be different. Instead of rewriting code, we use the myRequest function defined in the previous recipe.
  4. Write a helper function to obtain the document from an HTTP POST request, as shown in the following code:
    getDoc query = do  
        rsp <- simpleHTTP $ myRequest query
        html <- getResponseBody rsp
        return $ readString [withParseHTML yes, withWarnings no] html
  5. Scan the HTML document and return either an error or the resulting data. The code in this function depends on the error messages produced by the web page. In our case, the error messages are the strings matched in the following code:
    scanDoc doc = do
        errMsg <- runX $ doc >>> css "h3" //> getText
    
        case errMsg of 
            [] -> do 
                text <- runX $ doc >>> css "td" //> getText 
                return $ Right text
            "Error: Sizelimit exceeded":_ -> 
                return $ Left TooManyResultsErr
            "Too many matching entries were found":_ -> 
                return $ Left TooManyResultsErr
            "No matching entries were found":_ -> 
                return $ Left NoResultsErr
            _ -> return $ Left UnknownErr
  6. Define and implement main. We will use a helper function, main', as shown in the following code snippet, to recursively brute force the directory listing:
    main :: IO ()
    main = main' "a"
  7. Run a search for the query, and then recurse on the next query:
    main' query = do
        print query
        doc <- getDoc query
        searchResult <- scanDoc doc
        print searchResult
        case searchResult of
            Left TooManyResultsErr -> 
                main' (nextDeepQuery query)
            _ -> if (nextQuery query) >= endQuery 
                  then print "done!" else main' (nextQuery query)
  8. Write helper functions to define the next logical query as follows:
    nextDeepQuery query = query ++ "a"
    
    nextQuery "z" = endQuery
    nextQuery query = if last query == 'z'
                      then nextQuery $ init query
                      else init query ++ [succ $ last query]
    endQuery = [succ 'z']
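
The pure parts of the steps above can be checked without any network access. The following is a minimal sketch, not the recipe's own code: classifyHeadings is a hypothetical helper that mirrors the case expression in scanDoc, and the type signatures on the query helpers are additions. It has no main, so it is meant to be loaded in GHCi.

```haskell
-- Error type from step 2 of the recipe.
data SearchResultErr = NoResultsErr
                     | TooManyResultsErr
                     | UnknownErr
                     deriving (Show, Eq)

-- Hypothetical helper (not part of the recipe): maps the <h3> texts that
-- scanDoc extracts to an error, or to Nothing when the page has no <h3>.
classifyHeadings :: [String] -> Maybe SearchResultErr
classifyHeadings [] = Nothing
classifyHeadings (msg:_)
    | msg `elem` tooMany                      = Just TooManyResultsErr
    | msg == "No matching entries were found" = Just NoResultsErr
    | otherwise                               = Just UnknownErr
  where
    tooMany = [ "Error: Sizelimit exceeded"
              , "Too many matching entries were found" ]

-- Query helpers from step 8, with type signatures added.
endQuery :: String
endQuery = [succ 'z']          -- "{", the character after 'z'

nextDeepQuery :: String -> String
nextDeepQuery query = query ++ "a"

nextQuery :: String -> String
nextQuery "z" = endQuery
nextQuery query
    | last query == 'z' = nextQuery (init query)   -- carry: drop the 'z'
    | otherwise         = init query ++ [succ (last query)]

-- A few sample transitions:
--   nextQuery "aa"  == "ab"
--   nextQuery "az"  == "b"
--   nextQuery "azz" == "b"
--   nextQuery "zz"  == "{"   (equal to endQuery, so main' stops)
```

Note that nextQuery drops every trailing 'z' before incrementing, which is what moves the search from an exhausted prefix such as "az" back up to "b".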

How it works...

The code starts by searching for "a" in the directory lookup. This will most likely result in an error, as there are too many results. So, in the next iteration, the code refines its search by querying for "aa", then "aaa", and so on, until the search no longer returns TooManyResultsErr.

Then, it advances to the next logical search query, "aab". Unless that query also returns too many results, it continues with "aac", and so on. This brute force prefix search aims to obtain all items in the database. We can gather the mass of data, such as names and department types, to perform interesting clustering or analysis later on. The following figure shows how the program starts:

[Figure: output of the program as it starts]
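
To see the traversal order without contacting a real server, the recursion can be replayed against an in-memory list. The following is a sketch under stated assumptions: mockSearch is a hypothetical stand-in for getDoc and scanDoc that does a prefix match and reports TooManyResultsErr above a chosen limit, and crawl is a pure counterpart of main' that returns a log instead of printing. Neither name appears in the recipe.

```haskell
import Data.List (isPrefixOf)

type SearchResult = Either SearchResultErr [String]
data SearchResultErr = NoResultsErr
                     | TooManyResultsErr
                     | UnknownErr
                     deriving (Show, Eq)

-- Hypothetical stand-in for getDoc + scanDoc: a prefix search over an
-- in-memory list that fails when more than `limit` entries match.
mockSearch :: Int -> [String] -> String -> SearchResult
mockSearch limit entries query
    | null hits           = Left NoResultsErr
    | length hits > limit = Left TooManyResultsErr
    | otherwise           = Right hits
  where
    hits = filter (query `isPrefixOf`) entries

-- Query helpers from the recipe.
endQuery :: String
endQuery = [succ 'z']

nextDeepQuery :: String -> String
nextDeepQuery query = query ++ "a"

nextQuery :: String -> String
nextQuery "z" = endQuery
nextQuery query
    | last query == 'z' = nextQuery (init query)
    | otherwise         = init query ++ [succ (last query)]

-- Pure counterpart of main': the list of (query, result) pairs in the
-- order the recipe would issue the queries.
crawl :: (String -> SearchResult) -> String -> [(String, SearchResult)]
crawl search query = (query, result) : rest
  where
    result = search query
    rest = case result of
        Left TooManyResultsErr -> crawl search (nextDeepQuery query)
        _ | nextQuery query >= endQuery -> []
          | otherwise -> crawl search (nextQuery query)

-- All entries returned by successful queries, in visiting order.
collected :: [(String, SearchResult)] -> [String]
collected visited = concat [ hits | (_, Right hits) <- visited ]
```

For example, with the entries ["aaron", "abby", "adam", "bob"] and a limit of 2, crawl (mockSearch 2 entries) "a" visits "a" (too many results), then "aa" through "az", then "b" through "z", and collected recovers all four entries. The sketch also makes one limitation easy to observe: because queries are extended only with the letters 'a' to 'z', an entry whose next character is something else (a space, digit, or uppercase letter, if the search is case-sensitive) would be missed once its prefix exceeds the limit.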