Haskell Data Analysis Cookbook
A directory search typically provides names and contact information per query. By brute forcing many of these search queries, we can obtain all data stored in the directory listing database. This recipe runs thousands of search queries to obtain as much data as possible from a directory search. This recipe is provided only as a learning tool to see the power and simplicity of data gathering in Haskell.
Make sure to have a strong Internet connection.
Install the hxt and HandsomeSoup packages using Cabal:
    $ cabal install hxt
    $ cabal install HandsomeSoup

Set up the following imports:

    import Network.HTTP
    import Network.URI
    import Text.XML.HXT.Core
    import Text.HandsomeSoup
Define a SearchResult type, which may either fail with an error or result in a success, as presented in the following code:

    type SearchResult = Either SearchResultErr [String]

    data SearchResultErr = NoResultsErr
                         | TooManyResultsErr
                         | UnknownErr
                         deriving (Show, Eq)

Build the request with the myRequest function defined in the previous recipe. The getDoc helper sends the request for a query and parses the response body as HTML:

    getDoc query = do
      rsp <- simpleHTTP $ myRequest query
      html <- getResponseBody rsp
      return $ readString [withParseHTML yes, withWarnings no] html

The scanDoc function inspects the parsed document. If the page has no h3 text, it returns the text of the td cells; otherwise, it maps the first h3 message to a SearchResultErr:

    scanDoc doc = do
      -- Any h3 text on the page is treated as a status or error message.
      errMsg <- runX $ doc >>> css "h3" //> getText
      case errMsg of
        [] -> do
          -- No message: collect the text of the result table cells.
          text <- runX $ doc >>> css "td" //> getText
          return $ Right text
        "Error: Sizelimit exceeded":_ ->
          return $ Left TooManyResultsErr
        "Too many matching entries were found":_ ->
          return $ Left TooManyResultsErr
        "No matching entries were found":_ ->
          return $ Left NoResultsErr
        _ -> return $ Left UnknownErr

Define and implement main. We will use a helper function, main', as shown in the following code snippet, to recursively brute force the directory listing:

    main :: IO ()
    main = main' "a"
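    main' :: String -> IO ()
    main' query = do
      print query
      doc <- getDoc query
      searchResult <- scanDoc doc
      print searchResult
      case searchResult of
        -- Too many matches: narrow the search with a longer prefix.
        Left TooManyResultsErr ->
          main' (nextDeepQuery query)
        -- Otherwise move on to the next prefix, or stop after "z".
        _ -> if (nextQuery query) >= endQuery
               then print "done!"
               else main' (nextQuery query)

The following helper functions define the next logical query:

    nextDeepQuery :: String -> String
    nextDeepQuery query = query ++ "a"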
    nextQuery :: String -> String
    nextQuery "z" = endQuery
    nextQuery query = if last query == 'z'
      -- A trailing 'z' is exhausted: drop it and advance the shorter prefix.
      then nextQuery $ init query
      else init query ++ [succ $ last query]

    -- Sentinel that sorts after every lowercase query.
    endQuery :: String
    endQuery = [succ 'z']

The code starts by searching for "a" in the directory lookup. This will most likely fail with an error because there are too many results. So, in the next iteration, the code refines its search by querying for "aa", then "aaa", until the result is no longer TooManyResultsErr :: SearchResultErr.
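The refinement loop depends on how the text of the page's h3 elements is classified. As a minimal sketch, that mapping can be pulled out of scanDoc into a pure function that can be checked without a network connection. The name classify is hypothetical and not part of the recipe; the message strings are the ones matched in scanDoc.

```haskell
-- Pure sketch of the message-to-error mapping performed by scanDoc.
-- `classify` is a hypothetical helper, not part of the original recipe.

type SearchResult = Either SearchResultErr [String]

data SearchResultErr = NoResultsErr
                     | TooManyResultsErr
                     | UnknownErr
                     deriving (Show, Eq)

-- First argument: text of the page's h3 elements.
-- Second argument: text of the page's td elements.
classify :: [String] -> [String] -> SearchResult
classify [] cells = Right cells
classify (msg:_) _
  | msg `elem` tooMany                      = Left TooManyResultsErr
  | msg == "No matching entries were found" = Left NoResultsErr
  | otherwise                               = Left UnknownErr
  where
    tooMany = [ "Error: Sizelimit exceeded"
              , "Too many matching entries were found" ]
```

With this split, scanDoc would only need to run the two css queries and pass their results to classify. Only the first h3 message is examined, as in the recipe.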
Then, it will move on to the next logical search query, "aab", and if that produces no result, it will search for "aac", and so on. This brute-force prefix search will obtain all items in the database. We can gather the mass of data, such as names and department types, to perform interesting clustering or analysis later on. The following figure shows how the program starts:

[Figure: how the program starts]
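To see the order in which queries are issued without contacting a server, the traversal can be simulated against an in-memory list. This is only a sketch under stated assumptions: search and crawl are hypothetical names, the "server" is assumed to match names by prefix, and a result-size limit stands in for TooManyResultsErr. The helpers nextDeepQuery, nextQuery, and endQuery are the ones from the recipe.

```haskell
import Data.List (isPrefixOf)

-- Query helpers, as defined in the recipe.
nextDeepQuery :: String -> String
nextDeepQuery query = query ++ "a"

nextQuery :: String -> String
nextQuery "z" = endQuery
nextQuery query = if last query == 'z'
  then nextQuery $ init query
  else init query ++ [succ $ last query]

endQuery :: String
endQuery = [succ 'z']

-- Hypothetical stand-in for the directory server (assumes prefix matching).
-- Nothing means "too many results"; Just hits is a possibly empty listing.
search :: Int -> [String] -> String -> Maybe [String]
search limit names query
  | length hits > limit = Nothing
  | otherwise           = Just hits
  where hits = filter (query `isPrefixOf`) names

-- Walk the query space the way main' does, but purely, returning every
-- query issued together with the names it retrieved.
crawl :: Int -> [String] -> [(String, [String])]
crawl limit names = go "a"
  where
    go query = case search limit names query of
      Nothing   -> (query, []) : go (nextDeepQuery query)
      Just hits -> (query, hits) : continue (nextQuery query)
    continue next
      | next >= endQuery = []
      | otherwise        = go next
```

For example, with a limit of 2 and the names "aaron", "abby", "alice", and "bob", the query "a" overflows, so the walk descends to "aa", "ab", and so on through "az" before resuming at "b". In this simulation, an entry that is exactly equal to an overflowing prefix is never listed, because every deeper query is longer than the entry itself.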
