Haskell Data Analysis cookbook

By: Nishant Shukla
Overview of this book

Step-by-step recipes filled with practical code samples and engaging examples demonstrate Haskell in practice, and then the concepts behind the code. This book shows functional developers and analysts how to leverage their existing knowledge of Haskell specifically for high-quality data analysis. A good understanding of data sets and functional programming is assumed.

Capturing table rows from an HTML page

Mining Hypertext Markup Language (HTML) is often a matter of identifying and parsing only its structured segments. Not all text in an HTML file is useful, so we usually focus on a specific subset. For instance, HTML tables and lists provide a strong, commonly used structure from which to extract data, whereas a paragraph in an article may be too unstructured and complicated to process.

In this recipe, we will find a table on a web page and gather all rows to be used in the program.

Getting ready

We will be extracting the values from an HTML table, so start by creating an input.html file containing a table as shown in the following figure:

[Figure: the Course Listing table rendered in a browser]

The HTML behind this table is as follows:

$ cat input.html

<!DOCTYPE html>
<html>
    <body>
        <h1>Course Listing</h1>
        <table>
            <tr>
                <th>Course</th>
                <th>Time</th>
                <th>Capacity</th>
            </tr>
            <tr>
                <td>CS 1501</td>
                <td>17:00</td>
                <td>60</td>
            </tr>
            <tr>
                <td>MATH 7600</td>
                <td>14:00</td>
                <td>25</td>
            </tr>
            <tr>
                <td>PHIL 1000</td>
                <td>9:30</td>
                <td>120</td>
            </tr>
        </table>
    </body>
</html>

If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:

$ cabal install hxt
$ cabal install split

How to do it...

  1. We will need the hxt package for XML manipulations and the chunksOf function from the split package, as presented in the following code snippet:
    import Text.XML.HXT.Core
    import Data.List.Split (chunksOf)
  2. Define and implement main to read the input.html file.
    main :: IO ()
    main = do
      input <- readFile "input.html"
  3. Feed the HTML data into readString, setting withParseHTML to yes and optionally turning off warnings. Extract all the td elements and collect the text inside them, as shown in the following code. Because only td elements are selected, the th header row is not part of the result:
      texts <- runX $ readString 
               [withParseHTML yes, withWarnings no] input 
        //> hasName "td"
        //> getText
  4. The data is now usable as a flat list of strings. Grouping every three cells into one row converts it into a list of lists, much like the rows in the earlier CSV recipe, as shown in the following code:
      let rows = chunksOf 3 texts
      print $ findBiggest rows
  5. At the top level, outside main, define findBiggest, which folds through the rows to identify the course with the largest capacity, using the following code snippet:
    findBiggest :: [[String]] -> [String]
    findBiggest [] = []
    findBiggest items = foldl1 
                        (\a x -> if capacity x > capacity a 
                                 then x else a) items
    
    -- Capacity of a three-cell row; any other row shape counts as -1.
    capacity :: [String] -> Int
    capacity [_, _, c] = toInt c
    capacity _ = -1
    
    toInt :: String -> Int
    toInt = read
  6. Running the code will display the class with the largest capacity as follows:
    $ runhaskell Main.hs
    
    ["PHIL 1000","9:30","120"]
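
The grouping and folding steps can also be tried without any HTML input. The following is a minimal sketch using only the base library, not the recipe's own code: groupsOf is a hypothetical stand-in for chunksOf, and capacityOf and findBiggestSafe are illustrative names that assume we would rather skip a malformed row than let read fail at runtime.

```haskell
import Data.List (foldl')
import Text.Read (readMaybe)

-- Hypothetical stand-in for Data.List.Split.chunksOf, written with base only.
groupsOf :: Int -> [a] -> [[a]]
groupsOf n xs
  | n <= 0 || null xs = []
  | otherwise         = take n xs : groupsOf n (drop n xs)

-- Capacity of a three-cell row; Nothing if the row has another shape
-- or its third cell is not an integer.
capacityOf :: [String] -> Maybe Int
capacityOf [_, _, c] = readMaybe c
capacityOf _         = Nothing

-- Row with the largest parseable capacity, if there is one.
-- Rows without a parseable capacity are skipped; earlier rows win ties.
findBiggestSafe :: [[String]] -> Maybe [String]
findBiggestSafe rows = snd <$> foldl' step Nothing rows
  where
    step best row = case capacityOf row of
      Nothing -> best
      Just n  -> case best of
        Just (m, _) | m >= n -> best
        _                    -> Just (n, row)
```

For the nine cells of the course table, findBiggestSafe (groupsOf 3 cells) evaluates to Just ["PHIL 1000","9:30","120"], and an empty list of rows gives Nothing instead of an empty row.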
    

How it works...

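This works much like XML parsing, except that we pass the options [withParseHTML yes, withWarnings no] to readString. The withParseHTML option switches HXT to its more lenient HTML parser, and withWarnings no suppresses the parser's warning messages. The //> operator searches the descendants of the nodes on its left for nodes matching the arrow on its right, so the pipeline first selects every td element and then the text nested inside it.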
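
The extraction step can be wrapped in a reusable function that takes the HTML as a string. This is a sketch under the recipe's own assumptions (the hxt and split packages are installed); cellsFromHtml and rowsFromHtml are hypothetical names introduced here, while the options and arrows are the ones used in the steps above.

```haskell
import Text.XML.HXT.Core
import Data.List.Split (chunksOf)

-- Text of every td cell in an HTML document, in document order.
-- Uses the same readString options and arrows as the recipe.
cellsFromHtml :: String -> IO [String]
cellsFromHtml html =
  runX $ readString [withParseHTML yes, withWarnings no] html
    //> hasName "td"
    //> getText

-- Hypothetical helper: group the cells into rows of the given width
-- (3 for the course table). Assumes every row has exactly that many
-- non-empty cells.
rowsFromHtml :: Int -> String -> IO [[String]]
rowsFromHtml width html = chunksOf width <$> cellsFromHtml html
```

A main function could then read input.html with readFile and pass the contents to rowsFromHtml 3. Note that an empty td produces no text node, so a table with blank cells would shift the grouping; such tables need the rows to be collected per tr element instead.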
