Haskell Data Analysis Cookbook
Mining Hypertext Markup Language (HTML) is often a feat of identifying and parsing only its structured segments. Not all text in an HTML file is useful, so we usually focus on a specific subset. For instance, HTML tables and lists provide a strong and commonly used structure from which to extract data, whereas a paragraph in an article may be too unstructured and complicated to process.
In this recipe, we will find a table on a web page and gather all rows to be used in the program.
We will be extracting the values from an HTML table, so start by creating an input.html file containing a table as shown in the following figure:

The HTML behind this table is as follows:
$ cat input.html
<!DOCTYPE html>
<html>
<body>
<h1>Course Listing</h1>
<table>
<tr>
<th>Course</th>
<th>Time</th>
<th>Capacity</th>
</tr>
<tr>
<td>CS 1501</td>
<td>17:00</td>
<td>60</td>
</tr>
<tr>
<td>MATH 7600</td>
<td>14:00</td>
<td>25</td>
</tr>
<tr>
<td>PHIL 1000</td>
<td>9:30</td>
<td>120</td>
</tr>
</table>
</body>
</html>

If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:
$ cabal install hxt
$ cabal install split
We will need the hxt package for XML manipulation and the chunksOf function from the split package, as presented in the following code snippet:

import Text.XML.HXT.Core
import Data.List.Split (chunksOf)
Define and implement main to read the input.html file.

main :: IO ()
main = do
  input <- readFile "input.html"
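Feed the HTML data into readString, setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain the text they contain, as shown in the following code:

  texts <- runX $ readString
    [withParseHTML yes, withWarnings no] input
    //> hasName "td"
    //> getText

The extracted text is a flat list of strings. Group it into rows of three, one per table row, and print the course with the largest capacity:

  let rows = chunksOf 3 texts
  print $ findBiggest rows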
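To see what the grouping step produces, here is a minimal sketch that uses only base. The name rowsOf is a hypothetical stand-in for chunksOf from the split package, and sampleTexts is the list we would expect the td extraction to return for input.html (an assumption, hard-coded here for illustration).

```haskell
-- Hypothetical stand-in for Data.List.Split.chunksOf, written with base only.
-- Assumes n > 0; a non-positive n would never consume the list.
rowsOf :: Int -> [a] -> [[a]]
rowsOf _ [] = []
rowsOf n xs = row : rowsOf n rest
  where (row, rest) = splitAt n xs

-- Assumed result of extracting the td text from input.html.
sampleTexts :: [String]
sampleTexts =
  [ "CS 1501", "17:00", "60"
  , "MATH 7600", "14:00", "25"
  , "PHIL 1000", "9:30", "120"
  ]

-- Three cells per row, matching the three table columns.
sampleRows :: [[String]]
sampleRows = rowsOf 3 sampleTexts
```

Evaluating sampleRows in GHCi should give three lists of three strings each. The grouping is purely positional, so a table row with a missing cell would shift every later row.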
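Fold over the rows to find the course with the largest capacity:

findBiggest :: [[String]] -> [String]
findBiggest [] = []
findBiggest items = foldl1
  (\a x -> if capacity x > capacity a
           then x else a) items

-- The capacity is the third field; malformed rows rank lowest.
capacity :: [String] -> Int
capacity [_, _, c] = toInt c
capacity _ = -1

toInt :: String -> Int
toInt = read

Run the program:

$ runhaskell Main.hs
["PHIL 1000","9:30","120"]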
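Because toInt is defined with read, a capacity cell that is not a number (for example, "n/a") would make the program fail at run time. The following is a minimal sketch of a more defensive variant, not the recipe's own method. The names withCapacity and findBiggestSafe are hypothetical, and the sketch uses readMaybe from base to skip rows whose capacity cannot be parsed.

```haskell
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- Pair a three-field row with its parsed capacity.
-- Rows of the wrong shape, or with a non-numeric capacity, yield Nothing.
withCapacity :: [String] -> Maybe (Int, [String])
withCapacity row@[_, _, c] = fmap (\n -> (n, row)) (readMaybe c)
withCapacity _             = Nothing

-- Hypothetical total variant of findBiggest: Nothing when no row is usable.
-- The fold mirrors the recipe, keeping the first row with the largest capacity.
findBiggestSafe :: [[String]] -> Maybe [String]
findBiggestSafe rows =
  case mapMaybe withCapacity rows of
    []     -> Nothing
    scored -> Just (snd (foldl1 pick scored))
  where
    pick a x = if fst x > fst a then x else a
```

In main, print $ findBiggestSafe rows could replace print $ findBiggest rows. The result is then wrapped in Just, or is Nothing when the table has no usable rows.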
This is very similar to XML parsing, except we adjust the options of readString to [withParseHTML yes, withWarnings no].