Mining Hypertext Markup Language (HTML) is often a matter of identifying and parsing only its structured segments. Not all text in an HTML file is useful, so we focus on a specific subset. For instance, HTML tables and lists provide a strong, commonly used structure from which to extract data, whereas a paragraph in an article may be too unstructured and complicated to process.
In this recipe, we will find a table on a web page and gather all rows to be used in the program.
We will be extracting the values from an HTML table, so start by creating an input.html file containing a table as shown in the following figure:
The HTML behind this table is as follows:
$ cat input.html
<!DOCTYPE html>
<html>
<body>
<h1>Course Listing</h1>
<table>
  <tr>
    <th>Course</th>
    <th>Time</th>
    <th>Capacity</th>
  </tr>
  <tr>
    <td>CS 1501</td>
    <td>17:00</td>
    <td>60</td>
  </tr>
  <tr>
    <td>MATH 7600</td>
    <td>14:00</td>
    <td>25</td>
  </tr>
  <tr>
    <td>PHIL 1000</td>
    <td>9:30</td>
    <td>120</td>
  </tr>
</table>
</body>
</html>
If not already installed, use Cabal to set up the HXT library and the split library, as shown in the following command lines:
$ cabal install hxt
$ cabal install split
We will need the hxt package for XML manipulations and the chunksOf function from the split package, as presented in the following code snippet:

import Text.XML.HXT.Core
import Data.List.Split (chunksOf)
Define and implement main to read the input.html file.

main :: IO ()
main = do
  input <- readFile "input.html"
Feed the HTML data into readString, setting withParseHTML to yes and optionally turning off warnings. Extract all the td tags and obtain the text they contain, as shown in the following code:

  texts <- runX $ readString [withParseHTML yes, withWarnings no] input
    //> hasName "td"
    //> getText
The data is now usable as a flat list of strings. Because the table has three columns, grouping the list into chunks of three yields one list per row, similar to how CSV data was represented in the previous CSV recipe, as shown in the following code:
  let rows = chunksOf 3 texts
  print $ findBiggest rows
By folding through the data, identify the course with the largest capacity using the following code snippet:
findBiggest :: [[String]] -> [String]
findBiggest [] = []
findBiggest items = foldl1 (\a x -> if capacity x > capacity a then x else a) items

-- The capacity is the third cell of a row; any other row shape counts as -1.
capacity :: [String] -> Int
capacity [_, _, c] = toInt c
capacity _ = -1

toInt :: String -> Int
toInt = read
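Note that toInt relies on read, which raises a runtime error if a capacity cell does not hold an integer. The following is a minimal sketch of a more defensive variant, not part of the recipe itself: the names withCapacity and findBiggestSafe are hypothetical, and the "n/a" cell in the sample data is invented purely to show a malformed row being skipped.

```haskell
import Data.List (maximumBy)
import Data.Maybe (mapMaybe)
import Data.Ord (comparing)
import Text.Read (readMaybe)

-- Pair a row with its capacity, or Nothing if the row does not have
-- exactly three cells or the third cell is not an integer.
-- (Hypothetical helper, not from the recipe.)
withCapacity :: [String] -> Maybe (Int, [String])
withCapacity row@[_, _, c] = fmap (\n -> (n, row)) (readMaybe c)
withCapacity _             = Nothing

-- Row with the largest capacity among the well-formed rows, if any.
findBiggestSafe :: [[String]] -> Maybe [String]
findBiggestSafe rows =
  case mapMaybe withCapacity rows of
    []     -> Nothing
    scored -> Just (snd (maximumBy (comparing fst) scored))

main :: IO ()
main = print $ findBiggestSafe
  [ ["CS 1501", "17:00", "60"]
  , ["MATH 7600", "14:00", "n/a"]  -- invented malformed capacity, skipped
  , ["PHIL 1000", "9:30", "120"]
  ]
```

Returning Maybe [String] makes the empty and all-malformed cases explicit instead of encoding them as an empty list or a capacity of -1.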
Running the code will display the class with the largest capacity as follows:
$ runhaskell Main.hs
["PHIL 1000","9:30","120"]
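Grouping cells with a fixed chunk size assumes that every row has exactly three cells. As a hedged alternative, the sketch below groups the cells by their enclosing tr element using HXT's listA combinator, so the row width does not need to be known in advance. The function name tableRows is hypothetical, and the sketch assumes that each td holds plain text as a direct child (an empty cell would simply be dropped from its row).

```haskell
import Text.XML.HXT.Core

-- Collect the text of every <td>, grouped by the <tr> that contains it.
-- (Hypothetical helper, not from the recipe.)
tableRows :: String -> IO [[String]]
tableRows html = runX $
  readString [withParseHTML yes, withWarnings no] html
    //> hasName "tr"
    >>> listA (getChildren >>> hasName "td" >>> getChildren >>> getText)

main :: IO ()
main = do
  input <- readFile "input.html"
  rows <- tableRows input
  -- The header row holds only <th> cells, so it comes back as an empty list.
  mapM_ print (filter (not . null) rows)
```

The resulting list of rows can be passed to findBiggest in place of the output of chunksOf 3.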