The web is one of the richest sources of data. A GET request is the most common way to retrieve a page from an HTTP web server. In this recipe, we will grab all the links from a Wikipedia article and print them to the terminal. To do so, we will use a helpful library called HandsomeSoup, which lets us traverse and manipulate a web page through CSS selectors.
We will be collecting all links from a Wikipedia web page. Make sure to have an Internet connection before running this recipe.
Install the HandsomeSoup CSS selector package, and also install the HXT library if it is not already installed. To do this, use the following commands:
$ cabal install HandsomeSoup
$ cabal install hxt
This recipe requires hxt for parsing HTML and HandsomeSoup for its easy-to-use CSS selectors, as shown in the following code snippet:
import Text.XML.HXT.Core
import Text.HandsomeSoup
Define and implement main as follows:
main :: IO ()
main = do
Pass in the URL as a string to HandsomeSoup's fromUrl function:
  let doc = fromUrl "http://en.wikipedia.org/wiki/Narwhal"
Select all links within the element whose id is bodyContent on the Wikipedia page, extract the href attribute of each one, and print the resulting list as follows:
  links <- runX $ doc >>> css "#bodyContent a" ! "href"
  print links
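Put together, the steps of this recipe might look like the following sketch. The helper name bodyLinks and the small sample page are illustrative assumptions rather than part of the recipe. The sample is parsed with HandsomeSoup's parseHtml so that the selector can be checked without a network connection before it is run against the live article. Whether fromUrl can fetch the live page may depend on the installed HandsomeSoup version and on how the server handles plain HTTP requests.

```haskell
import Text.XML.HXT.Core
import Text.HandsomeSoup

-- Hypothetical helper: collect the href attribute of every anchor
-- nested inside the element whose id is "bodyContent".
bodyLinks :: IOSArrow XmlTree XmlTree -> IO [String]
bodyLinks doc = runX $ doc >>> css "#bodyContent a" ! "href"

-- Hand-written sample page (illustrative data only). The second anchor
-- sits outside #bodyContent, so the selector should skip it.
samplePage :: String
samplePage =
  "<html><body>"
    ++ "<div id=\"bodyContent\"><a href=\"/wiki/Arctic\">Arctic</a></div>"
    ++ "<a href=\"/outside\">outside</a>"
    ++ "</body></html>"

main :: IO ()
main = do
  -- Offline check: parse the sample string instead of fetching a URL.
  offline <- bodyLinks (parseHtml samplePage)
  mapM_ putStrLn offline

  -- Live run: the same selector against the article used in this recipe.
  online <- bodyLinks (fromUrl "http://en.wikipedia.org/wiki/Narwhal")
  mapM_ putStrLn online
```

Printing with mapM_ putStrLn gives one link per line, which is easier to scan than the single Haskell list produced by print. Factoring the selector into one function also means the offline and live runs exercise exactly the same arrow.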