Haskell Data Analysis cookbook

By: Nishant Shukla

Overview of this book

Step-by-step recipes filled with practical code samples and engaging examples demonstrate Haskell in practice, and then the concepts behind the code. This book shows functional developers and analysts how to leverage their existing knowledge of Haskell specifically for high-quality data analysis. A good understanding of data sets and functional programming is assumed.

Reading an XML file using the HXT package

Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/).

In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates.

Getting ready

We will first set up an XML file called input.xml that represents an e-mail thread between Databender and Princess on December 18 and 19, 2014:

$ cat input.xml

<thread>
    <email>
        <to>Databender</to>
        <from>Princess</from>
        <date>Thu Dec 18 15:03:23 EST 2014</date>
        <subject>Joke</subject>
        <body>Why did you divide sin by tan?</body>
    </email>
    <email>
        <to>Princess</to>
        <from>Databender</from>
        <date>Fri Dec 19 3:12:00 EST 2014</date>
        <subject>RE: Joke</subject>
        <body>Just cos.</body>
    </email>
</thread>

Using Cabal, install the HXT library, which we use for manipulating XML documents:

$ cabal install hxt

How to do it...

  1. We need only one import, which provides the XML parsing functions:
    import Text.XML.HXT.Core
  2. Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code:
    main :: IO ()
    main = do
        input <- readFile "input.xml"
  3. Apply the readString function to the input and extract the text of all the date elements. We keep only the nodes with a specific name using the hasName :: String -> a XmlTree XmlTree function, and we extract the text using the getText :: a XmlTree String function, as shown in the following code snippet:
        dates <- runX $ readString [withValidate no] input 
            //> hasName "date" 
            //> getText
  4. We can now use the list of extracted dates as follows:
        print dates
  5. By running the code, we print the following output:
    $ runhaskell Main.hs

    ["Thu Dec 18 15:03:23 EST 2014","Fri Dec 19 3:12:00 EST 2014"]
    
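For reference, the fragments from the steps above can be assembled into a single Main.hs. This is a minimal sketch: the extractDates helper is a hypothetical name introduced here only to keep the arrow pipeline separate from the file I/O, and it is not part of the recipe or of HXT.

```haskell
-- Main.hs: the recipe's fragments assembled into one program.
import Text.XML.HXT.Core

-- Hypothetical helper (not in the recipe): parse an XML string and
-- return the text found under every <date> element.
extractDates :: String -> IO [String]
extractDates xml =
    runX $ readString [withValidate no] xml
        //> hasName "date"
        //> getText

main :: IO ()
main = do
    -- Read the whole file, then hand the contents to the arrow pipeline.
    input <- readFile "input.xml"
    dates <- extractDates input
    print dates
```

Keeping the parsing in its own function makes it easy to try the same pipeline on an inline string in GHCi.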

How it works...

The library function runX takes an Arrow. Loosely, think of an Arrow as a more general abstraction than a Monad: it describes a computation from an input to an output that can also carry state and I/O. HXT's arrows allow stateful, global XML processing. Specifically, the runX function in this recipe takes an IOSArrow XmlTree String and returns an IO action that yields a list of String values, one for each match. We build this IOSArrow with the readString function, which parses the XML text into a tree, and then chain the filters onto it.
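
To make the types concrete, the sketch below gives the arrow a name and a type signature before handing it to runX. The names sampleXml and dateArrow are assumptions made for illustration, and the inline XML string stands in for input.xml so that the example runs on its own.

```haskell
import Text.XML.HXT.Core

-- Hypothetical inline sample so the sketch runs without input.xml.
sampleXml :: String
sampleXml =
    "<thread><email><date>Thu Dec 18 15:03:23 EST 2014</date></email></thread>"

-- The arrow is an ordinary value; nothing is parsed until runX executes it.
dateArrow :: IOSArrow XmlTree String
dateArrow =
    readString [withValidate no] sampleXml
        //> hasName "date"
        //> getText

-- runX has the type IOSArrow XmlTree c -> IO [c], so here it yields IO [String].
main :: IO ()
main = do
    dates <- runX dateArrow
    mapM_ putStrLn dates
```

Because runX returns a list, an arrow that matches nothing simply yields an empty list rather than failing.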

To search the XML document at any depth, use //>, whereas /> looks only at the direct children of the current node. We use the //> function to find the date elements and then collect all the text they contain.
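
The difference between the two operators can be seen by running both on the same input. The following is a sketch under assumptions: sampleXml, doc, and the three query names are hypothetical, and the inline string is a cut-down version of input.xml.

```haskell
import Text.XML.HXT.Core

-- Hypothetical inline sample, a cut-down version of input.xml.
sampleXml :: String
sampleXml =
    "<thread><email><date>Thu Dec 18 15:03:23 EST 2014</date></email></thread>"

doc :: IOSArrow XmlTree XmlTree
doc = readString [withValidate no] sampleXml

-- /> inspects only direct children: the root's only child is <thread>,
-- so looking for <date> one level down finds nothing.
shallowDates :: IO [String]
shallowDates = runX (doc /> hasName "date" /> getText)

-- With />, every level of the path has to be spelled out.
stepwiseDates :: IO [String]
stepwiseDates =
    runX (doc /> hasName "thread" /> hasName "email" /> hasName "date" /> getText)

-- //> searches the descendants, so the nesting depth does not matter.
deepDates :: IO [String]
deepDates = runX (doc //> hasName "date" //> getText)

main :: IO ()
main = do
    shallowDates >>= print
    stepwiseDates >>= print
    deepDates >>= print
```

The /> form is the better fit when the document layout is fixed and matches at other depths should be ignored; //> is more convenient when the nesting may vary.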

As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following:

  • isText: This is used to test for text nodes
  • isAttr: This is used to test for an attribute tree
  • hasAttr: This is used to test whether an element node has an attribute node with a specific name
  • getElemName: This is used to select the name of an element node
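
As a brief illustration of two of these functions, the sketch below uses hasAttr and isText. The input.xml in this recipe has no attributes, so the id attribute in sampleXml is a hypothetical addition, as are the names sampleXml, doc, emailIds, and dateTexts. The getAttrValue function, which reads an attribute's value, also comes from Text.XML.HXT.Core.

```haskell
import Text.XML.HXT.Core

-- Hypothetical sample: the id attribute is not part of the recipe's input.xml.
sampleXml :: String
sampleXml =
    "<thread>"
        ++ "<email id=\"1\"><date>Thu Dec 18 15:03:23 EST 2014</date></email>"
        ++ "<email><date>Fri Dec 19 3:12:00 EST 2014</date></email>"
        ++ "</thread>"

doc :: IOSArrow XmlTree XmlTree
doc = readString [withValidate no] sampleXml

-- hasAttr keeps only element nodes that carry the named attribute;
-- getAttrValue then reads that attribute's value.
emailIds :: IO [String]
emailIds = runX (doc //> hasAttr "id" >>> getAttrValue "id")

-- isText keeps only text nodes among the children of each <date> element.
dateTexts :: IO [String]
dateTexts = runX (doc //> hasName "date" /> isText >>> getText)

main :: IO ()
main = do
    emailIds >>= print
    dateTexts >>= print
```

Only the first email element carries an id here, so the second one is filtered out by hasAttr.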

All the Arrow functions can be found in the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html.
