Clojure Data Analysis Cookbook

By : Eric Rochester
Overview of this book

Data is everywhere, and it's increasingly important to be able to gain insights that we can act on. Using Clojure for data analysis and collection, this book will show you how to gain fresh insights and perspectives from your data with an essential collection of practical, structured recipes.

"The Clojure Data Analysis Cookbook" presents recipes for every stage of the data analysis process. Whether scraping data off a web page, performing data mining, or creating graphs for the web, this book has something for the task at hand.

You'll learn how to acquire data, clean it up, and transform it into useful graphs which can then be analyzed and published to the Internet. Coverage includes advanced topics like processing data concurrently, applying powerful statistical techniques like Bayesian modelling, and even data mining algorithms such as K-means clustering, neural networks, and association rules.
Table of Contents (18 chapters)

Scraping textual data from web pages


Not all of the data on the web is in tables. In general, accessing non-tabular data is more complicated, and the process depends on how the page is structured.

Getting ready

First, we'll use the same dependencies and require statements as we did in the last recipe.
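For reference, the setup that recipe relies on looks roughly like the following. The version numbers are illustrative assumptions; the code below only needs Enlive for HTML parsing and selection, Incanter for to-dataset, clojure.string, and java.net.URL.

```clojure
;; In project.clj (versions are illustrative assumptions):
;;   [enlive "1.1.6"]
;;   [incanter "1.9.3"]

(ns web-scraping.core
  (:require [net.cgrand.enlive-html :as html] ; html-resource, select, text
            [clojure.string :as string])      ; string/join, string/trim
  (:use [incanter.core :only [to-dataset]])   ; converts maps to a dataset
  (:import [java.net URL]))
```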

Next, we'll identify the file to scrape the data from. I've put up a file at http://www.ericrochester.com/clj-data-analysis/data/small-sample-list.html.

This is a much more modern example of a web page. Instead of using tables, it marks up the text with the section and article tags and other features of HTML5.
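Judging from the selectors the functions below use, the page presumably looks something like this. This is a reconstructed fragment based on the recipe's output, not the actual file:

```html
<article>
  <header><h2>Addam's Family</h2></header>
  <ul>
    <li><em>Gomez Addams</em> — father</li>
    <li><em>Morticia Addams</em> — mother</li>
  </ul>
</article>
```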

How to do it…

  1. Since this is more complicated, we'll break the task down into a set of smaller functions.

    (defn get-family
      "This takes an article element and returns the family name."
      ([article]
       (string/join
         (map html/text (html/select article [:header :h2])))))
    
    (defn get-person
      "This takes a list item and returns a map of the person's name
      and relationship."
      ([li]
       (let [[{pnames :content} rel] (:content li)]
         {:name (apply str pnames)
          :relationship (string/trim rel)})))
    
    (defn get-rows
      "This takes an article and returns the person mappings, with
      the family name added."
      ([article]
       (let [family (get-family article)]
         (map #(assoc % :family family)
              (map get-person
                   (html/select article [:ul :li]))))))
    
    (defn load-data
      "This downloads the HTML page and pulls the data out of it."
      [html-url]
      (let [html (html/html-resource (URL. html-url))
            articles (html/select html [:article])]
        (to-dataset (mapcat get-rows articles))))
  2. Now that those are defined, we just call load-data with the URL that we want to scrape.

    user=> (load-data (str "http://www.ericrochester.com/"
      #_=>             "clj-data-analysis/data/small-sample-list.html"))
    [:family :name :relationship]
    ["Addam's Family" "Gomez Addams" "— father"]
    ["Addam's Family" "Morticia Addams" "— mother"]
    ["Addam's Family" "Pugsley Addams" "— brother"]
    ["Addam's Family" "Wednesday Addams" "— sister"]
    …

How it works…

After examining the web page, we find that each family is wrapped in an article tag that contains a header with an h2 tag. get-family pulls that tag out and returns its text.

get-person processes each person. The people in each family are in an unordered list (ul) and each person is in an li tag. The person's name itself is in an em tag. The let gets the contents of the li tag and decomposes it in order to pull out the name and relationship strings. get-person puts both pieces of information into a map and returns it.
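To make the destructuring concrete, here is roughly what a parsed li looks like as Enlive data and how the let pulls it apart. The exact structure is illustrative; whitespace in the real page may differ slightly:

```clojure
;; An li element as Enlive represents it (structure is illustrative):
;; {:tag :li, :attrs nil,
;;  :content [{:tag :em, :attrs nil, :content ["Gomez Addams"]}
;;            " — father"]}
;;
;; Destructuring (:content li) as [{pnames :content} rel] binds:
;;   pnames => ["Gomez Addams"]   ; the em tag's content
;;   rel    => " — father"        ; the trailing text node
;;
;; so get-person returns:
;;   {:name "Gomez Addams" :relationship "— father"}
```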

get-rows processes each article tag. It calls get-family to get that information from the header, gets the list item for each person, calls get-person on that list item, and adds the family to each person's mapping.

Here's how the HTML structures correspond to the functions that process them:

  - article → get-rows
  - header and its h2 (inside article) → get-family
  - each li (inside ul) → get-person

Finally, load-data ties the process together by downloading and parsing the HTML file and pulling the article tags from it. It then calls get-rows to create the data mappings, and converts the output to a dataset.