There's data everywhere on the Internet. Unfortunately, a lot of it is difficult to get to. It's buried in tables, or articles, or deeply nested div tags. Web scraping is brittle and laborious, but it's often the only way to free this data so we can use it in our analyses. This recipe describes how to load a web page and dig down into its contents so you can pull the data out.
To do this, we're going to use the Enlive library (https://github.com/cgrand/enlive/wiki). Enlive uses a domain-specific language (DSL) based on CSS selectors to locate elements within a web page. The library can also be used for templating; in this case, we'll just use it to get data back out of a web page.
First we have to add Enlive to the dependencies of the project:
:dependencies [[org.clojure/clojure "1.4.0"]
               [incanter/incanter-core "1.4.1"]
               [enlive "1.0.1"]]
Next, we require those namespaces in our REPL or script:
(require '(clojure [string :as string]))
(require '(net.cgrand [enlive-html :as html]))
(use 'incanter.core)
(import [java.net URL])
Finally, identify the file to scrape the data from. I've put up a file at http://www.ericrochester.com/clj-data-analysis/data/small-sample-table.html, which looks like the following:
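The listing itself isn't reproduced here, but judging from the selectors used below and the output we get back, the markup is presumably something like this (a reconstruction for illustration, not the actual file):

```html
<table id="data">
  <tr><th>Given Name</th><th>Surname</th><th>Relation</th></tr>
  <tr><td>Gomez</td><td>Addams</td><td>father</td></tr>
  <tr><td>Morticia</td><td>Addams</td><td>mother</td></tr>
  <tr><td>Pugsley</td><td>Addams</td><td>brother</td></tr>
  <tr><td>Wednesday</td><td>Addams</td><td>sister</td></tr>
</table>
```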
It's intentionally stripped down, and it makes use of tables for layout (hence the comment about 1999).
Since this task is a little complicated, let's pull the steps out into several functions:
(defn to-keyword
  "This takes a string and returns a normalized keyword."
  [input]
  (-> input
      string/lower-case
      (string/replace \space \-)
      keyword))

(defn load-data
  "This loads the data from a table at a URL."
  [url]
  (let [html (html/html-resource (URL. url))
        table (html/select html [:table#data])
        headers (->> (html/select table [:tr :th])
                     (map html/text)
                     (map to-keyword)
                     vec)
        rows (->> (html/select table [:tr])
                  (map #(html/select % [:td]))
                  (map #(map html/text %))
                  (filter seq))]
    (dataset headers rows)))
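Read in isolation, to-keyword is easy to check at the REPL: it lowercases a header string and swaps spaces for dashes before interning it as a keyword. For example:

```clojure
(require '[clojure.string :as string])

(defn to-keyword
  "This takes a string and returns a normalized keyword."
  [input]
  (-> input
      string/lower-case
      (string/replace \space \-)
      keyword))

;; "Given Name" -> lowercase -> "given name" -> dashes -> "given-name"
(to-keyword "Given Name")
;; => :given-name
```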
Now, call load-data with the URL you want to load data from:

user=> (load-data (str "http://www.ericrochester.com/"
  #_=>   "clj-data-analysis/data/small-sample-table.html"))

[:given-name :surname :relation]
["Gomez" "Addams" "father"]
["Morticia" "Addams" "mother"]
["Pugsley" "Addams" "brother"]
["Wednesday" "Addams" "sister"]
…
The let bindings in load-data tell the story here. Let's take them one by one.
The first binding has Enlive download the resource and parse it into its internal representation:
(let [html (html/html-resource (URL. url))
The next binding selects the table with the ID data:
table (html/select html [:table#data])
Now, we select all the header cells from the table, extract the text from each, convert each string to a keyword, and then collect the whole sequence into a vector. This gives us our headers for the dataset:
headers (->> (html/select table [:tr :th])
             (map html/text)
             (map to-keyword)
             vec)
We first select each row individually. The next two steps are wrapped in map so that the cells in each row stay grouped together. In those steps, we select the data cells in each row and extract the text from each. Lastly, we filter using seq, which removes any rows with no data, such as the header row:
rows (->> (html/select table [:tr])
          (map #(html/select % [:td]))
          (map #(map html/text %))
          (filter seq))]
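The (filter seq) step works because selecting :td cells from the header row yields an empty sequence, and seq returns nil for empty collections, so those rows are dropped. A small illustration with plain data:

```clojure
;; The header row contains <th> cells, not <td> cells, so selecting
;; :td from it produces an empty sequence. Mocked-up cell data:
(def raw-rows [[]
               ["Gomez" "Addams" "father"]
               ["Morticia" "Addams" "mother"]])

;; seq returns nil for the empty row, so filter removes it.
(filter seq raw-rows)
;; => (["Gomez" "Addams" "father"] ["Morticia" "Addams" "mother"])
```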
Here is another view of this data. In the following screenshot, we can see some of the code from this web page. The variable names and the select expressions are placed beside the HTML structures that they match. Hopefully, this makes it more clear how the select expressions correspond to the HTML elements.
Finally, we convert everything to a dataset. incanter.core/dataset is a lower-level constructor than incanter.core/to-dataset. It requires us to pass in the column names and the data matrix as separate sequences:
(dataset headers rows)))
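Conceptually, the dataset pairs each header keyword with the corresponding cell in every row, much like this plain-Clojure sketch (this is only an analogy, not Incanter's actual implementation):

```clojure
;; Headers and rows as produced by the bindings above.
(def headers [:given-name :surname :relation])
(def rows [["Gomez" "Addams" "father"]
           ["Morticia" "Addams" "mother"]])

;; Zip each row's cells against the column names.
(map #(zipmap headers %) rows)
;; => ({:given-name "Gomez", :surname "Addams", :relation "father"}
;;     {:given-name "Morticia", :surname "Addams", :relation "mother"})
```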
It's important to realize that the code, as presented here, is the result of a lot of trial and error. Screen scraping usually is. Generally, I download the page and save it so that I don't have to keep requesting it from the web server. Then I start a REPL and parse the web page there. I can look at the page's HTML with the browser's "view source" functionality while examining the parsed data interactively in the REPL. While working, I copy and paste code back and forth between the REPL and my text editor, as convenient. This workflow and environment make screen scraping (a fiddly, difficult task even when all goes well) almost enjoyable.