Mastering Clojure Data Analysis

By: Eric Richard Rochester

Getting the data


To get a copy of the SOTU addresses, we'll visit the website for the American Presidency Project at the University of California, Santa Barbara (http://www.presidency.ucsb.edu/). This site has the text for the SOTU addresses as well as an archive of many messages, letters, public papers, and other documents for various presidents. It's a great resource for looking at political rhetoric.

In this case, we'll write some code to visit the index page for the SOTU addresses. From there, we'll visit each of the pages that contain an address; remove the menus, headers, and footers; and strip out the HTML. We'll save the resulting plain text in a file in the data directory.
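
As a rough illustration of these steps, here is a minimal sketch that assumes the Enlive library (net.cgrand.enlive-html) is on the classpath. It is not the code from the book's download. The namespace, the function names, the numbered file names, and the content selector passed in by the caller are all hypothetical, and the real site's markup would need to be inspected to choose a selector. Instead of deleting menus, headers, and footers one by one, the sketch keeps only the element that holds the address, which has the same effect.

```clojure
(ns tm-sotu.download-sketch          ; hypothetical namespace, not the book's
  (:require [net.cgrand.enlive-html :as html]
            [clojure.java.io :as io]
            [clojure.string :as str])
  (:import [java.net URL]))

(defn fetch-page
  "Downloads the page at url and parses it into Enlive nodes."
  [url]
  (html/html-resource (URL. url)))

(defn address-links
  "Returns the href of every link in the parsed index page.
  Assumption: a real index page would need a narrower selector
  than [:a] to skip navigation links."
  [index-nodes]
  (keep #(get-in % [:attrs :href])
        (html/select index-nodes [:a])))

(defn address-text
  "Keeps only the nodes matching content-selector (dropping menus,
  headers, and footers) and returns their text with the tags removed."
  [page-nodes content-selector]
  (->> (html/select page-nodes content-selector)
       (map html/text)
       (str/join "\n")
       str/trim))

(defn save-address!
  "Writes text to file-name under data-dir, creating the directory
  if needed. Returns the java.io.File that was written."
  [data-dir file-name text]
  (let [f (io/file data-dir file-name)]
    (io/make-parents f)
    (spit f text)
    f))

(defn download-addresses!
  "Visits the index page, then each linked page, and saves the text
  of each one as a numbered file (a hypothetical naming scheme)."
  [index-url content-selector data-dir]
  (doseq [[i href] (map-indexed vector (address-links (fetch-page index-url)))]
    (let [page-url (str (URL. (URL. index-url) href))] ; resolve relative links
      (save-address! data-dir
                     (format "%03d.txt" i)
                     (address-text (fetch-page page-url) content-selector)))))
```

Because the parsing functions take already-parsed nodes, they can be tried at the REPL on a small HTML string, for example with (address-text (html/html-snippet "...") [:div.content]), before pointing download-addresses! at a live site.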

We won't see all of the code for this. To see the rest, look at the download.clj file in the src/tm_sotu/ directory in the downloaded code.

To handle downloading and parsing the files, we'll use the Enlive library (https://github.com/cgrand/enlive/wiki). This library provides a DSL to navigate and pull data from HTML pages. The syntax...