Clojure Data Analysis Cookbook - Second Edition

By: Eric Richard Rochester

Tokenizing text

Before we can do any real analysis of a text or a corpus of texts, we have to identify the words in the text. This process is called tokenization. Its output is a list of words, possibly along with the punctuation in the text. This differs from tokenizing formal languages such as programming languages: it is meant to work with natural languages, and its results are less structured.

It's easy to write your own tokenizer, but there are many edge cases to account for. It's also easy to include a natural language processing (NLP) library that provides one or more tokenizers. In this recipe, we'll use OpenNLP (http://opennlp.apache.org/) and its Clojure wrapper (https://clojars.org/clojure-opennlp).
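To make the edge cases concrete, here is a minimal sketch of a hand-rolled tokenizer in plain Clojure. The function name naive-tokenize and the regular expression are assumptions chosen for illustration; this is not how OpenNLP tokenizes, and it is not part of clojure-opennlp.

```clojure
;; A deliberately naive tokenizer: a token is either a run of word
;; characters or a single non-space, non-word character (punctuation).
;; naive-tokenize is a hypothetical helper, not a clojure-opennlp function.
(defn naive-tokenize
  [text]
  ;; re-seq returns nil when nothing matches; vec turns that into [].
  (vec (re-seq #"\w+|[^\w\s]" text)))

(naive-tokenize "Dr. Smith isn't here.")
;; => ["Dr" "." "Smith" "isn" "'" "t" "here" "."]
```

The abbreviation "Dr." is split from its period and the contraction "isn't" breaks into three pieces, which is the kind of case a trained tokenizer is meant to handle better.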

Getting ready

We'll need to include clojure-opennlp in our project.clj file:

(defproject com.ericrochester/text-data "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [clojure-opennlp "0.3...