Clojure for Data Science

By: Henry Garner

Mathematical folds with Tesser

We should now understand how to use folds to implement simple algorithms in parallel. We should also have some appreciation for the ingenuity required to find efficient solutions that make the minimum number of passes over the data.

Fortunately, the Clojure library Tesser (https://github.com/aphyr/tesser) includes implementations of common mathematical folds, including the mean, standard deviation, and covariance. To see how to use Tesser, let's consider the covariance of two fields from the IRS dataset: salaries and wages, A00200, and unemployment compensation, A02300.

Calculating covariance with Tesser

We encountered covariance in Chapter 3, Correlation, as a measure of how two sequences of data vary together. The formula is reproduced as follows:
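In its usual population form, covariance is the mean of the products of each pair's deviations from their respective means. The symbols below are assumed for illustration: x_i and y_i are the paired observations, \bar{x} and \bar{y} are their means, and n is the number of pairs.

```latex
\operatorname{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
```

The sample (unbiased) form divides by n - 1 instead of n. Which convention a particular library follows is best checked against its documentation.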

A covariance fold is included in tesser.math. In the following code, we'll include tesser.math as m and tesser.core as t:

(defn ex-5-17 []
  (let [data (into [] (load-data...