Clojure for Data Science

By: Henry Garner

Goodness-of-fit and R-square

Although we can see from the residual plot that a linear model is a good fit for our data, it would be desirable to quantify just how good it is. The R² statistic does this. Also called the coefficient of determination, R² varies between zero and one and indicates the explanatory power of the linear regression model. It measures the proportion of variation in the dependent variable that is explained, or accounted for, by the independent variable.

Generally, the closer R² is to 1, the better the regression line fits the points and the more of the variation in Y is explained by X. R² can be calculated using the following formula:

$$R^2 = 1 - \frac{\operatorname{var}(\varepsilon)}{\operatorname{var}(Y)}$$
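A minimal Clojure sketch of this calculation follows, assuming R² is taken as one minus the ratio of the residual variance to the variance of the observed values. The helper names `mean`, `variance`, and `r-squared` and the sample data are illustrative assumptions, not the book's own code, and the sketch uses only core Clojure rather than a statistics library.

```clojure
;; Illustrative helpers (hypothetical names, core Clojure only).
(defn mean [xs]
  (/ (reduce + xs) (count xs)))

(defn variance
  "Population variance: the mean squared deviation from the mean."
  [xs]
  (let [m (mean xs)]
    (mean (map #(let [d (- % m)] (* d d)) xs))))

(defn r-squared
  "1 - var(residuals) / var(y), where `fitted` holds the model's
  prediction for each observed value in `y`. Assumes the residuals
  average roughly zero, as they do for a least-squares fit with an
  intercept term."
  [fitted y]
  (let [residuals (map - y fitted)]
    (- 1 (/ (variance residuals) (variance y)))))

;; Made-up observations and predictions, purely for illustration.
(def y      [1.0 2.0 3.0 4.0])
(def fitted [1.1 1.9 3.2 3.8])

(r-squared fitted y) ; approximately 0.98
```

With these helpers, predictions that match `y` exactly give 1, while always predicting the mean of `y` gives 0.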

Here, var(ε) is the variance of the residuals and var(Y) is the variance in Y. To understand what this means, let's suppose you're trying to guess someone's weight. If you don't know anything else about them, your best strategy would be to guess the mean of the weights within the population in general. This way, the mean squared error of your guess compared to their true weight...