Clojure for Data Science

By: Henry Garner

Hypothesis testing

In the previous chapter, we introduced hypothesis testing as a means of quantifying how likely our data would be if a given hypothesis (such as the hypothesis that two samples were drawn from a single population) were true. We will use the same process to assess, based on our sample, whether a correlation exists in the wider population.

First, we must formulate two hypotheses, a null hypothesis and an alternative hypothesis:

H₀ is the hypothesis that the population correlation is zero. In other words, our conservative view is that the measured correlation is purely due to chance sampling error.

H₁ is the alternative possibility that the population correlation is not zero. Notice that we don't specify the direction of the correlation, only that there is one. This means we are performing a two-tailed test.

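The standard error of the sample r is given by:

$$SE_r = \sqrt{\frac{1 - \rho^2}{n - 2}}$$

Here, ρ is the population correlation and n is the sample size. This formula is only accurate when ρ is close to zero (recall that the magnitude of r influences our confidence), but fortunately, this is exactly...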
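As a rough illustration of how the standard error feeds into a test of H₀, the following sketch computes the sample correlation r, an estimated standard error sqrt((1 - r²) / (n - 2)), and the resulting t-statistic with n - 2 degrees of freedom. This is the usual textbook form of the test, which substitutes the sample r for the population correlation; it is an assumption here rather than necessarily the implementation used in this chapter. The function names `mean`, `pearson-r`, and `correlation-t-test` are hypothetical, and the code uses plain Clojure rather than any particular statistics library.

```clojure
(defn mean
  "Arithmetic mean of a sequence of numbers, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn pearson-r
  "Sample Pearson correlation of two equal-length numeric sequences."
  [xs ys]
  (let [mx  (mean xs)
        my  (mean ys)
        dx  (map #(- % mx) xs)          ; deviations from the mean of xs
        dy  (map #(- % my) ys)          ; deviations from the mean of ys
        sxy (reduce + (map * dx dy))
        sxx (reduce + (map * dx dx))
        syy (reduce + (map * dy dy))]
    (/ sxy (Math/sqrt (* sxx syy)))))

(defn correlation-t-test
  "Returns r, the degrees of freedom, the estimated standard error of r,
  and the t-statistic for H0: population correlation = 0.
  Assumes at least three paired observations and |r| < 1."
  [xs ys]
  (let [n  (count xs)
        df (- n 2)
        r  (pearson-r xs ys)
        ;; assumption: the sample r stands in for the population correlation
        se (Math/sqrt (/ (- 1.0 (* r r)) df))]
    {:r r
     :df df
     :standard-error se
     :t (/ r se)}))

;; Example call with made-up illustrative numbers (not data from the book):
(correlation-t-test [1 2 3 4 5] [2 4 5 4 5])
```

Because H₁ does not specify a direction, the absolute value of the returned `:t` would be compared against a two-tailed critical value of the t-distribution with `:df` degrees of freedom.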