4. Classification | Clojure for Data Science

Clojure for Data Science

By: Henry Garner

Overview of this book

The term “data science” has been widely used to define this new profession that is expected to interpret vast datasets and translate them into improved decision-making and performance. Clojure is a powerful language that combines the interactivity of a scripting language with the speed of a compiled language. Together with its rich ecosystem of native libraries and an extremely simple and consistent functional approach to data manipulation, which maps closely to mathematical formulae, it is an ideal, practical, and flexible language to meet a data scientist’s diverse needs. Taking you on a journey from simple summary statistics to sophisticated machine learning algorithms, this book shows how the Clojure programming language can be used to derive insights from data. Data scientists often forge a novel path, and you’ll see how to make use of Clojure’s Java interoperability capabilities to access libraries such as Mahout and MLlib, for which Clojure wrappers don’t yet exist. Even seasoned Clojure developers will develop a deeper appreciation for their language’s flexibility! You’ll learn how to apply statistical thinking to your own data and use Clojure to explore, analyze, and visualize it in a technically and statistically robust way. You can also use Incanter for local data processing and ClojureScript to present interactive visualizations. You’ll understand how distributed platforms such as Hadoop and Spark’s MapReduce and GraphX’s BSP solve the challenges of data analysis at scale, and how to explain algorithms using those programming models. Above all, by following the explanations in this book, you’ll learn not just how to be effective using the current state-of-the-art methods in data science, but why such methods work, so that you can continue to be productive as the field evolves.

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Reader feedback

Customer support

1. Statistics

Downloading the sample code

Running the examples

Downloading the data

Inspecting the data

Data scrubbing

Descriptive statistics

Variance

Quantiles

Binning data

Histograms

The normal distribution

Poincaré's baker

Skewness

Comparative visualizations

The importance of visualizations

Adding columns

Comparative visualizations of electorate data

Visualizing the Russian election data

Comparative visualizations

Summary

2. Inference

Introducing AcmeContent

Download the sample code

Load and inspect the data

Visualizing the dwell times

The exponential distribution

The central limit theorem

Standard error

Samples and populations

Confidence intervals

Visualizing different populations

Hypothesis testing

Testing a new site design

The t-statistic

Performing the t-test

One-sample t-test

Resampling

Testing multiple designs

Multiple comparisons

The browser simulation

jStat

Plotting probability densities

State and Reagent

Simulating multiple tests

The Bonferroni correction

Analysis of variance

The F-distribution

The F-statistic

The F-test

Effect size

Summary

3. Correlation

About the data

Inspecting the data

Visualizing the data

The log-normal distribution

Covariance

Pearson's correlation

Hypothesis testing

Confidence intervals

Regression

Ordinary least squares

Goodness-of-fit and R-square

Multiple linear regression

Matrices

The normal equation

Multiple R-squared

Adjusted R-squared

Collinearity

Prediction

Summary

4. Classification

About the data

Inspecting the data

Comparisons with relative risk and odds

The standard error of a proportion

The binomial distribution

Significance testing proportions

Chi-squared multiple significance testing

Classification with logistic regression

Implementing logistic regression with Incanter

Probability

Naive Bayes classification

Decision trees

Classification with clj-ml

Bias and variance

Ensemble learning and random forests

Saving the classifier to a file

Summary

5. Big Data

Downloading the code and data

The reducers library

Mathematical folds with Tesser

Multiple regression with gradient descent

Scaling gradient descent with Hadoop

Stochastic gradient descent

Summary

6. Clustering

Downloading the data

Extracting the data

Inspecting the data

Clustering text

Creating term frequency vectors

Clustering with k-means and Incanter

Better clustering with TF-IDF

Large-scale clustering with Mahout

Running k-means clustering with Mahout

Cluster evaluation measures

The drawbacks of k-means

The curse of dimensionality

Summary

7. Recommender Systems

Download the code and data

Inspect the data

Parse the data

Types of recommender systems

Item-based and user-based recommenders

Slope One recommenders

Building a user-based recommender with Mahout

k-nearest neighbors

Recommender evaluation with Mahout

Probabilistic methods for large sets

Jaccard similarity for large sets with MinHash

Dimensionality reduction

Large-scale machine learning with Apache Spark and MLlib

Machine learning on Spark with MLlib

Summary

8. Network Analysis

Download the data

Graph traversal with Loom

Breadth-first and depth-first search

Finding the shortest path

Whole-graph analysis

Scale-free networks

Distributed graph computation with GraphX

Summary

9. Time Series

About the data

Fitting curves with a linear model

Time series decomposition

Discrete time models

Maximum likelihood estimation

Time series forecasting

Summary

10. Visualization

Download the code and data

Exploratory data visualization

Using Quil for visualization

Visualization for communication

Summary

Index
