Clojure for Data Science

By: Henry Garner

Distributed graph computation with GraphX

GraphX (https://spark.apache.org/graphx/) is a distributed graph processing library that is designed to work with Spark. Like the MLlib library we used in the previous chapter, GraphX provides a set of abstractions that are built on top of Spark's RDDs. By representing the vertices and edges of a graph as RDDs, GraphX is able to process very large graphs in a scalable way.
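
The idea of holding a graph as two flat collections, one of vertices and one of edges, can be sketched in plain Clojure without Spark. The snippet below is only an illustration: the names `edges`, `vertices`, and `out-degrees` are hypothetical and are not part of the GraphX API. In GraphX the two collections would be RDDs distributed across a cluster rather than in-memory values.

```clojure
;; Illustrative only: an in-memory stand-in for the vertex and edge
;; collections that GraphX would hold as RDDs. All names are hypothetical.

(def edges
  ;; Each edge is a [source destination] pair.
  [[:a :b] [:a :c] [:b :c] [:c :d]])

(def vertices
  ;; The vertex collection is derived from the endpoints of every edge.
  (into (sorted-set) (mapcat identity edges)))

(defn out-degrees
  "Counts outgoing edges per vertex by scanning the edge collection once."
  [edges]
  (frequencies (map first edges)))

(out-degrees edges)
;; => {:a 2, :b 1, :c 1}
```

Because `out-degrees` only needs a single pass over the edge collection, it is the kind of computation that could be spread over partitions of a much larger edge list.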

We've seen in previous chapters how to process a large dataset using MapReduce and Hadoop. Hadoop is an example of a data-parallel system: the dataset is divided into groups that are processed in parallel. Spark is also a data-parallel system: RDDs are distributed across the cluster and processed in parallel.
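
As a small reminder of what data parallelism means, the sketch below divides a collection into groups, reduces each group independently, and then combines the partial results. It runs on a single machine using `clojure.core.reducers/fold` and is not Hadoop or Spark code; `parallel-sum` and its `group-size` argument are hypothetical names chosen for illustration.

```clojure
;; Illustrative only: data parallelism on one machine.
;; The function name and group size are assumptions for this sketch.
(require '[clojure.core.reducers :as r])

(defn parallel-sum
  "Sums xs by splitting them into groups of roughly group-size elements,
  summing each group (potentially in parallel), and adding the partial sums."
  [group-size xs]
  ;; fold needs a foldable collection such as a vector to split the work.
  ;; The first + combines partial sums; the second + reduces within a group.
  (r/fold group-size + + (vec xs)))

(parallel-sum 1000 (range 10000))
;; => 49995000
```

The result does not depend on how the data is grouped, because addition is associative. That property is what allows a data-parallel system to process the groups in any order.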

Data-parallel systems are an appropriate way to scale data processing when your data closely resembles a table. Graphs, which may have complex internal structure, are not represented most efficiently as tables. Although graphs can be represented as edge lists, as we've seen...