Clojure for Data Science

By: Henry Garner

Better clustering with TF-IDF

Term Frequency-Inverse Document Frequency (TF-IDF) is a general approach to weighting the terms in a document vector so that terms that are common across the whole dataset receive less weight than terms that are rarer. This matches both intuition and what we observed earlier: words such as "said" are not a strong basis for building clusters.
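To make the weighting concrete, here is a minimal sketch of one common TF-IDF formulation, in which a term's raw count in a document is multiplied by the natural log of the total number of documents divided by the number of documents containing the term. The function names (tokenize, document-frequencies, tf-idf) and the three toy sentences are assumptions made for illustration; they are not the chapter's own implementation, and other TF-IDF variants smooth or scale the terms differently.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: lower-case the text and keep runs of letters.
(defn tokenize [text]
  (re-seq #"[a-z]+" (str/lower-case text)))

;; Map each term to the number of documents that contain it at least once.
(defn document-frequencies [tokenized-docs]
  (frequencies (mapcat distinct tokenized-docs)))

;; For each document, return a map of term -> tf * ln(N / df), where
;; tf is the term's count in that document, N is the number of documents
;; and df is the number of documents containing the term.
(defn tf-idf [tokenized-docs]
  (let [n  (count tokenized-docs)
        df (document-frequencies tokenized-docs)]
    (for [doc tokenized-docs]
      (into {}
            (map (fn [[term tf]]
                   [term (* tf (Math/log (/ n (double (df term)))))]))
            (frequencies doc)))))

;; Toy corpus, invented for this sketch.
(def example-docs
  (map tokenize ["The minister said rates would rise"
                 "The bank said profits fell"
                 "Wheat said to rise"]))

(def example-weights (tf-idf example-docs))
```

Under this formulation, a term that appears in every document, such as "said" in the toy corpus, receives a weight of zero because ln(1) = 0, while a term confined to a single document keeps a positive weight.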

Zipf's law

Zipf's law states that the frequency of any word is inversely proportional to its rank in the frequency table. Thus, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. Let's see if this applies across our Reuters corpus:

(defn ex-6-13 []
  (let [documents (fs/glob "data/reuters-text/*.txt")
        doc-count 1000
        top-terms 25
        term-frequencies (->> (map slurp documents)
                              (remove too-short?)
                              (take...