Haskell Data Analysis Cookbook

By: Nishant Shukla

Implementing MapReduce to count word frequencies

MapReduce is a framework for writing efficient parallel algorithms based on divide and conquer. If a task can be split into smaller independent tasks, and the results of the individual tasks can be combined to form the final answer, then MapReduce is likely a good fit for the job.

In the following figure, we can see that a large list is split up, and the mapper functions work in parallel on each split. After all the mapping is complete, the second phase of the framework begins, reducing the intermediate results into one final answer.

In this recipe, we will be counting word frequencies in a large corpus of text. Given many files of words, we will apply the MapReduce framework to find the word frequencies in parallel.
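The two phases can be sketched in a few lines. This is a minimal illustration, not necessarily the implementation this recipe goes on to build. It assumes the `parallel` and `containers` packages are available, and the names `countWords`, `mergeCounts`, and `wordFrequencies` are hypothetical helpers introduced here. Each input string stands in for the contents of one file.

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.Map.Strict as Map

-- Map phase (hypothetical helper): count the words in one chunk of text.
countWords :: String -> Map.Map String Int
countWords = Map.fromListWith (+) . map (\w -> (w, 1)) . words

-- Reduce phase (hypothetical helper): merge the per-chunk tables,
-- adding the counts of words that appear in more than one chunk.
mergeCounts :: [Map.Map String Int] -> Map.Map String Int
mergeCounts = Map.unionsWith (+)

-- Evaluate the mappers in parallel, fully forcing each table with
-- rdeepseq, then reduce the results sequentially.
wordFrequencies :: [String] -> Map.Map String Int
wordFrequencies = mergeCounts . parMap rdeepseq countWords
```

For example, `wordFrequencies ["a b a", "b c"]` counts `a` twice, `b` twice, and `c` once. To actually use more than one core, the program has to be compiled with GHC's `-threaded` flag and run with `+RTS -N`.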

Getting ready

Install the parallel package using cabal as follows:

$ cabal install parallel

Create multiple files with words. In this recipe, we download a huge text file and split it up using the UNIX split command as follows...