Learning Haskell Data Analysis

By: James Church

Overview of this book

Haskell is trending in the field of data science by providing a powerful platform for robust data science practices. This book provides you with the skills to handle large amounts of data, even if that data is in a less than perfect state. Each chapter in the book helps to build a small library of code that will be used to solve a problem for that chapter. The book starts with creating databases out of existing datasets, cleaning that data, and interacting with databases within Haskell in order to produce charts for publications. It then moves towards more theoretical concepts that are fundamental to introductory data analysis, but in a context of a real-world problem with real-world data. As you progress in the book, you will be relying on code from previous chapters in order to help create new solutions quickly. By the end of the book, you will be able to manipulate, find, and analyze large and small sets of data using your own Haskell libraries.

Getting ready


You will need to install the Haskell platform, which is available on all three major operating systems: Windows, Mac, and Linux. I primarily work with Debian Linux. Linux has the benefit of being equipped with a versatile command line, which can facilitate almost everything that is essential to the data analysis process. From the command line, we can download software, install Haskell libraries, download datasets, write files, and view raw datasets. One essential activity that the command line cannot do for us is render graphics in enough detail to inspect the charts that our analyses produce.

Installing the Haskell platform on Linux

On Ubuntu- and Debian-based systems, you can install the Haskell platform using apt-get, as follows:

$ sudo apt-get install haskell-platform 

This single command will install everything that is needed to get started, including the compiler (ghc), interactive command line (ghci), and the library install tool (cabal). Take a moment to test the following commands:

$ ghc --version 
The Glorious Glasgow Haskell Compilation System, version 7.4.1 
$ ghci --version 
The Glorious Glasgow Haskell Compilation System, version 7.4.1 

If you get back the version numbers for the Haskell compiler and the Haskell interactive prompt, you should be all set. However, we do need to perform some housekeeping with regard to cabal. We will use cabal throughout this book, and it requires an update immediately. We will update the cabal tool through cabal itself.

First, we will update the Haskell package list from Hackage using the update directive, as follows:

$ cabal update

Next, we will install an updated cabal by installing the cabal-install package. This command will not overwrite the existing cabal program. Instead, it will place an updated cabal in your home folder at ~/.cabal/bin/cabal.

$ cabal install cabal-install

Your system now has two versions of cabal on it. We create an alias to make sure that we only use the updated version of cabal. An alias typed at the prompt is temporary and lasts only for the current shell session, so you should also add the following line to one of the shell configuration files in your home directory. (We added ours to ~/.bash_aliases and reloaded aliases with source ~/.bash_aliases.)

$ alias cabal='~/.cabal/bin/cabal'
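Adding the alias line to a configuration file by hand is easy to repeat by accident. The following is a minimal sketch of one way to append it only when it is missing. The function name add_cabal_alias is hypothetical, and the default file ~/.bash_aliases is the one mentioned above, which only takes effect on systems whose ~/.bashrc sources it.

```shell
# Append the cabal alias to an aliases file only if it is not already there.
# add_cabal_alias is a hypothetical helper name used for illustration.
# The optional first argument is the file to edit (default: ~/.bash_aliases).
add_cabal_alias() {
    aliases_file=${1:-"$HOME/.bash_aliases"}
    alias_line="alias cabal='~/.cabal/bin/cabal'"

    # Create the file if it does not exist yet.
    touch "$aliases_file"

    # grep flags: -F fixed string, -x match the whole line, -q no output.
    if ! grep -Fxq "$alias_line" "$aliases_file"; then
        printf '%s\n' "$alias_line" >> "$aliases_file"
    fi
}
```

Calling add_cabal_alias with no arguments edits ~/.bash_aliases, and calling it a second time leaves the file unchanged. The new alias still needs to be loaded into the current session with source ~/.bash_aliases.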

If all goes according to plan, you will have an updated version of cabal on your system. Here is the version of cabal used at the time of writing this book:

$ cabal --version
cabal-install version 1.22.0.0
using version 1.22.0.0 of the Cabal library
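The version checks above confirm each tool one at a time. As a rough sketch, the same idea can be wrapped in a small script that reports which of the tools are on the PATH. The function names check_tool and check_tools are hypothetical helpers, not part of the Haskell platform.

```shell
#!/bin/sh
# Report whether a single tool can be found, and where it resolves to.
# check_tool is a hypothetical helper name used for illustration.
check_tool() {
    if location=$(command -v "$1" 2>/dev/null); then
        printf 'found: %s -> %s\n' "$1" "$location"
    else
        printf 'missing: %s\n' "$1"
        return 1
    fi
}

# Check every tool named in the arguments.
# Returns 0 only if all of them were found.
check_tools() {
    rc=0
    for tool in "$@"; do
        check_tool "$tool" || rc=1
    done
    return "$rc"
}
```

For the tools in this section, the call would be check_tools ghc ghci cabal. When run as a script, this reports the cabal found on the PATH, because aliases set up in an interactive shell's configuration files are not loaded by scripts.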

If you use cabal long enough, you may run into problems. Rather than going into a prolonged discussion on how to manage Haskell packages, it is easier to start over with a clean slate. Your packages are downloaded to a folder under ~/.cabal, and they are registered with the Haskell environment under the ~/.ghc/ directory. If you find that a package has not been installed due to a conflicted dependency, you can spend an evening reading the package documentation to figure out which packages need to be removed or installed. Alternatively, you can use the following command and wipe the slate clean:

$ rm -rf ~/.ghc

The preceding command wipes out all the Haskell packages installed for your user account (the packages that came with the system-wide Haskell platform remain). We can promise that you will not have conflicting packages if you have no packages. We call this the Break Glass In Case of Emergency solution. This is obviously not the best solution, but it is a solution that gets your necessary packages installed. You have more important things to do than wrestle with cabal. While it may take an hour or so to download and reinstall packages with this approach, it is less stressful than working through package version numbers.
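If deleting the directory outright feels too drastic, a more cautious variant is to move it aside so that it can be restored later. This is not the approach used in this book, only a sketch of an alternative. The function name backup_ghc_dir and the .bak.<timestamp> naming scheme are assumptions made for illustration.

```shell
# Move a GHC package directory aside instead of deleting it.
# backup_ghc_dir is a hypothetical helper name used for illustration.
# The optional first argument is the directory to move (default: ~/.ghc).
backup_ghc_dir() {
    ghc_dir=${1:-"$HOME/.ghc"}
    # Drop a trailing slash so the backup lands next to the directory.
    ghc_dir=${ghc_dir%/}

    if [ ! -d "$ghc_dir" ]; then
        printf 'nothing to do: %s does not exist\n' "$ghc_dir"
        return 0
    fi

    # The backup name is an assumed convention: <dir>.bak.<timestamp>
    backup_dir="$ghc_dir.bak.$(date +%Y%m%d%H%M%S)"
    mv "$ghc_dir" "$backup_dir" &&
        printf 'moved %s to %s\n' "$ghc_dir" "$backup_dir"
}
```

After running backup_ghc_dir, packages can be reinstalled from a clean slate exactly as they would be after the rm command. Once everything works again, the backup directory can be deleted to reclaim disk space.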

The software used in addition to Haskell

There are three open source software packages used in this book that work alongside the Haskell platform. If you are using Debian or Ubuntu, you will be able to download each of these packages using the apt-get command-line tool. The instructions on how to download and install these packages will be introduced when the software is needed. If you are using Windows or Mac, you will have to consult the documentation for these software packages for installation instructions for your system.

SQLite3

SQLite3 (for more information refer to: https://sqlite.org/) is a standalone Structured Query Language (SQL) database engine. We use SQLite3 to filter and organize large amounts of data. It requires no configuration, does not use a background server process, and each database is self-contained in a single file (SQLite does not require any particular file extension; in this book, database files end with the .sql extension). The software is portable, has many of the features found in server-based SQL database engines, and can support large databases. We will introduce SQLite3 in Chapter 2, Getting Our Feet Wet, and use it extensively in the rest of the book.

Gnuplot

Gnuplot (for more information refer to: http://www.gnuplot.info/) is a command-line tool that can be used to create charts and graphs for academic publications. It supports many features related to 2D and 3D plotting as well as a number of output and interactive formats. We will use gnuplot in conjunction with the EasyPlot Haskell wrapper module. EasyPlot gives us access to a subset of the features of gnuplot (which means that even though our charts are being piped through gnuplot, we will not be able to utilize the full power of gnuplot from within this library). Every chart presented in this book was created using EasyPlot and gnuplot. We will introduce EasyPlot and gnuplot in Chapter 4, Plotting.

LAPACK

LAPACK (short for Linear Algebra PACKage) (for more information refer to: http://www.netlib.org/lapack/) has been under continuous development since the early 1990s. To this day, the library is written in FORTRAN. Because it is so vital to science, it is funded through the United States National Science Foundation (NSF). The library supports routines for working with systems of equations and matrices, such as matrix multiplication, matrix inversion, and eigenvalue decomposition. We will use the hmatrix wrapper for LAPACK in Chapter 8, Building a Recommendation Engine, to write our own Principal Component Analysis (PCA) function and create a recommendation engine. Using LAPACK also lets us avoid the messiness that comes with trying to write an eigensolver ourselves.