Book Image

Scala Data Analysis Cookbook

By : Arun Manivannan
Book Image

Scala Data Analysis Cookbook

By: Arun Manivannan

Overview of this book

This book will introduce you to the most popular Scala tools, libraries, and frameworks through practical recipes around loading, manipulating, and preparing your data. It will also help you explore and make sense of your data using stunning and insightfulvisualizations, and machine learning toolkits. Starting with introductory recipes on utilizing the Breeze and Spark libraries, get to grips withhow to import data from a host of possible sources and how to pre-process numerical, string, and date data. Next, you’ll get an understanding of concepts that will help you visualize data using the Apache Zeppelin and Bokeh bindings in Scala, enabling exploratory data analysis. iscover how to program quintessential machine learning algorithms using Spark ML library. Work through steps to scale your machine learning models and deploy them into a standalone cluster, EC2, YARN, and Mesos. Finally dip into the powerful options presented by Spark Streaming, and machine learning for streaming data, as well as utilizing Spark GraphX.
Table of Contents (14 chapters)
Scala Data Analysis Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
Index

Getting Breeze – the linear algebra library


In simple terms, Breeze (http://www.scalanlp.org) is a Scala library that extends the Scala collection library to provide support for vectors and matrices in addition to providing a whole bunch of functions that support their manipulation. We could safely compare Breeze to NumPy (http://www.numpy.org/) in Python terms. Breeze forms the foundation of MLlib—the Machine Learning library in Spark, which we will explore in later chapters.

In this first recipe, we will see how to pull the Breeze libraries into our project using Scala Build Tool (SBT). We will also see a brief history of Breeze to better appreciate why it could be considered as the "go to" linear algebra library in Scala.

Note

For all our recipes, we will be using Scala 2.10.4 along with Java 1.7. I wrote the examples using the Scala IDE, but please feel free to use your favorite IDE.

How to do it...

Let's add the Breeze dependencies into our build.sbt so that we can start playing with them in the subsequent recipes. The Breeze dependencies are just two—the breeze (core) and the breeze-native dependencies.

  1. Under a brand new folder (which will be our project root), create a new file called build.sbt.

  2. Next, add the breeze libraries to the project dependencies:

    organization := "com.packt"
    
    name := "chapter1-breeze"
    
    scalaVersion := "2.10.4"
    
    libraryDependencies  ++= Seq(
      "org.scalanlp" %% "breeze" % "0.11.2",
      //Optional - the 'why' is explained in the How it works section
      "org.scalanlp" %% "breeze-natives" % "0.11.2"
    )
  3. From that folder, issue a sbt compile command in order to fetch all your dependencies.

    Note

    You could import the project into your Eclipse using sbt eclipse after installing the sbteclipse plugin https://github.com/typesafehub/sbteclipse/. For IntelliJ IDEA, you just need to import the project by pointing to the root folder where your build.sbt file is.

There's more...

Let's look into the details of what the breeze and breeze-native library dependencies we added bring to us.

The org.scalanlp.breeze dependency

Breeze has a long history in that it isn't written from scratch in Scala. Without the native dependency, Breeze leverages the power of netlib-java that has a Java-compiled version of the FORTRAN Reference implementation of BLAS/LAPACK. The netlib-java also provides gentle wrappers over the Java compiled library. What this means is that we could still work without the native dependency but the performance won't be great considering the best performance that we could leverage out of this FORTRAN-translated library is the performance of the FORTRAN reference implementation itself. However, for serious number crunching with the best performance, we should add the breeze-natives dependency too.

The org.scalanlp.breeze-natives package

With its native additive, Breeze looks for the machine-specific implementations of the BLAS/LAPACK libraries. The good news is that there are open source and (vendor provided) commercial implementations for most popular processors and GPUs. The most popular open source implementations include ATLAS (http://math-atlas.sourceforge.net) and OpenBLAS (http://www.openblas.net/).

If you are running a Mac, you are in luck—Native BLAS libraries come out of the box on Macs. Installing NativeBLAS on Ubuntu / Debian involves just running the following commands:

sudo apt-get install libatlas3-base libopenblas-base
sudo update-alternatives --config libblas.so.3
sudo update-alternatives --config liblapack.so.3

Tip

Downloading the example code

You can download the example code files from your account at http://www.packtpub.com for all the Packt Publishing books you have purchased. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

For Windows, please refer to the installation instructions on https://github.com/xianyi/OpenBLAS/wiki/Installation-Guide.