Scala for Data Science

By : Pascal Bugnion

Overview of this book

Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. Languages such as R, Python, and Java are the ones most commonly used for data science, but Scala is particularly good at analyzing large data sets without a significant impact on performance, and it is therefore being adopted by many developers and data scientists. Data scientists might be aware that building applications that are truly scalable is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks, will give you the tools to construct robust data pipelines. This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala. Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. As well as learning about Scala’s emphasis on functional structures and immutability, you will learn how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks. This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Scala and Data Science

Data science

Programming in data science

Why Scala?

When not to use Scala

Summary

References

Manipulating Data with Breeze

Code examples

Installing Breeze

Getting help on Breeze

Basic Breeze data types

An example – logistic regression

Towards re-usable code

Alternatives to Breeze

Summary

References

Plotting with breeze-viz

Diving into Breeze

Customizing plots

Customizing the line type

More advanced scatter plots

Multi-plot example – scatterplot matrix plots

Managing without documentation

Breeze-viz reference

Data visualization beyond breeze-viz

Summary

Parallel Collections and Futures

Parallel collections

Futures

Summary

References

Scala and SQL through JDBC

Interacting with JDBC

First steps with JDBC

JDBC summary

Functional wrappers for JDBC

Safer JDBC connections with the loan pattern

Enriching JDBC statements with the "pimp my library" pattern

Wrapping result sets in a stream

Looser coupling with type classes

Creating a data access layer

Summary

References

Slick – A Functional Interface for SQL

FEC data

Invokers

Operations on columns

Aggregations with "Group by"

Accessing database metadata

Slick versus JDBC

Summary

References

Web APIs

A whirlwind tour of JSON

Querying web APIs

JSON in Scala – an exercise in pattern matching

Extraction using case classes

Concurrency and exception handling with futures

Authentication – adding HTTP headers

Summary

References

Scala and MongoDB

MongoDB

Connecting to MongoDB with Casbah

Inserting documents

Extracting objects from the database

Complex queries

Casbah query DSL

Custom type serialization

Beyond Casbah

Summary

References

Concurrency with Akka

GitHub follower graph

Actors as people

Hello world with Akka

Case classes as messages

Actor construction

Anatomy of an actor

Follower network crawler

Fetcher actors

Routing

Message passing between actors

Queue control and the pull pattern

Accessing the sender of a message

Stateful actors

Follower network crawler

Fault tolerance

Custom supervisor strategies

Life-cycle hooks

What we have not talked about

Summary

References

Distributed Batch Processing with Spark

Installing Spark

Acquiring the example data

Resilient distributed datasets

Building and running standalone programs

Spam filtering

Lifting the hood

Data shuffling and partitions

Summary

Reference

Spark SQL and DataFrames

DataFrames – a whirlwind introduction

Aggregation operations

Joining DataFrames together

Custom functions on DataFrames

DataFrame immutability and persistence

SQL statements on DataFrames

Complex data types – arrays, maps, and structs

Interacting with data sources

Standalone programs

Summary

References

Distributed Machine Learning with MLlib

Introducing MLlib – Spam classification

Pipeline components

Evaluation

Regularization in logistic regression

Cross-validation and model selection

Beyond logistic regression

Summary

References

Web APIs with Play

Client-server applications

Introduction to web frameworks

Model-View-Controller architecture

Single page applications

Building an application

The Play framework

Dynamic routing

Actions

Interacting with JSON

Querying external APIs and consuming JSON

Creating APIs with Play: a summary

Rest APIs: best practice

Summary

References

Visualization with D3 and the Play Framework

GitHub user data

Do I need a backend?

JavaScript dependencies through web-jars

Towards a web application: HTML templates

Modular JavaScript through RequireJS

Bootstrapping the applications

Client-side program architecture

Drawing plots with NVD3

Summary

References

Pattern Matching and Extractors

Pattern matching in for comprehensions

Pattern matching internals

Extracting sequences

Summary

Reference

Index

Programming in data science

This book is not a book about data science. It is a book about how to use Scala, a programming language, for data science. So, where does programming come in when processing data?

Computers are involved at every step of the data science pipeline, but not necessarily in the same manner. The style of programs that we build will be drastically different if we are just writing throwaway scripts to explore data or trying to build a scalable application that pushes data through a well-understood pipeline to continuously deliver business intelligence.

Let's imagine that we work for a company making games for mobile phones in which you can purchase in-game benefits. The majority of users never buy anything, but a small fraction is likely to spend a lot of money. We want to build a model that recognizes big spenders based on their play patterns.

The first step is to explore data, find the right features, and build a model based on a subset of the data. In this exploration phase, we have a clear goal in mind but little idea of how to get there. We want a light, flexible language with strong libraries to get us a working model as soon as possible.

Once we have a working model, we need to deploy it on our gaming platform to analyze the usage patterns of all the current users. This is a very different problem: we have a relatively clear understanding of the goals of the program and of how to get there. The challenge comes in designing software that will scale out to handle all the users and be robust to future changes in usage patterns.

In practice, the software that we write typically lies on a spectrum ranging from a single throwaway script to production-level code that must withstand future expansion and load increases. Before writing any code, the data scientist must understand where their software lies on this spectrum. Let's call this the permanence spectrum.
