Scala for Data Science

By: Pascal Bugnion

Overview of this book

Scala is a multi-paradigm programming language (it supports both object-oriented and functional programming) and scripting language used to build applications for the JVM. Languages such as R, Python, and Java are widely used for data science. Scala is particularly good at analyzing large sets of data without any significant impact on performance, and it is therefore being adopted by many developers and data scientists. Data scientists might be aware that building applications that are truly scalable is hard. Scala, with its powerful functional libraries for interacting with databases and building scalable frameworks, will give you the tools to construct robust data pipelines. This book will introduce you to the libraries for ingesting, storing, manipulating, processing, and visualizing data in Scala. Packed with real-world examples and interesting data sets, this book will teach you to ingest data from flat files and web APIs and store it in a SQL or NoSQL database. It will show you how to design scalable architectures to process and model your data, starting from simple concurrency constructs such as parallel collections and futures, through to actor systems and Apache Spark. Alongside Scala’s emphasis on functional structures and immutability, you will learn how to use the right parallel construct for the job at hand, minimizing development time without compromising scalability. Finally, you will learn how to build beautiful interactive visualizations using web frameworks. This book gives tutorials on some of the most common Scala libraries for data science, allowing you to quickly get up to speed with building data science and data engineering solutions.

Appendix A. Pattern Matching and Extractors

Pattern matching is a powerful tool for control flow in Scala. It is often underused and underestimated by people coming to Scala from imperative languages.

Let's start with a few examples of pattern matching before diving into the theory. We start by defining a tuple:

scala> val names = ("Pascal", "Bugnion")
names: (String, String) = (Pascal,Bugnion)

We can use pattern matching to extract the elements of this tuple and bind them to variables:

scala> val (firstName, lastName) = names
firstName: String = Pascal
lastName: String = Bugnion

We just extracted the two elements of the names tuple, binding them to the variables firstName and lastName. Notice how the left-hand side defines a pattern that the right-hand side must match: we are declaring that the variable names must be a two-element tuple. To make the pattern more specific, we could also have specified the expected types of the elements in the tuple:

scala> val (firstName: String, lastName: String) = names
firstName: String = Pascal
lastName: String = Bugnion

What happens if the pattern on the left-hand side does not match the right-hand side?

scala> val (firstName, middleName, lastName) = names
<console>:13: error: constructor cannot be instantiated to expected type;
found   : (T1, T2, T3)
required: (String, String)
   val (firstName, middleName, lastName) = names

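This results in a compile error, because the compiler can see that a two-element tuple can never match a three-element pattern. Failures that depend on the values themselves can only be detected when the code runs, and these raise a scala.MatchError at runtime.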

Pattern matching is very expressive. To achieve the same behavior without pattern matching, you would have to do the following explicitly:

  • Verify that the variable names is a two-element tuple

  • Extract the first element and bind it to firstName

  • Extract the second element and bind it to lastName
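The three steps above can be sketched by hand as follows. This is only an illustration: here names is deliberately typed as Any (an assumption, not what the REPL session above does) so that the shape check has to happen at runtime, and pair is a hypothetical helper name.

```scala
// Assumption: the value is typed as Any, so the compiler cannot check its shape for us.
val names: Any = ("Pascal", "Bugnion")

// Step 1: verify that the value is a two-element tuple.
require(names.isInstanceOf[Tuple2[_, _]], s"expected a two-element tuple, got $names")
val pair = names.asInstanceOf[Tuple2[Any, Any]]

// Step 2: extract the first element and bind it to firstName.
val firstName = pair._1

// Step 3: extract the second element and bind it to lastName.
val lastName = pair._2

println(s"$firstName $lastName") // prints "Pascal Bugnion"
```

The single line val (firstName, lastName) = names expresses all of this, and it also keeps the element types, which the manual version loses.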

If we expect certain elements in the tuple to have specific values, we can verify this as part of the pattern match. For instance, we can verify that the first element of the names tuple matches "Pascal":

scala> val ("Pascal", lastName) = names
lastName: String = Bugnion
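If the first element is not "Pascal", the compiler cannot tell in advance, so the failure only shows up at runtime. The following sketch, assuming Scala 2 as in the REPL sessions here, wraps the pattern in scala.util.Try to make the outcome visible. The names other and attempt are hypothetical.

```scala
import scala.util.Try

// A tuple whose first element does not equal the literal in the pattern.
val other = ("Martin", "Odersky")

// The pattern compiles, but the literal "Pascal" is only checked when the code runs.
val attempt = Try {
  val ("Pascal", lastName) = other
  lastName
}

// The match fails, so attempt holds a Failure wrapping a scala.MatchError.
println(attempt.isFailure) // prints "true"
```

Newer compiler versions may warn that such a pattern can fail, which is a reasonable hint to use a match expression with a fallback case instead.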

Besides tuples, we can also match on Scala collections:

scala> val point = Array(1, 2, 3)
point: Array[Int] = Array(1, 2, 3)

scala> val Array(x, y, z) = point
x: Int = 1
y: Int = 2
z: Int = 3

Notice the similarity between this pattern matching and array construction:

scala> val point = Array(x, y, z)
point: Array[Int] = Array(1, 2, 3)

Syntactically, Scala expresses pattern matching as the reverse process to instance construction. We can think of pattern matching as the deconstruction of an object, binding the object's constituent parts to variables.

When matching against collections, we are sometimes interested only in the first element, or the first few elements, and want to discard the rest of the collection, whatever its length. The wildcard pattern _* matches any number of remaining elements:

scala> val Array(x, _*) = point
x: Int = 1

By default, the part of the collection matched by _* is not bound to a variable. We can capture it by binding it to a name with @:

scala> val Array(x, xs @ _*) = point
x: Int = 1
xs: Seq[Int] = Vector(2, 3)
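The same sequence patterns work for other collections and inside a match block. The function below is a small sketch, not taken from the book: describe is a hypothetical name, and it uses Seq patterns to distinguish empty, single-element, and longer sequences.

```scala
// Hypothetical helper: classify a sequence by its shape using sequence patterns.
def describe(xs: Seq[Int]): String = xs match {
  case Seq()             => "empty"
  case Seq(x)            => s"one element: $x"
  case Seq(x, rest @ _*) => s"starts with $x, then ${rest.length} more"
}

println(describe(List(1, 2, 3))) // prints "starts with 1, then 2 more"
println(describe(Vector(7)))     // prints "one element: 7"
println(describe(Nil))           // prints "empty"
```

The cases are tried in order, so the more specific shapes come before the general one.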

Besides tuples and collections, we can also match against case classes. Let's start by defining a case class representing a name:

scala> case class Name(first: String, last: String)
defined class Name

scala> val name = Name("Martin", "Odersky")
name: Name = Name(Martin,Odersky)

We can match against instances of Name in much the same way we matched against tuples:

scala> val Name(firstName, lastName) = name
firstName: String = Martin
lastName: String = Odersky

All these patterns can also be used in match expressions:

scala> def greet(name: Name) = name match {
  case Name("Martin", "Odersky") => "An honor to meet you"
  case Name(first, "Bugnion") => "Wow! A family member!"
  case Name(first, last) => s"Hello, $first"
}
greet: (name: Name)String
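To see how the cases are selected, here is a self-contained version of greet with a few calls. The cases are tried from top to bottom and the first one that matches wins. The sample names passed in are illustrative only.

```scala
case class Name(first: String, last: String)

def greet(name: Name): String = name match {
  case Name("Martin", "Odersky") => "An honor to meet you"   // both fields must match literally
  case Name(first, "Bugnion")    => "Wow! A family member!"  // any first name, fixed last name
  case Name(first, last)         => s"Hello, $first"         // fallback: matches every Name
}

println(greet(Name("Martin", "Odersky"))) // prints "An honor to meet you"
println(greet(Name("Pascal", "Bugnion"))) // prints "Wow! A family member!"
println(greet(Name("Ada", "Lovelace")))   // prints "Hello, Ada"
```

Because the last case matches any Name, this match can never fail at runtime.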