Scientific Computing with Scala

By: Vytautas Jancauskas

Overview of this book

Scala is a statically typed, Java Virtual Machine (JVM)-based language with strong support for functional programming. There exist libraries for Scala that cover a range of common scientific computing tasks – from linear algebra and numerical algorithms to convenient and safe parallelization to powerful plotting facilities. Learning to use these to perform common scientific tasks will allow you to write programs that are both fast and easy to write and maintain. We will start by discussing the advantages of using Scala over other scientific computing platforms. You will discover Scala packages that provide the functionality you have come to expect when writing scientific software. We will explore using Scala's Breeze library for linear algebra, optimization, and signal processing. We will then proceed to the Saddle library for data analysis. If you have experience in R or with Python's popular pandas library, you will learn how to translate those skills to Saddle. If you are new to data analysis, you will learn basic concepts of Saddle as well. We will explore the numerical computing environment called ScalaLab. It comes bundled with a lot of scientific software readily available. We will use it for interactive computing, data analysis, and visualization. In the following chapters, we will explore using Scala's powerful parallel collections for safe and convenient parallel programming. Topics such as the Akka concurrency framework will be covered. Finally, you will learn about multivariate data visualization and how to produce professional-looking plots in Scala easily. After reading the book, you should have more than enough information on how to start using Scala as your scientific computing platform.

Why Scala for scientific computing?


This book assumes a basic familiarity with the Scala language. If you do not know Scala but are interested in writing your scientific code in it, you should consider getting a companion book that teaches the basics of the language. Any nontrivial topics will be explained, but we do not provide an introduction to any of the basic Scala programming concepts here. We will assume that you have Scala installed and that you have your favorite IDE, or at least your favorite text editor, set up to write Scala programs. If not, we introduce using Emacs as a Scala IDE. It would also benefit you to be already familiar with other popular scientific computing systems.

A lot of the topics in the book will be far easier to understand and put to good use if you already know how to do the things in question in other systems: we will be covering functionality that is similar to the MATLAB interactive computing environment, NumPy scientific computing package for the Python programming language, pandas data analysis library for Python, statistical computing language R, and similar software. After reading the book, you will hopefully be able to get all the functionality of the aforementioned software from Scala and more!

What are the advantages compared to C/C++/Java?

One obvious advantage to using Scala is that it is a Java virtual machine language. It is one among several, including Clojure, Groovy, Jython, JRuby, and of course Java itself. This means that, after writing your program, you compile it to Java virtual machine bytecode that is then executed by the Java virtual machine. Think of the bytecode as the machine code of a virtual computer. When you write programs in C/C++ and similar compiled languages, they are translated straight to machine code that you then execute directly on your computer's processor. If you then want to run such a program on a different processor, you have to recompile it. Since the Java virtual machine runs on many different computer architectures, this is no longer necessary.

After you compile your program, the resulting bytecode can then be run on any system that can run a Java virtual machine. Therefore, your compiled code is portable. This is one advantage to using Scala as opposed to C/C++. Why not just write your program in Java then? Java is designed for writing large software in teams consisting of many programmers of varying skill levels. As such, it is an incredibly bureaucratic language. Quickly realizing your ideas in Java is difficult. This is because, even in the simplest cases, there is a lot of boilerplate code involved. The language is designed to slow you down and force you to do things by the book, designing the software before you start writing it.

Scala has the additional advantage of interactivity. It is easy to write and run small Scala programs, test them interactively, make changes, and then repeat. This is essential for scientific code, where a lot of the time you are testing an idea out and you want to do it quickly so that, if it does not work out, you can move on to another idea. As an added bonus, you can use any of the many Java libraries from Scala with ease! Since Java is very widely used in the industry, it contains a plethora of libraries for various purposes. These can be accessed from any JVM-based language. Most often, new functional programming languages don't share this advantage (since they are not JVM-based).

Scala also has strong support for functional programming. Functional programming treats programming as the evaluation of mathematical functions and avoids changing variable state explicitly. This leads to a declarative programming style where (ideally) the intention, rather than an explicit procedure, is given by the programmer. This (partially) eliminates the need for side effects—changes in the program state. Eliminating side effects leads to a programming style that is less error-prone and makes it easier to understand and predict program behavior. This has important consequences such as the easy and automatic parallelization of programs, program verification, and so on.

Parallelization is becoming more important with the increasing number of CPU cores in computers. Parallel programming in imperative languages involves a lot of very subtle issues that few programmers fully understand. So, there is hope that functional programming can help in this regard.

Pure functional programming often feels restrictive to programmers who are used to the more common imperative style. As a consequence of this, Scala supports both programming styles. You can start programming more or less as you would in Java and slowly incorporate more advanced features of the language into your programs. This removes a lot of the seemingly intimidating nature of functional programming since the concepts can be incorporated when needed and where they fit best. This may annoy functional programming purists, but is great for the more pragmatically minded.
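As a minimal sketch of the two styles living side by side, the same calculation (a sum of squares) can be written in Scala both ways. The names `SumOfSquares`, `imperative`, and `functional` are made up for this illustration.

```scala
object SumOfSquares {
  // Imperative style: a mutable accumulator updated inside a loop,
  // much as one would write it in Java.
  def imperative(xs: Array[Int]): Int = {
    var result = 0
    for (x <- xs) {
      result += x * x
    }
    result
  }

  // Functional style: square every element, then add the squares up.
  def functional(xs: Array[Int]): Int =
    xs.map(x => x * x).sum
}
```

Both methods return the same value for the same input, so one can start with the first form and move to the second when it feels natural.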

Here is a small comparison of Java and Scala code that takes an array, squares the elements, and adds them together. This is not an uncommon pattern (in one form or another) in numerical code. It will serve as a small example of Scala's conciseness compared to Java. A single example is by no means proof that Scala is always more concise than Java, but in practice it very often is.

Scala code:

val arr = (0 until 10).toArray
arr.map(x => x * x).reduce(_+_)

Java code:

int arr[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
int result = 0;
for (int a: arr) {
  result += a * a;
}

In the Scala version, we think descriptively in functions that are applied to the array to get the result we want. These functions take other functions as arguments. In the Java version, we think imperatively—what actions have to be taken to get the result we need?
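To make the idea of functions taking other functions as arguments more concrete, here is a small sketch. The helper `sumOf` and the object `HigherOrder` are hypothetical names introduced only for this example.

```scala
object HigherOrder {
  // Applies `f` to every element of `xs` and sums the results.
  // `f` is itself a function, passed in like any other argument.
  def sumOf(xs: Array[Int])(f: Int => Int): Int =
    xs.map(f).sum
}
```

For instance, `HigherOrder.sumOf((0 until 10).toArray)(x => x * x)` computes the sum of the squares of 0 through 9, and swapping in `x => x * x * x` gives the sum of cubes without touching the helper.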

What are the advantages compared to MATLAB/Python/R?

You may object that a lot of what has been said earlier can also be said of languages such as Python, MATLAB, R, or any other interpreted language. They all support different programming styles; when you write your program in, say, Python, it will run on any platform that can run a Python interpreter. So, why not just use those? Well, one answer is execution speed. Many will object, and have objected before, that speed is not their primary concern when writing scientific software. That is true, but only up to a point. It isn't easy to convince people of this, but I have grown convinced of it myself. The usual workflow in languages using dynamic typing is outlined here:

  • Prototype your numerically intensive code in your favorite language.

  • Note that it will take 3 years to complete a single run of the program in its current state.

  • Use a profiler to identify bottlenecks. The profiler indicates that everything is a bottleneck.

  • Rewrite the most performance critical parts in C or C++.

  • Battle the foreign function interface or whatever other method your language provides for calling C/C++ functions.

  • End up with a C/C++ program wrapped in a couple of lines of your favorite programming language.

The aforementioned is obviously a caricature. However, the important point is that the process described here adds at least two extra nontrivial stages to the already complicated process of writing software; not any software, but software you usually have no clear specifications for. On top of that, you are often not sure if what you are doing is sensible (which is the case for most scientific software in my experience).

The two extra stages are using the profiler, which is a tool that identifies portions of the code your program spends time in, and embedding code written in C/C++ (or some other statically typed language) in your program. People often will use a profiler on programs written in languages such as C++ or Java as well. But the reason for using it is usually that you want to squeeze the last few drops of performance out of it and not just make the software usable. The result of this is that all of the advantages of your nice dynamically typed programming language are reduced to nothing.

These advantages are supposed to be the speed of development and being able to make changes quickly. However, you end up spending time profiling software, rewriting nice bits of your code into ugly efficient bits, and finally just writing most of the thing in C. None of this sounds or is fun. There are workarounds to this, but why would you be content with this procedure? Why can't you just write your program and have it behave sensibly from the very start? Some will object that you should not be using a programming language such as Python for performance-critical code. This is true. However, most people learn one language, get used to its libraries, and will tend to write all their code in it.

You may very well end up with something that is not usable without a lot of extra effort this way. Using languages designed for the speed of execution (so called systems programming languages) is certainly possible. They, however, have many other disadvantages. The primary disadvantage is that prototyping is very hard in them. So is realizing your ideas quickly.

So, how does Scala help? Why is it faster than, say, Python, and by how much? Where does dynamic versus static typing come into this? A simple way to see how much faster one language is when compared to another is to use some kind of benchmark suite. An interesting comparison is provided on the following website:

http://benchmarksgame.alioth.debian.org/

You can visit the website to make sure what is said here is true. In it, several different algorithms implemented in each different language are compared in terms of execution speed, memory use, and so on. Java is evaluated against C in the results. It can be seen that Java is comparable to C in terms of execution speed. Even though it is slower, it is usually not slower by much. In two cases out of eleven, it is actually faster. The comparisons that are more interesting are between Python 3 and Java, and Scala and Java.

Python is a very popular language in scientific computing. So, how does it stack up? In five cases out of eleven, it is actually around 40 times slower than Java. This is a lot. If your calculations take 10 minutes with Java, you would have to wait almost 7 hours if you wrote them in Python (assuming a linear relationship, which I feel is a fair assumption in this case). Scala is much better in this regard.
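The arithmetic behind that estimate can be checked with a few lines of Scala. This is only a sketch under the linear-scaling assumption stated above; `Slowdown`, `scaledMinutes`, and `toHours` are illustrative names.

```scala
object Slowdown {
  // Scales a run time by a slowdown factor, assuming run time grows linearly.
  def scaledMinutes(baseMinutes: Double, factor: Double): Double =
    baseMinutes * factor

  // Converts minutes to hours.
  def toHours(minutes: Double): Double = minutes / 60.0
}
```

With a 10 minute baseline and a factor of 40, `scaledMinutes` gives 400 minutes, which `toHours` turns into roughly 6.7 hours.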

In most cases, its speed is comparable to that of Java. This is good news, since Java is a fairly fast language. This means that you can write your code in the clearest way possible, and it will still run fast. If you want to squeeze some extra juice out of it, you can always profile it using one of the profiling tools we will discuss later. Would the same apply in the case of MATLAB and R? Well, the website does not benchmark those languages, but one would imagine so. Those are both dynamic languages as well.

So what is a dynamic language and a static language? Why is one slower than the other? What other advantages or disadvantages are there in using one over the other? The simplest way to describe it is this: in a dynamically typed language, variables are bound to objects only, and in a statically typed language variables are bound to both object and type.

When you program in a dynamically typed language, you can usually assign anything you want to any variable. In a statically typed language, a variable's type is declared in advance (or in the case of Scala can be inferred from context).
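A short sketch of what declared and inferred types look like in Scala follows. The object and value names are invented for the example.

```scala
object Typing {
  // Explicitly declared type.
  val steps: Int = 1000

  // Type inferred from the right-hand side: the compiler treats `dt` as a Double.
  val dt = 0.01

  // Inferred as Array[Double], since every element is a Double.
  val samples = Array(0.5, 1.5, 2.5)

  // The line below would be rejected at compile time,
  // because a String is not an Int:
  // val bad: Int = "one thousand"
}
```

Even where no type is written, each value still has exactly one static type, fixed when the code is compiled.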

In practice, it follows from this that the compiler can optimize the code much better, since it can use optimizations specific for that type. For example, this happens with Java's numeric types, where they are compiled to JVM arithmetic opcodes instead of more general method calls. In a dynamic language, the type often has to be determined at runtime and there are often other checks as well. All of this consumes CPU cycles.

Furthermore, calling functions and methods as well as accessing object attributes is much faster in static languages than in dynamic ones. A compiler is also capable of catching type errors. In a dynamic language such as Python, nothing prevents you from calling any method with any arguments on any object. This leads to problems since these errors are only caught at runtime.

It can easily happen that your program will fail near the end of a 2-hour run just because you forgot that you made changes to a method's argument list. In statically typed languages, these types of error will be caught at compile time. As an added bonus, good IDEs are easier to implement for static languages than for dynamic ones. This is because the code itself provides a lot of useful information that the IDE can use to provide functionality you expect from a modern IDE. This includes autocompletion, listing available methods for an object, and so on.
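The scenario of a changed argument list might look like the following sketch. The `Simulation` object and its `run` method are hypothetical and exist only to illustrate the point.

```scala
object Simulation {
  // Hypothetical method whose argument list was recently changed:
  // an earlier version took only `steps`; `dt` was added later.
  def run(steps: Int, dt: Double): Double = {
    var t = 0.0
    var i = 0
    while (i < steps) {
      t += dt
      i += 1
    }
    t // total simulated time
  }

  // A stale call site such as the one below no longer compiles,
  // so the mistake surfaces before the program is ever run:
  // val total = run(1000)

  // The updated call type-checks.
  val total: Double = run(1000, 0.01)
}
```

In a dynamically typed language, the equivalent stale call would only fail once execution actually reached it.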

Let's recap what was said so far—the main advantage to using Scala for scientific code is that you can write what you mean, and it will usually work fast. There is no need for elaborate and often wonky strategies employed to optimize code in other languages. This will result in readable, easy-to-understand code, and you will not lose any of the advantages of dynamic languages.

Scala is quick to develop in and easy to understand. I think these are the main reasons why you should consider it as your main scientific computing language. This is especially true if you write your own numerical code or code that is generally fairly complex and where you can't rely on fast libraries to provide most of the functionality.

Scala does parallelism well

Parallel execution of code is very important in scientific computing. Often scientists want to model a certain physical phenomenon. Simulations of the physical world take a long time. Since the primary method of increasing computer performance is adding more CPU cores, parallelizing your algorithms is becoming the main way of reducing the amount of time it takes your program to do the things you want of it.

Another aspect of this is running code on supercomputers where algorithms are split up into several tasks that usually communicate by passing messages to each other. Programs written in imperative style are generally tricky to parallelize. Scala has strong support for functional programming. In general, the declarative nature of programs written in functional programming languages makes them easier to parallelize. The main reason for this is that functional programming languages avoid side effects.

Side effects are explicit changes to state, such as assigning to variables or writing to files or devices. Avoiding side effects avoids common pitfalls in parallel programming such as race conditions, deadlocks, and so on. While no technique avoids these problems completely, declarative programming languages are much better suited to handle these issues.
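The difference between a function with a side effect and one without can be sketched as follows. The names used (`SideEffects`, `counter`, `addToCounter`, `total`) are illustrative only.

```scala
object SideEffects {
  // Impure: every call changes `counter`, state that lives outside the function.
  // If several threads called this at once, their updates could interfere.
  var counter = 0
  def addToCounter(x: Int): Unit = {
    counter += x
  }

  // Pure: the result depends only on the argument, and nothing else is modified.
  def total(xs: Seq[Int]): Int = xs.foldLeft(0)(_ + _)
}
```

Because `total` touches no shared state, separate calls to it can run at the same time without any coordination.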

Scala supports parallel collections that make it easy to carry out concurrent calculations. Another option is to use the Akka toolkit that supports several ways of carrying out calculations in parallel. Both these options will be discussed in detail in the following chapters.
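As a rough preview of the idea, the sketch below splits a sum of squares into independent tasks. Note that it uses `Future` from the Scala standard library rather than parallel collections or Akka, because both of those need an additional library dependency in recent Scala versions; the object name `ParallelSum`, the `chunkSize` parameter, and the 10 second timeout are assumptions made for this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ParallelSum {
  // Sum of squares computed in independent chunks.
  // Each task only reads its own slice and returns a value,
  // so the tasks share no mutable state.
  // `chunkSize` must be greater than zero.
  def sumOfSquares(xs: Vector[Long], chunkSize: Int): Long = {
    val partials: List[Future[Long]] =
      xs.grouped(chunkSize).map { chunk =>
        Future(chunk.map(x => x * x).sum)
      }.toList

    // Combine the partial sums once every task has finished.
    val all: Future[Long] = Future.sequence(partials).map(_.sum)
    Await.result(all, 10.seconds)
  }
}
```

Since each chunk is processed by a pure computation, the order in which the tasks finish does not affect the result.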

Any downsides?

There is currently one big downside to using Scala for scientific computing, and many would consider it a crucial one: there aren't many well-established packages for scientific computing available for it. While the core language is solid, without an established infrastructure of libraries, there is only so much you can do on your own.

The situation in this regard is a lot better in other systems. This is especially true of Python, which has more scientific computing libraries than you can shake a stick at. But it is also true of MATLAB and others. Such is the nature of the vicious cycle of popularity: systems are popular because they have many libraries for doing different things, and they have many libraries because they are popular.

Scala isn't yet an established language in this regard. I believe, however, that it deserves to be, and maybe this book will help it towards that goal. With enough people using Scala for scientific computing, we will eventually see more libraries developed and existing ones being better supported and more actively maintained.