Mastering Julia


Introduction


Julia was first released to the world in February 2012 after a couple of years of development at the Massachusetts Institute of Technology (MIT).

The principal developers, Jeff Bezanson, Stefan Karpinski, Viral Shah, and Alan Edelman, all still maintain active roles in the language. They are responsible for the core and have also authored or contributed to many of the packages.

The language is open source, so all of the code is available to view. There is a small amount of C/C++ code plus some Lisp and Scheme, but much of the core is (very well) written in Julia itself and may be perused at your leisure. If you wish to write exemplary Julia code, this is a good place to go in order to seek inspiration. Towards the end of this chapter, we will have a quick run-down of the Julia source tree as part of exploring the Julia environment.

Julia is often compared with programming languages such as Python, R, and MATLAB. It is important to realize that Python and R have been around since the mid-1990s and MATLAB since 1984. Since MATLAB is proprietary (® MathWorks), there are a few clones, particularly GNU Octave, which dates from the same era as Python and R. Just how far Julia has come in a much shorter time is a tribute to the original developers and the many enthusiastic ones who have followed on.

Julia uses GitHub as a repository both for its source and for the registered packages. While it is useful to have Git installed on your computer, normal interaction is largely hidden from the user, since Julia incorporates a working version of Git wrapped up in a package manager (Pkg), which can be called from the console.

While Julia has no simple built-in graphics, there are several different graphics packages, and I will be devoting a later chapter to these.

Philosophy

Julia was designed with scientific computing in mind. The developers all tell us that they came with a wide array of programming skills: Lisp, Python, Ruby, R, and MATLAB. Some, like myself, even claim to originate as Perl hackers. However, all of them need a fast compiled language such as C or Fortran in their armory, as the languages listed previously are pitifully slow.

So, to quote the development team:

"We want a language that's open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that's homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.

(Did we mention it should be as fast as C?)"

http://julialang.org/blog/2012/02/why-we-created-julia

With the introduction of Low-Level Virtual Machine (LLVM) compilation, it has become possible to achieve this goal and to design a language that, from the outset, makes the two-language approach (prototyping in a slow dynamic language and rewriting the critical parts in C or Fortran) largely redundant.

Julia was designed as a language similar to other scripting languages and so should be easy to learn for anyone familiar with Python, R, or MATLAB. It is syntactically closest to MATLAB, but it is important to note that it is not a drop-in clone. There are many important differences, which we will look at later.

It is important not to think of Julia only as a challenger to Python and R. In fact, we will illustrate instances where the languages are used to complement each other. Julia was certainly not conceived as a challenger, and there are certain things it does that make it ideal for use in the scientific community.

Role in data science and big data

Julia was initially designed with scientific computing in mind. Although the term "data science" was coined as early as the 1970s, it was only given prominence in 2001, in an article by William S. Cleveland, Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. Almost in parallel with the development of Julia has been the growth in data science and the demand for data science practitioners.

What is data science?

The following might be one definition:

Data science is the study of the generalizable extraction of knowledge from data. It incorporates varying elements and builds on techniques and theories from many fields, including signal processing, mathematics, probability models, machine learning, statistical learning, computer programming, data engineering, pattern recognition, learning, visualization, uncertainty modeling, data warehousing, and high-performance computing with the goal of extracting meaning from data and creating data products.

If this sounds familiar, then it should. These were the precise goals laid out at the onset of the design of Julia. To fill the void, most data scientists have turned to Python and, to a lesser extent, to R. One principal cause of the growth in popularity of Python and R can be traced directly to the interest in data science.

So, what we set out to achieve in this book is to show you, as a budding data scientist, why you should consider using Julia and, if convinced, how to do it.

Along with data science, the other "new kids on the block" are big data and the cloud. Big data was originally the realm of Java, largely because of the uptake of the Hadoop/HDFS framework, which, being written in Java, made it convenient to program MapReduce algorithms in Java or in any other language that runs on the JVM. This style of programming leads to an obscene amount of bloated boilerplate coding.

However, here, with the introduction of YARN and Hadoop stream processing, the paradigm of processing big data is opened up to a wider variety of approaches. Python is beginning to be considered an alternative to Java, but upon inspection, Julia makes an excellent candidate in this category too.

Comparison with other languages

Julia has a reputation for speed. The home page of the main Julia website, as of July 2014, includes references to benchmarks. The following table shows benchmark times relative to C (smaller is better, C performance = 1.0):

 

               Fortran  Julia  Python       R  MATLAB   Octave  Mathematica  JavaScript    Go
fib               0.26   0.91   30.37  411.31  1992.0  3211.81        64.46        2.18   1.0
mandel            0.86   0.85   14.19  106.97   64.58   316.95         6.07        3.49  2.36
pi_sum            0.80   1.00   16.33   15.42    1.29   237.41         1.32        0.84  1.41
rand_mat_stat     0.64   1.66   13.52   10.84    6.61    14.98         4.52        3.28  8.12
rand_mat_mul      0.96   1.01    3.41    3.98    1.10     3.41         1.16       14.60  8.51

Benchmarks can be notoriously misleading; indeed, to paraphrase the common saying: there are lies, damned lies, and benchmarks.

The Julia site does its best to lay down the parameters for these tests by providing details of the workstation used (processor type, CPU clock speed, amount of RAM, and so on) and the operating system deployed. For each test, the version of the software is provided plus any external packages or libraries; for example, for the rand_mat tests, Python uses NumPy, and C, Fortran, and Julia use OpenBLAS.

Julia provides a website for checking its performance: http://speed.julialang.org.

The source code for all the tests is available on GitHub. This is not just the Julia code but also that used in C, MATLAB, Python, and so on. Indeed, extra language examples are being added, and you will find benchmarks to try in Scala and Lua too:

https://Github.com/JuliaLang/julia/tree/master/test/perf/micro.
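To give a feel for what such a micro-benchmark looks like, here is a minimal sketch in the spirit of the fib row of the table. It is not copied from the repository; the use of fib(20) and of the @elapsed macro for timing are assumptions made for illustration, and the syntax assumes a current Julia 1.x release.

```julia
# Naive doubly recursive Fibonacci. It is deliberately inefficient, so the
# timing is dominated by function-call overhead rather than arithmetic.
fib(n) = n < 2 ? n : fib(n - 1) + fib(n - 2)

fib(20)                      # the first call includes JIT compilation
elapsed = @elapsed fib(20)   # time a second, already-compiled call

println("fib(20) = ", fib(20), " took ", elapsed, " seconds")
```

Timing the second call rather than the first keeps the one-off compilation cost out of the measurement, which is one of the many ways in which a careless benchmark can mislead.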

This table is useful in another respect too, as it lists all the major comparative languages of Julia. No real surprises here, except perhaps the range of execution times.

  • Python: This has become the de facto data science language, and the range of modules available is overwhelming. Both version 2 and version 3 are in common usage; the latter is NOT a superset of the former and is around 10% slower. In general, Julia is an order of magnitude faster than Python, which is why performance-critical Python code is so often compiled or rewritten in C.

  • R: Started life as an open source version of the commercial S+ statistics package (® TIBCO Software Inc.), but has largely superseded it for use in statistics projects and has a large set of contributed packages. It is single-threaded, which accounts for the disappointing execution times, and parallelization is not straightforward. R has very good graphics and data visualization packages.

  • MATLAB/Octave: MATLAB is a commercial product (® MathWorks) built around matrix operations, hence the reasonable times for the last two benchmarks, although the others are very long. GNU Octave is a free MATLAB clone. It has been designed for compatibility rather than efficiency, which accounts for its execution times being even longer.

  • Mathematica: Another commercial product (® Wolfram Research) for general-purpose mathematical problems. There is no obvious clone, although the Sage framework is open source and uses Python as its computation engine, so its timings are similar to Python.

  • JavaScript and Go: These are linked together since both have Google behind them. The JavaScript timings come from the Google V8 engine, which compiles to native machine code before executing it, and Go is a compiled language that also produces native code; hence the excellent performance timings. However, both languages are targeted more at web-based applications.

So, Julia would seem to be an ideal language for tackling data science problems. It's important to recognize that many of the built-in functions in R and Python are not implemented natively but are written in C. Julia performs roughly as well as C, so Julia won't do any better than R or Python if most of the work you do in R or Python calls built-in functions without performing any explicit iteration or recursion.

However, when you start doing custom work, Julia will come into its own. It is the perfect language for advanced users of R or Python, who are trying to build advanced tools inside of these languages. The alternative to Julia is typically resorting to C; R offers this through Rcpp, and Python offers it through Cython.
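The distinction between custom iteration and built-in functions can be sketched with the series used by the pi_sum benchmark. The function names below are hypothetical and the code assumes a current Julia 1.x release; it illustrates the two styles and is not the benchmark source itself.

```julia
# Explicit loop: the kind of hand-written iteration that is slow in R or
# Python but compiles to tight machine code in Julia.
function pi_sum(n)
    total = 0.0
    for k in 1:n
        total += 1.0 / (k * k)
    end
    return total
end

# The same sum expressed through built-in, library-level operations, which
# is the style that is already fast in R or Python.
pi_sum_builtin(n) = sum(1.0 ./ (1:n) .^ 2)

println(pi_sum(10_000))
println(pi_sum_builtin(10_000))
```

Both versions approximate π²/6. In Julia there is no need to contort an algorithm into the second form just to obtain acceptable speed.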

There is a possibility of more cooperation than competition between Julia and R and/or Python, although this is not the common view.

Features

The Julia programming language is free and open source (MIT licensed), and the source is available on GitHub.

To the veteran programmer, it looks and feels similar to MATLAB. Blocks created by the for, while, and if statements are all terminated by end rather than by endfor, endwhile, and endif or by using the familiar {} style syntax. However, it is not a MATLAB clone, and sources written for MATLAB will not run on Julia.
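As a small illustration of this block syntax, the following sketch counts the steps of the Collatz iteration. The function name collatz_steps is hypothetical and chosen only for this example.

```julia
# Every block (function, while, if/else, for) is closed by a plain `end`.
function collatz_steps(n)
    steps = 0
    while n != 1
        if iseven(n)
            n = div(n, 2)
        else
            n = 3n + 1
        end
        steps += 1
    end
    return steps
end

for n in 1:5
    println(n, " -> ", collatz_steps(n))
end
```

Note also the literal coefficient in 3n + 1, a piece of mathematical notation that MATLAB does not accept.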

The following are some of Julia's features:

  • Designed for parallelism and distributed computation (multicore and cluster)

  • C functions called directly (no wrappers or special APIs needed)

  • Powerful shell-like capabilities for managing other processes

  • Lisp-like macros and other meta-programming facilities

  • User-defined types are as fast and compact as built-ins

  • LLVM-based, just-in-time (JIT) compiler that allows Julia to approach and often match the performance of C/C++

  • An extensive mathematical function library (written in Julia)

  • Integrated mature, best-of-breed C and Fortran libraries for linear algebra, random number generation, Fast Fourier Transform (FFT), and string processing
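To make the macro item in this list a little more concrete, here is a minimal sketch assuming a current Julia 1.x release. The macro name @twice is hypothetical and exists only for this illustration.

```julia
# Code is data: quoting an expression yields an ordinary Expr object
# that can be inspected and manipulated like any other value.
ex = :(a + b * 2)
println(typeof(ex))
println(ex.head)
println(ex.args)

# A macro receives expressions and returns a new expression to be compiled.
# This one simply emits its argument twice.
macro twice(expr)
    return quote
        $(esc(expr))
        $(esc(expr))
    end
end

counter = Ref(0)          # a mutable box, so the example works in any scope
@twice counter[] += 1
println(counter[])
```

The call to esc keeps the user's expression referring to the caller's variables rather than to names inside the macro.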

Julia's core is implemented in C and C++, and its parser in Scheme; the LLVM compiler framework is used for the JIT generation of machine code.

The standard library is written in Julia itself and uses Node.js's libuv library for efficient, cross-platform I/O.

Julia has a rich language of types for constructing and describing objects, which can also optionally be used to make type declarations. It has the ability to define function behavior across many combinations of argument types via multiple dispatch, which is a cornerstone of the language's design.
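Multiple dispatch can be sketched in a few lines. The types and functions below (Shape, Circle, Rectangle, area, pair_kind) are hypothetical names invented for the example, and the struct and abstract type keywords assume a current Julia 1.x release; older releases spell these declarations differently.

```julia
abstract type Shape end

struct Circle <: Shape
    radius::Float64
end

struct Rectangle <: Shape
    width::Float64
    height::Float64
end

# One generic function, one method per argument type.
area(c::Circle) = pi * c.radius^2
area(r::Rectangle) = r.width * r.height

# Dispatch considers the types of *all* arguments, not just the first.
pair_kind(a::Circle, b::Circle) = "circle-circle"
pair_kind(a::Circle, b::Rectangle) = "circle-rectangle"
pair_kind(a::Shape, b::Shape) = "generic shape pair"

c = Circle(1.0)
r = Rectangle(2.0, 3.0)
println(area(r))
println(pair_kind(c, r))
println(pair_kind(r, c))   # no specific method, so the generic one is used
```

The most specific method matching the whole argument tuple is selected, which is what distinguishes this from the single dispatch of most object-oriented languages.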

Julia can utilize code in other programming languages by directly calling routines written in C or Fortran and stored in shared libraries or DLLs. This is a feature of the language syntax and will be discussed in detail later.
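As a brief foretaste, the following sketch calls two routines from the standard C library through ccall. It assumes that the C library's symbols are visible to the running Julia process, which is normally the case, and that the default C locale is in effect.

```julia
# strlen(const char *) -> size_t
len = ccall(:strlen, Csize_t, (Cstring,), "Julia")
println(Int(len))

# toupper(int) -> int, applied to the character code of 'a'
up = ccall(:toupper, Cint, (Cint,), Cint('a'))
println(Char(up))
```

The arguments to ccall are the function name, the return type, a tuple of argument types, and then the actual arguments; no glue code has to be written or compiled.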

In addition, it is possible to interact with Python via PyCall and this is used in the implementation of the IJulia programming environment.