In the last decade, data science has become a buzzword, with Harvard Business Review naming it the sexiest job of the 21st century. What is a data scientist? The answer was published in The Guardian (https://www.theguardian.com/careers/2015/jun/30/whats-a-data-scientist-and-how-do-i-become-one):
A data scientist takes raw data and marries it with analysis to make it accessible and more valuable for an organization. To do this, they need a unique blend of skills—a solid grounding in maths and algorithms and a good understanding of human behaviors, as well as knowledge of the industry they're working in, to put their findings into context. From here, they can unlock insights from the datasets and start to identify trends.
The technical skills of a data scientist are varied but, generally, they are good at programming, and have a very strong background in mathematics—especially statistics, skills in machine learning, and knowledge of big data. A data scientist is required to have in-depth understanding of the domain he/she is working in. Julia was designed for scientific and numerical computation. And with the advent of big data, there is a requirement to have a language that can work on huge amounts of data. Although we have Spark and MapReduce
(Hadoop) as processing engines that are generally used with Python, Scala, and Java, Julia with Intel's High Performance Analytics Toolkit can provide an alternative option. It may also be worth noting that Julia excels at parallel computing but is much easier to write and prototype than Spark/Hadoop.
One great feature of Julia is that it solves the 2-language problem. Generally, with Python and R, code that is doing most of the heavy workload is written in C/C++ and it is then called. This is not required with Julia, as it can perform comparably to C/C++. Therefore, complete code—including code that does heavy computations—can be written in Julia itself.
We mentioned the speed of Julia above, and that's what sets this language apart from traditional dynamically typed languages. Speed is its specialty. So, how fast can Julia be? The following micro-benchmark results were obtained on a single core (serial execution) on an Intel(R) Xeon(R) CPU E7-8850 2.00 GHz CPU with 1 TB of 1067 MHz DDR3 RAM running Linux:
Julia 0.4.0 | Python 3.4.3 | R 3.2.2 | MATLAB R2015b | Go go1.5 | Java 1.8.0_45 | |
| 2.11 | 77.76 | 533.52 | 26.89 | 1.86 | 1.21 |
| 1.45 | 17.02 | 45.73 | 802.52 | 1.20 | 3.35 |
| 1.15 | 32.89 | 264.54 | 4.92 | 1.29 | 2.60 |
| 0.79 | 15.32 | 53.16 | 7.58 | 1.11 | 1.35 |
| 1.00 | 21.99 | 9.56 | 1.00 | 1.00 | 1.00 |
| 1.66 | 17.93 | 14.56 | 14.52 | 2.96 | 3.92 |
| 1.02 | 1.14 | 1.57 | 1.12 | 1.42 | 2.36 |
These benchmark times are relative to C (smaller is better, C performance = 1.0). Benchmarks can be misleading and are not always true. Good coding practices need to be followed and exactly identical conditions are required to measure them side by side. Julia has been quite open about how it measured these benchmarks and the code is available at https://github.com/JuliaLang/julia/tree/master/test/perf/micro.
- Julia is significantly faster than Python. There is a huge difference in performance in the benchmarks. However, some libraries for numerical computation available to Python are written in C, and here it performs nearly equivalent to Julia.
- R was specifically designed for statisticians. It has a huge set of libraries for statistics and numerical computation and is available for free. It used to be the language of choice for data scientists (now Python is preferred). R is single-threaded and is a lot slower than Julia.
- MATLAB is not a free product. It comes with a paid license (students may get discounts). It is used by statisticians and academicians for some specific use cases. The above benchmarks run a lot slower on MATLAB.
- Go is designed from scratch for system programming. It was created by Google and the source code is available on GitHub, where it is actively developed. Go performs really well on these benchmarks, but it is not designed for numerical and scientific computing.
- Java performs well. It beats Julia in some benchmarks and Julia beats it in others. But we need to consider the development time associated with it. Julia is designed in a way that it can be used even for rapid prototyping. That makes it unique.
Julia is therefore well-suited to data science problems. Its ecosystem may not be as comprehensive as other languages right now, but it is growing at a great pace.