Julia for Data Science

By: Anshul Joshi

Overview of this book

Julia is a fast, high-performance language that is well suited to data science, with a mature package ecosystem, and it is now feature complete. Harvard Business Review famously called data scientist the sexiest job of the 21st century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). This book will help you get familiar with Julia's rich and continuously evolving ecosystem, allowing you to stay on top of your game. It covers the essentials of data science and gives a high-level overview of advanced statistics and techniques. You will generate insights using inferential statistics and reveal hidden patterns and trends using data mining, with practical coverage of statistics and machine learning throughout. You will learn to build statistical models and machine learning systems in Julia, complete with attractive visualizations. You will then delve into deep learning in Julia and get to know the Mocha.jl framework, with which you can create artificial neural networks and implement deep learning. This book addresses the challenges of real-world data science problems, including data cleaning, data preparation, inferential statistics, statistical modeling, building high-performance machine learning systems, and creating effective visualizations using Julia.

Package management


Julia provides a built-in package manager. Using Pkg we can install libraries written in Julia. For external libraries, we can also compile them from their source or use the standard package manager of the operating system. A list of registered packages is maintained at http://pkg.julialang.org.

Pkg is provided in the base installation. The Pkg module contains all the package manager commands.

Pkg.status() – package status

Pkg.status() prints a list of the currently installed packages with a summary. This is handy when you need to know whether the package you want to use is installed.

The first time a Pkg command is run, the package directory is created automatically. Pkg.status() needs this directory in order to return a valid list of installed packages. The packages that Pkg.status() lists are registered versions managed by Pkg.

Pkg.installed() can also be used to return a list of all the installed packages with their versions.
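
As a minimal sketch of the two calls above, assuming the Julia 0.4-era Pkg API that this section describes (where Pkg is available without an import and Pkg.installed() returns a dictionary keyed by package name), the following checks for a single package. The name "IJulia" is only an example.

```julia
# Assumes the Julia 0.4-era Pkg API, where Pkg is available without an import.

# Print a human-readable summary of the installed packages.
Pkg.status()

# Pkg.installed() returns a dictionary mapping package names to versions.
installed = Pkg.installed()

# "IJulia" is only an example name; substitute the package you care about.
if haskey(installed, "IJulia")
    println("IJulia is installed at version ", installed["IJulia"])
else
    println("IJulia is not installed")
end
```

Looking the name up in the dictionary avoids parsing the printed output of Pkg.status().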

Pkg.add() – adding packages

Julia's package manager is declarative: you tell it what you want, and it works out which versions to install and resolves any dependencies. We therefore only need to maintain the list of requirements, and Pkg determines which packages and versions satisfy it.

The ~/.julia/v0.4/REQUIRE file contains the package requirements. We can open it in a text editor such as vi or Atom, or run Pkg.edit() in the Julia REPL to edit it. After editing the file, run Pkg.resolve() to install or remove packages so that the installation matches the requirements.
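
To make the format concrete, here is a sketch of what such a REQUIRE file might contain, assuming the Julia 0.4 convention of one requirement per line with an optional minimum version after the package name. The package names and the version number are purely illustrative.

```
IJulia
DataFrames 0.6
```

Read this way, the first line asks for any version of IJulia and the second for DataFrames 0.6 or later.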

We can also use Pkg.add(package_name) to add packages and Pkg.rm(package_name) to remove them. Earlier, we used Pkg.add("IJulia") to install the IJulia package.

When we no longer want a package installed on our system, Pkg.rm() removes its requirement from the REQUIRE file. Like Pkg.add(), which first adds the requirement, Pkg.rm() first updates the REQUIRE file and then runs Pkg.resolve() so that the set of installed packages matches.
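
The add and remove cycle can be sketched as follows, again assuming the Julia 0.4-era Pkg API and a working network connection. "DataFrames" is used here only as an illustrative package name.

```julia
# Assumes the Julia 0.4-era Pkg API, where Pkg is available without an import.

# Add the requirement to REQUIRE and install the package with its dependencies.
Pkg.add("DataFrames")
println("Installed after add: ", haskey(Pkg.installed(), "DataFrames"))

# Remove the requirement from REQUIRE. The package itself is uninstalled
# unless another required package still depends on it.
Pkg.rm("DataFrames")
println("Installed after rm: ", haskey(Pkg.installed(), "DataFrames"))
```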

Working with unregistered packages

We often want to use packages written by team members, or published in a Git repository by someone else, that are not registered with Pkg. Julia supports this through cloning: Julia packages are hosted in Git repositories and can be cloned using any mechanism that Git supports. The index of registered packages is maintained in METADATA.jl. For unregistered packages, we can use the following:

Pkg.clone("git://example.com/path/unofficialPackage/Package.jl.git") 

Unregistered packages sometimes have dependencies that must be satisfied before use. In that case, a REQUIRE file is needed at the top of the package's source tree. This file determines the unregistered package's dependencies on registered packages. When we run Pkg.clone(url), these dependencies are installed automatically.

Pkg.update() – package update

It is good practice to keep packages up to date. Julia is under active development, so its packages are updated frequently and gain new functionality.

To update all of the packages, type the following:

Pkg.update() 

Under the hood, Pkg.update() pulls new changes into the METADATA repository in ~/.julia/v0.4/ and checks for registered package versions published since the last update. If there are any, it attempts to update the packages that are checked out on a branch and are not dirty (that is, have no uncommitted changes). The update process then satisfies the top-level requirements by computing the optimal set of package versions to install. Packages that must be installed at specific versions are declared in the REQUIRE file in Julia's package directory (~/.julia/v0.4/).

METADATA repository

Registered packages are downloaded and installed using the official METADATA.jl repository. A different METADATA repository location can also be provided if required:

julia> Pkg.init("https://julia.customrepo.com/METADATA.jl.git", "branch") 

Developing packages

Julia lets us view the source code of installed packages and, because each package is tracked by Git, its full development history is available. We can also make our own changes and commit them to our own repository, or fix bugs and contribute enhancements upstream.

You may also want to create your own packages and publish them at some point in time. Julia's package manager allows you to do that too.

Git must be installed on the system, and the developer needs an account with a hosting provider of their choice (GitHub, Bitbucket, and so on). Communicating over SSH is preferred; to enable it, upload your public SSH key to your hosting provider.

Creating a new package

It is preferable to have a REQUIRE file in the package repository. At a bare minimum, it should state the required Julia version.

For example, to create a new Julia package called HelloWorld, we would run the following:

Pkg.generate("HelloWorld", "MIT") 

Here, HelloWorld is the package that we want to create and MIT is the license that our package will have. The license should be known to the package generator.

This creates the directory ~/.julia/v0.4/HelloWorld and initializes it as a Git repository. All the files that the package requires are generated in this directory and then committed to the repository.

This can now be pushed to the remote repository for the world to use.
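
As a rough illustration of what the package's main source file might grow into, here is a minimal sketch. It assumes the conventional location src/HelloWorld.jl inside the package directory. The greet function is hypothetical and is not produced by the generator.

```julia
# Hypothetical contents of ~/.julia/v0.4/HelloWorld/src/HelloWorld.jl
module HelloWorld

export greet

# Return a greeting for the given name (illustrative, not generated by Pkg).
greet(name) = "Hello, $(name)!"

end # module

# Qualified call, so no `using` statement is needed when run as a script.
println(HelloWorld.greet("Julia"))  # prints "Hello, Julia!"
```

Keeping the module name identical to the package name is what lets users load the package by that name once it is installed.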