Julia for Data Science

By: Anshul Joshi

Overview of this book

Julia is a fast, high-performance language that is well suited to data science; it has a mature package ecosystem and is now feature complete. It is a good tool for a data science practitioner. A well-known Harvard Business Review article called data scientist the sexiest job of the 21st century (https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). This book will help you get familiarised with Julia's rich and continuously evolving ecosystem, allowing you to stay on top of your game. It covers the essentials of data science and gives a high-level overview of advanced statistics and techniques. You will generate insights by performing inferential statistics and reveal hidden patterns and trends using data mining, with practical coverage of statistics and machine learning throughout. You will develop the knowledge to build statistical models and machine learning systems in Julia, along with attractive visualizations. You will then delve into deep learning in Julia and learn the Mocha.jl framework, with which you can create artificial neural networks and implement deep learning. This book addresses the challenges of real-world data science problems, including data cleaning, data preparation, inferential statistics, statistical modeling, building high-performance machine learning systems, and creating effective visualizations using Julia.

Preface

"Data Scientist: The Sexiest Job of the 21st Century" is how Harvard Business Review described the role. And why Julia? A high-level language with a large scientific community and performance comparable to C, it is touted as the next best language for data science. Using Julia, we can create statistical models, highly performant machine learning systems, and attractive visualizations.

What this book covers

Chapter 1, The Groundwork – Julia’s Environment, explains how to set up Julia’s environment (the command-line REPL and Jupyter Notebook) and covers Julia’s ecosystem, why Julia is special, and package management. It also gives an introduction to parallel processing and multiple dispatch and explains how Julia is suited for data science.

Chapter 2, Data Munging, explains the need for and process of data preparation, also called data munging. Data munging refers to changing data from one state to another in well-defined, reversible steps. It prepares data for use in analytics and visualizations.

Chapter 3, Data Exploration, explains that statistics is the core of data science and shows that Julia provides various statistical functions. This chapter will give a high-level overview of statistics and will explain the techniques required to apply those statistical concepts to general problems using Julia’s statistical packages, such as Stats.jl and Distributions.jl.

Chapter 4, Deep Dive into Inferential Statistics, continues the theme that statistics is the core of data science and that Julia provides various statistical functions. This chapter will give a high-level overview of advanced statistics and will then explain the techniques required to apply those statistical concepts to general problems using Julia’s statistical packages, such as Stats.jl and Distributions.jl.

Chapter 5, Making Sense of Data Using Visualization, explains why data visualization is an essential part of data science and how it makes communicating results more effective and helps reach a larger audience. This chapter will go through the Vega, Asciiplot, and Gadfly packages of Julia, which are used for data visualization.

Chapter 6, Supervised Machine Learning, opens with Tom M. Mitchell's definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." Machine learning is a field of study that gives computers the ability to learn and improve without being explicitly programmed. This chapter will explain that Julia is a high-level language with great performance and is well suited to machine learning. It will focus on supervised machine learning algorithms such as Naive Bayes, regression, and decision trees.

Chapter 7, Unsupervised Machine Learning, explains that unsupervised learning is a little bit different and harder than supervised learning. The aim is to get the system to learn something but we don’t know what it will learn. This chapter will focus on unsupervised learning algorithms such as clustering.

Chapter 8, Creating Ensemble Models, explains that a group of people has the ability to take better decisions than a single individual, especially when each group member comes in with their own biases. This is also true for machine learning. This chapter will focus on a machine learning technique called ensemble learning, an example being random forest.

Chapter 9, Time Series, explains that the capacity to perform decision modeling and analysis is a crucial component of some real-world applications, ranging from emergency medical treatment in intensive care units to military command and control systems. This chapter focuses on time series data and forecasting using Julia.

Chapter 10, Collaborative Filtering and Recommendation System, explains that every day we are confronted with decisions and choices. These can range from our clothes to the movies we watch or what to eat when we order online. We take decisions in business too. For instance, which stock should we invest in? What if decision making could be automated and suitable recommendations could be given to us? This chapter focuses on recommendation systems and techniques such as collaborative filtering and association rule mining.

Chapter 11, Introduction to Deep Learning, explains that deep learning refers to a class of machine learning techniques that perform unsupervised or supervised feature extraction and pattern analysis or classification by exploiting multiple layers of non-linear information processing. This chapter will introduce us to deep learning in Julia. Deep learning is a new branch of machine learning with one goal: artificial intelligence. We will also learn about Julia's framework, Mocha.jl, with which we can implement deep learning.

What you need for this book

The reader will require a system (64-bit recommended) running a fairly recent operating system (Linux, Windows 7+, or Mac OS), with a working Internet connection and the privileges to install Julia, Git, and the various packages used in the book.

Who this book is for

This book is aimed at data analysts and aspiring data scientists with little to no grounding in the fundamentals of the Julia language who want to explore how to conduct data science with Julia's ecosystem of packages. It is also for competent Python or R users looking to use Julia to do data science more efficiently. A good background in statistics and computational mathematics is expected.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Julia also provides another function, summarystats()."

A block of code is set as follows:

ci(x::HypothesisTests.FisherExactTest) 
ci(x::HypothesisTests.FisherExactTest, alpha::Float64)
ci(x::HypothesisTests.TTest) 
ci(x::HypothesisTests.TTest, alpha::Float64)

Any command-line input or output is written as follows:

julia> Pkg.update() 
julia> Pkg.add("StatsBase")

New terms and important words are shown in bold.

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

  1. Log in to or register on our website using your e-mail address and password.

  2. Hover the mouse pointer over the SUPPORT tab at the top.

  3. Click on Code Downloads & Errata.

  4. Enter the name of the book in the Search box.

  5. Select the book for which you're looking to download the code files.

  6. Choose from the drop-down menu where you purchased this book.

  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows

  • Zipeg / iZip / UnRarX for Mac

  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Julia-for-data-science. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/default/files/downloads/JuliaforDataScience_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.