Hands-On Data Science with Anaconda

By: Yuxing Yan, James Yan

Overview of this book

Anaconda is an open source platform that brings together the best tools for data science professionals, with more than 100 popular packages supporting languages such as Python and R. Hands-On Data Science with Anaconda gets you started with Anaconda and demonstrates how you can use it to perform data science operations in the real world. The book begins with setting up the environment for the Anaconda platform in order to make it accessible for tools and frameworks such as Jupyter, pandas, matplotlib, Python, R, Julia, and more. You’ll walk through the package manager, Conda, with which you can automatically manage all packages, including cross-language dependencies, and work across Linux, macOS, and Windows. You’ll explore the essentials of data science and linear algebra needed to perform data science tasks using packages such as SciPy, contrastive, scikit-learn, Rattle, and Rmixmod. Once you’re accustomed to all this, you’ll move on to core data science operations such as cleaning, sorting, and classifying data. You’ll then learn how to perform tasks such as clustering, regression, and prediction, and how to build machine learning models and optimize them. In addition, you’ll learn how to visualize data using the packages available for Julia, Python, and R.
Table of Contents (15 chapters)

What this book covers

Chapter 1, Ecosystem of Anaconda, introduces some basic concepts, such as the reasons why we use Anaconda and the advantages of using the full-fledged Anaconda and/or its lightweight version, Miniconda. Then, it covers how to use Anaconda online, without installation. We also test a few simple programs written in R, Python, Julia, and Octave.

Chapter 2, Anaconda Installation, shows how to install Anaconda, test whether the installation was successful, launch Jupyter and use it to run Python, launch Spyder and R, and find help. Most of these concepts and procedures are quite basic, so readers who are already confident with them can skip this chapter and go directly to the next.

Chapter 3, Data Basics, discusses sources of open data, which include the Bureau of Labor Statistics, the Census Bureau, Professor French’s Data Library, the Federal Reserve’s Data Library, and the UCI (University of California at Irvine) Machine Learning Repository. After that, it explains how to input data; how to deal with missing data; how to sort, slice, and dice datasets; how to merge different datasets; and how to output data. For different languages, such as Python, R, Julia, and Octave, several relevant packages for data manipulation are introduced and discussed.
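To give a flavor of the kind of data manipulation this chapter covers, here is a minimal pandas sketch of handling missing data, sorting, and merging. The dataset and column names are hypothetical, not taken from the book:

```python
import pandas as pd
import numpy as np

# Hypothetical toy dataset with one missing price.
prices = pd.DataFrame({
    "ticker": ["AAPL", "IBM", "MSFT"],
    "price": [170.0, np.nan, 310.0],
})
sectors = pd.DataFrame({
    "ticker": ["AAPL", "IBM", "MSFT"],
    "sector": ["Tech", "Tech", "Tech"],
})

# Fill the missing price with the column mean, sort by price,
# then merge in the sector information on the shared key.
prices["price"] = prices["price"].fillna(prices["price"].mean())
merged = prices.sort_values("price").merge(sectors, on="ticker")
print(merged)
```

The same fill/sort/merge workflow has close analogues in R (`merge`, `order`), Julia (DataFrames.jl), and Octave.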

Chapter 4, Data Visualization, discusses various types of visual presentations, which include simple graphs, bar charts, pie charts, and histograms, written in different languages such as R, Python, and Julia. Visual presentations can help our audience understand our data better. For many complex concepts or theories, we could use visual presentations to help explain their logic and complexity. A typical example is the so-called bisection method or bisection search.
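Since the chapter names the bisection method as its running example, here is a minimal sketch of bisection search in Python (the function and interval are illustrative choices, not the book's):

```python
def bisect_root(f, lo, hi, tol=1e-9):
    """Find a root of f in [lo, hi] by repeatedly halving the bracket.

    Assumes f(lo) and f(hi) have opposite signs, so a root lies between.
    """
    if f(lo) * f(hi) > 0:
        raise ValueError("f(lo) and f(hi) must bracket a root")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        # Keep whichever half-interval still brackets a sign change.
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# The root of x^2 - 2 on [0, 2] is sqrt(2).
root = bisect_root(lambda x: x * x - 2, 0.0, 2.0)
print(root)
```

Plotting the shrinking bracket at each iteration (e.g., with matplotlib) is exactly the kind of visual aid the chapter argues for.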

Chapter 5, Statistical Modeling in Anaconda, explains many important issues related to statistics, such as the t-distribution, F-distribution, t-test, and F-test. We also discuss linear regression, how to deal with missing data, how to treat outliers, collinearity and its treatments, and how to run a multivariable linear regression.
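As a small sketch of the multivariable regression the chapter discusses, the following fits an ordinary-least-squares model with NumPy. The data here is synthetic and noise-free (my own construction, not the book's), so OLS should recover the coefficients exactly:

```python
import numpy as np

# Synthetic data generated from y = 1 + 2*x1 + 3*x2 with no noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Stack an intercept column with the regressors and solve OLS.
X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [1, 2, 3]
```

In practice, packages such as statsmodels (Python) or `lm()` (R) add the t-tests, F-tests, and diagnostics the chapter covers on top of this basic fit.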

Chapter 6, Managing Packages, explains the importance of managing packages, how to find out all packages available for R, Python, and Julia, and how to find the manual for each package. In addition, we discuss the issue of package dependency and how to make our programming a little easier when dealing with packages.
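A small Python-side illustration of package introspection: the standard-library `importlib.metadata` module reports the installed version of a distribution. `pip` is queried here only because it is almost always present; this is a generic sketch, not a command from the book:

```python
from importlib import metadata

# Look up the installed version of a distribution by name.
version = metadata.version("pip")
print("pip", version)
```

Conda performs the analogous lookups (plus dependency resolution) across languages from the command line.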

Chapter 7, Optimization in Anaconda, discusses several optimization topics, including general optimization problems, expressing various kinds of optimization problems as LPPs (linear programming problems), and quadratic optimization. Several examples are offered to make our discussion more practice-oriented, such as how to choose an optimal stock portfolio, how to optimize wealth and resources to promote sustainable development, and how much the government should really tax people. In addition, we introduce several packages for optimization in R, Python, Julia, and Octave.
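To show what expressing a problem as an LPP looks like in code, here is a classic textbook linear program solved with SciPy (the numbers are a standard example, not one of the book's case studies). `linprog` minimizes, so the objective is negated to maximize:

```python
from scipy.optimize import linprog

# Maximize 3x + 5y subject to:
#   x <= 4,  2y <= 12,  3x + 2y <= 18,  x >= 0, y >= 0.
# linprog minimizes, so pass the negated objective.
res = linprog(
    c=[-3, -5],
    A_ub=[[1, 0], [0, 2], [3, 2]],
    b_ub=[4, 12, 18],
    bounds=[(0, None), (0, None)],
)
print(res.x, -res.fun)  # optimum at x=2, y=6 with value 36
```

Portfolio selection fits the same mold once expected returns and constraints are written as the objective vector and constraint matrices.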

Chapter 8, Unsupervised Learning in Anaconda, covers unsupervised learning. In particular, hierarchical clustering and k-means clustering are covered. For R and Python, several related packages are looked at in detail; for R: rattle, Rmixmod, and randomUniformForest; for Python: scipy.cluster, contrastive, and sklearn.
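To make the k-means idea concrete, here is a bare-bones NumPy implementation of the two alternating steps (assign, then update centroids). It is a didactic sketch with a toy two-blob dataset of my own; the packages named above provide production versions:

```python
import numpy as np

def kmeans(X, centers, n_iter=20):
    """Plain k-means: alternate nearest-centre assignment and centroid update."""
    for _ in range(n_iter):
        # Distance from every point to every centre, then assign.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each centre to the mean of the points assigned to it.
        centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, centers

# Two well-separated blobs; seed the centres with one point from each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels, centers = kmeans(X, centers=X[[0, 3]].copy())
print(labels)
```

`sklearn.cluster.KMeans` adds smarter initialization (k-means++) and empty-cluster handling on top of this loop.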

Chapter 9, Supervised Learning in Anaconda, discusses supervised learning, including classification, the k-nearest neighbors algorithm, Bayesian classifiers, reinforcement learning, and specific R and Python modules, such as RTextTools and sklearn. In addition, you will see their implementation in R, Python, Julia, and Octave.
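The k-nearest neighbors algorithm is simple enough to sketch directly: classify a point by majority vote among its k closest training examples. The toy one-dimensional dataset below is my own illustration, not from the book:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two classes clustered around 0 and around 5.
X_train = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred_a = knn_predict(X_train, y_train, np.array([0.3]))
pred_b = knn_predict(X_train, y_train, np.array([5.1]))
print(pred_a, pred_b)
```

`sklearn.neighbors.KNeighborsClassifier` implements the same idea with efficient neighbor search structures.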

Chapter 10, Predictive Data Analytics – Modelling and Validation, covers predictive data analytics, modeling, and validation; some useful datasets; time-series analytics; how to predict future events; seasonality; and how to visualize our data. We mention prsklearn and catwalk for Python; datarobot, LiblineaR, and eclust for R; QuantEcon for Julia; and ltfat for Octave.
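As a minimal stand-in for the chapter's prediction theme, the following fits a linear trend to a short series and extrapolates one step ahead. The series is hypothetical and exactly linear, so the forecast lands on the trend line:

```python
import numpy as np

# Hypothetical series with a clean linear trend.
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
t = np.arange(len(y))

# Fit y = slope * t + intercept, then extrapolate to the next period.
slope, intercept = np.polyfit(t, y, 1)
forecast = intercept + slope * len(y)
print(forecast)
```

Real time-series work adds the validation, seasonality handling, and diagnostics the chapter discusses before trusting such an extrapolation.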

Chapter 11, Anaconda Cloud, discusses Anaconda Cloud. Some topics include the Jupyter Notebook in depth, different formats of Jupyter Notebooks, how to share notebooks with your partners, how to share different projects over different platforms, how to share your working environments, and how to replicate others' environments locally.

Chapter 12, Distributed Computing, Parallel Computing, and HPCC, covers distributed computing and Anaconda Accelerate. When our data or tasks become more complex, we need a good system or a set of tools to process data and run complex algorithms. For this purpose, distributed computing is one solution. In particular, we will explain compute nodes, project add-ons, parallel processing, and advanced Python for data parallelism.
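A tiny sketch of the data-parallelism idea: split independent work items across a pool of workers and collect the results. Threads are used here only to keep the example portable; CPU-bound work would normally use a process pool or a distributed framework instead, and the work function below is a trivial placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def simulate(seed):
    """Placeholder for an expensive per-chunk computation."""
    return seed * seed

# Map the work function over all inputs in parallel and gather results.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(simulate, range(8)))
print(results)
```

The appeal of this pattern is that the inputs are independent, so adding workers (or compute nodes) scales the throughput without changing the program's structure.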