Large Scale Machine Learning with Python

By: Bastiaan Sjardin, Alberto Boschetti

Overview of this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. Finding the right algorithms, and designing and building platforms that can deal with large volumes of data, is a growing need: data scientists have to manage and maintain increasingly complex data projects, and the rise of big data brings an ever-greater demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be run on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework, in Hadoop and Spark, using Python.
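Central to that scalability story is out-of-core (streaming) learning, which many of the index entries below point to. As a taste of the pattern, here is a minimal sketch built on Scikit-learn's partial_fit; the file name, chunk size, and column layout are illustrative assumptions, not taken from the book:

    # Out-of-core learning sketch: stream a large CSV in fixed-size chunks and
    # update a linear model incrementally via SGD, so memory use stays flat
    # regardless of file size. File name and layout are placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()         # linear SVM (hinge loss) fitted by SGD
    classes = np.array([0, 1])      # all class labels must be declared up front

    for chunk in pd.read_csv('large_dataset.csv', chunksize=10000):
        X = chunk.iloc[:, :-1].values   # features: every column but the last
        y = chunk.iloc[:, -1].values    # target: the last column
        model.partial_fit(X, y, classes=classes)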
Table of Contents (17 chapters)
Large Scale Machine Learning with Python
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Preface
Index

Index

A

  • accumulators write-only variables
    • sharing, across cluster nodes / Accumulators write-only variables
  • AdaBoost
    • about / CART and boosting
  • Adam
    • URL / Machine learning on TensorFlow with SkFlow
    • about / Machine learning on TensorFlow with SkFlow
  • adaptive gradient (ADAGRAD)
    • about / The neural network architecture
  • additive expansion
    • about / Gradient Boosting Machines
  • AlexNet example
    • URL / CNN's with an incremental approach
  • Anaconda
    • about / Scientific distributions
    • URL / Scientific distributions
    • URL, for packages / Scientific distributions
  • architecture, neural network
    • input layer / The input layer
    • hidden layer / The hidden layer
    • output layer / The output layer
  • area under the curve (AUC) / Describing the target
  • autoencoders
    • unsupervised learning / Autoencoders and unsupervised learning
    • about / Autoencoders
    • deep learning, with stacked denoising autoencoders / Autoencoders
  • Averaged Stochastic Gradient Descent (ASGD) / Achieving SVM at scale with SGD

B

  • backpropagation
    • about / The neural network architecture
    • common problems / The neural network architecture
    • with mini batch / The neural network architecture
  • batch normalization function
    • URL / GPU Computing
  • bike sharing dataset
    • about / The bike-sharing dataset
    • URL / The bike-sharing dataset
  • BLAS
    • URL / GPU computing
  • Boltzmann machines
    • about / Autoencoders and unsupervised learning
  • boosting
    • about / CART and boosting
  • bootstrap aggregation (bagging)
    • about / Bootstrap aggregation
  • Boston datasets
    • URL / Understanding the Scikit-learn SVM implementation
  • broadcast and accumulators variables
    • sharing, across cluster nodes / Broadcast and accumulators together – an example
  • broadcast read-only variables
    • sharing, across cluster nodes / Broadcast read-only variables

C

  • CART (Classification and Regression Trees)
    • about / CART and boosting, GPU computing
    • with H2O / Out-of-core CART with H2O
  • cells / Introducing Jupyter/IPython
  • click-through rate (CTR)
    • about / Making large scale examples
  • climate
    • about / Other useful packages to install on your system
  • clustering
    • K-means / Clustering – K-means
  • cluster nodes
    • variables, sharing across / Sharing variables across cluster nodes
    • broadcast read-only variables, sharing across / Broadcast read-only variables
    • accumulators write-only variables, sharing across / Accumulators write-only variables
    • broadcast and accumulators variables, sharing across / Broadcast and accumulators together – an example
  • completeness
    • about / Selection of the best K
  • conda
    • about / Scientific distributions
  • ConvNets
    • about / Convolutional Neural Networks in TensorFlow through Keras
  • Convolutional Neural Networks (CNN)
    • about / Convolutional Neural Networks in TensorFlow through Keras
    • in TensorFlow, through Keras / Convolutional Neural Networks in TensorFlow through Keras
    • convolution layer / The convolution layer
    • pooling layer / The pooling layer
    • fully connected layer / The fully connected layer
    • applying / CNN's with an incremental approach
    • computing, with GPU / GPU Computing
  • convolution layer
    • about / The convolution layer
  • covertype dataset
    • about / The covertype dataset
    • URL / The covertype dataset
  • CUDA
    • about / Scale up with Python, Theano
    • reference link / Theano
  • CUDA Toolkit
    • URL / Theano
  • Cygwin OpenSSH
    • URL / Using the VM

D

  • data
    • streaming, from resources / Streaming data from sources
  • data, streaming from resources
    • about / Streaming data from sources
    • datasets, experimenting with / Datasets to try the real thing yourself
    • bike-sharing dataset, streaming / The first example – streaming the bike-sharing dataset
    • pandas I/O tools, using / Using pandas I/O tools
    • databases, working with / Working with databases
    • ordering of instances, warning / Paying attention to the ordering of instances
  • data preprocessing, in Spark
    • about / Data preprocessing in Spark
    • JSON files, importing / JSON files and Spark DataFrames
    • Spark DataFrames / JSON files and Spark DataFrames
    • dealing, with missing data / Dealing with missing data
    • tables in-memory, creating / Grouping and creating tables in-memory
    • tables in-memory, grouping / Grouping and creating tables in-memory
    • preprocessed DataFrame, writing to disk / Writing the preprocessed DataFrame or RDD to disk
    • RDD, writing to disk / Writing the preprocessed DataFrame or RDD to disk
  • datasets
    • reference link / Datasets to try the real thing yourself
    • Buzz in social media dataset, reference link / Datasets to try the real thing yourself
    • Census-Income (KDD) dataset, reference link / Datasets to try the real thing yourself
    • KDD Cup 1999 dataset, reference link / Datasets to try the real thing yourself
    • Bike-sharing dataset, reference link / Datasets to try the real thing yourself
    • BlogFeedback dataset, reference link / Datasets to try the real thing yourself
    • Covertype dataset, reference link / Datasets to try the real thing yourself
    • using / Datasets to experiment with on your own
    • bike sharing dataset / The bike-sharing dataset
    • covertype dataset / The covertype dataset
  • data streams
    • used, for feature management / Feature management with data streams
  • decision boundaries
    • about / Neural networks and decision boundaries
  • Deep Belief Networks (DBN)
    • about / Autoencoders and unsupervised learning
  • deep learning
    • scaling, with H2O / Deep learning at scale with H2O
    • unsupervised pretraining / Deep learning and unsupervised pretraining
  • denoising autoencoders
    • about / Autoencoders and unsupervised learning, Autoencoders
  • Directed Acyclic Graph (DAG)
    • about / pySpark, Working with Spark DataFrames
  • distributed filesystem (dfs)
    • about / HDFS
  • distributed framework
    • need for / Why do we need a distributed framework?

E

  • Elastic Compute Cloud (EC2)
    • about / Out-of-core learning
  • Elbow method
    • about / Selection of the best K
  • error correcting tournament (ECT)
    • about / The covertype dataset crunched by VW
  • expansion
    • about / The hidden layer
  • expectation (E)
    • about / Clustering – K-means
  • expectation-maximization (EM) algorithm
    • about / Clustering – K-means
  • explicit high-dimensional mappings
    • about / Trying explicit high-dimensional mappings
  • Exploratory Data Analysis (EDA)
    • about / Unsupervised methods
  • extreme gradient boosting (XGBoost)
    • about / CART and boosting, XGBoost
    • reference link / XGBoost
    • regression / XGBoost regression
    • variable importance, plotting / XGBoost and variable importance
    • large datasets, streaming / XGBoost streaming large datasets
    • model persistence / XGBoost model persistence
  • extreme gradient boosting (XGBoost), parameters
    • eta / XGBoost
    • min_child_weight / XGBoost
    • max_depth / XGBoost
    • subsample / XGBoost
    • colsample_bytree / XGBoost
    • lambda / XGBoost
    • seed / XGBoost
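These parameters are passed to XGBoost's native training API as a plain dictionary; a minimal sketch with illustrative values (not the book's settings):

    # Illustrative XGBoost parameter dictionary using the keys indexed above.
    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 5)              # toy data standing in for a real set
    y = (X[:, 0] > 0.5).astype(int)
    dtrain = xgb.DMatrix(X, label=y)

    params = {'objective': 'binary:logistic',
              'eta': 0.3,                   # shrinkage applied to each new tree
              'min_child_weight': 1,        # minimum summed instance weight in a child
              'max_depth': 6,               # maximum depth of each tree
              'subsample': 0.8,             # fraction of rows sampled per tree
              'colsample_bytree': 0.8,      # fraction of columns sampled per tree
              'lambda': 1,                  # L2 regularization on leaf weights
              'seed': 0}                    # random seed, for reproducibility
    booster = xgb.train(params, dtrain, num_boost_round=10)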
  • extremely randomized forest
    • about / Random forest and extremely randomized forest
    • URL / Random forest and extremely randomized forest
  • extremely randomized trees
    • for randomized search / Extremely randomized trees and large datasets

F

  • fast parameter optimization
    • with randomized search / Fast parameter optimization with randomized search
  • feature decomposition
    • principal component analysis (PCA) / Feature decomposition – PCA
  • feature management, with data streams
    • about / Feature management with data streams
    • target, describing / Describing the target
    • hashing trick / The hashing trick
    • basis transformations / Other basic transformations
    • testing / Testing and validation in a stream
    • validation / Testing and validation in a stream
    • SGD, using / Trying SGD in action
  • feature selection
    • by regularization / Feature selection by regularization
  • feedforward neural network
    • about / The neural network architecture
  • forward propagation
    • about / The neural network architecture
  • fully connected layer
    • about / The fully connected layer

G

  • Gensim
    • about / Gensim, Scaling LDA – memory, CPUs, and machines
    • URL / Gensim
  • Gensim package
    • URL / LDA
  • get-pip.py script
    • URL / The installation of packages
    • URL, for setup tool / The installation of packages
  • Git for Windows
    • URL / XGBoost
    • about / XGBoost
  • Gource
    • URL / Nonlinear and faster with Vowpal Wabbit
  • GPU
    • neural network, with theanets / Deep learning with theanets
    • computing / GPU computing, GPU computing
    • using, for convolutional neural networks (CNN) / GPU Computing
    • reference link, for computing / GPU computing
    • parallel computing, with Theano / Theano – parallel computing on the GPU
  • gradient boosting machine (GBM)
    • about / CART and boosting, Gradient Boosting Machines
    • max_depth / max_depth
    • learning_rate / learning_rate
    • subsample / Subsample
    • warm_start / Faster GBM with warm_start
    • speeding up, with warm_start / Speeding up GBM with warm_start
    • GBM models, training / Training and storing GBM models
    • GBM models, storing / Training and storing GBM models
  • graphical user interface (GUI)
    • about / VirtualBox
  • gridsearch
    • on H2O / Gridsearch on H2O, Stochastic gradient boosting and gridsearch on H2O
  • guests
    • about / VirtualBox

H

  • H2O
    • about / Scale up with Python, H2O
    • URL / H2O
    • deep learning, scaling / Deep learning at scale with H2O
    • large scale deep learning / Large scale deep learning with H2O
    • gridsearch / Gridsearch on H2O, Random forest and gridsearch on H2O, Stochastic gradient boosting and gridsearch on H2O
    • CART / Out-of-core CART with H2O
    • random forest / Random forest and gridsearch on H2O
    • stochastic gradient boosting / Stochastic gradient boosting and gridsearch on H2O
    • principal component analysis (PCA) / PCA with H2O
    • K-means / K-means with H2O
  • Hadoop
    • ecosystem / The Hadoop ecosystem
    • architecture / Architecture
    • Distributed File System (HDFS) / HDFS
    • MapReduce / MapReduce
    • Yet Another Resource Negotiator (YARN) / YARN
  • Hadoop Distributed File System (HDFS)
    • about / Explaining scalability in detail, HDFS
  • hard-margin classifiers
    • about / Support Vector Machines
  • Hierarchical Dirichlet Process (HDP)
    • about / Scaling LDA – memory, CPUs, and machines
  • hinge loss
    • about / Hinge loss and its variants
    • variants / Hinge loss and its variants
  • homogeneity
    • about / Selection of the best K
  • host
    • about / VirtualBox
  • hyperparameter optimization
    • about / Neural networks and hyperparameter optimization
  • hyperparameters
    • tuning / Hyperparameter tuning

I

  • identity function
    • about / Autoencoders
  • ImportError
    • about / The installation of packages
  • incremental learning
    • about / Deep learning with large files – incremental learning
  • incremental PCA
    • about / Incremental PCA
  • independent and identically distributed (i.i.d) / Stochastic gradient descent
  • input validator
    • URL / Understanding the VW data format
  • IPython
    • about / Introducing Jupyter/IPython
    • URL / Introducing Jupyter/IPython
  • Iris datasets
    • URL / Understanding the Scikit-learn SVM implementation

J

  • Java Development Kit (JDK)
    • about / H2O
  • Jupyter
    • about / Introducing Jupyter/IPython
    • URL, for example / Introducing Jupyter/IPython
    • URL, for installing / Introducing Jupyter/IPython
    • URL / Introducing Jupyter/IPython
  • Jupyter Notebook Viewer
    • URL / Introducing Jupyter/IPython
  • Just-in-Time (JIT) compiler
    • about / Scale up with Python

K

  • K-means
    • about / Clustering – K-means
    • initialization methods / Initialization methods
    • assumptions / K-means assumptions
    • selecting / Selection of the best K
    • scaling / Scaling K-means – mini-batch
    • with H2O / K-means with H2O
  • Kaggle
    • URL / XGBoost
  • KDD99 challenge
    • reference / Spark on the KDD99 dataset
  • Keras
    • about / Keras
    • URL / Keras, Keras and TensorFlow installation
    • installing / Keras and TensorFlow installation
    • convolutional neural networks (CNN), in TensorFlow / Convolutional Neural Networks in TensorFlow through Keras
  • KSVM
    • about / A few examples using reductions for SVM and neural nets
    • reference link / A few examples using reductions for SVM and neural nets

L

  • large scale cloud services
    • URL / Deep learning with large files – incremental learning
  • large scale deep learning
    • with H2O / Large scale deep learning with H2O
  • large scale machine learning
    • Python / Python for large scale machine learning
  • LaSVM
    • URL / Other alternatives for SVM fast learning
    • about / Other alternatives for SVM fast learning
  • Latent Dirichlet Allocation (LDA)
    • about / Gensim, LDA, Nonlinear and faster with Vowpal Wabbit
    • scaling / Scaling LDA – memory, CPUs, and machines
  • Latent Semantic Analysis (LSA)
    • about / Gensim
  • liblinear-cdblock library
    • URL / Other alternatives for SVM fast learning
  • Library for Support Vector Machines (LIBSVM)
    • about / Scale up with Python, Understanding the Scikit-learn SVM implementation
    • URL / Understanding the Scikit-learn SVM implementation
  • linear regression
    • with SGD / Linear regression with SGD
  • Linux precompiled binaries
    • URL / Installing VW

M

  • machine learning
    • on TensorFlow, with SkFlow / Machine learning on TensorFlow with SkFlow
    • incremental learning / Deep learning with large files – incremental learning
  • machine learning, with Spark
    • about / Machine learning with Spark
    • dataset, reading / Reading the dataset
    • feature engineering / Feature engineering
    • training, giving to learner / Training a learner
    • learner's performance, evaluating / Evaluating a learner's performance
    • ML pipeline, power / The power of the ML pipeline
    • manual tuning / Manual tuning
    • cross-validation / Cross-validation, Final cleanup
  • MapReduce
    • about / Explaining scalability in detail, MapReduce
    • data chunker / MapReduce
    • mapper / MapReduce
    • shuffler / MapReduce
    • reducer / MapReduce
    • output writer / MapReduce
  • matplotlib package
    • about / The matplotlib package
    • URL / The matplotlib package
  • maximization (M)
    • about / Clustering – K-means
  • memory profiler
    • about / Other useful packages to install on your system
  • mini batches
    • about / The neural network architecture
  • Minimalist GNU for Windows (MinGW) compiler
    • about / XGBoost
    • URL / XGBoost
  • MLlib / Machine learning with Spark
  • momentum
    • training / The neural network architecture
  • MrJob
    • about / MapReduce
  • music recommendation engine
    • references / GPU Computing

N

  • Nesterov momentum
    • about / The neural network architecture
  • neural network
    • architecture / The neural network architecture
    • softmax, for classification / The neural network architecture
    • forward propagation / The neural network architecture
    • backpropagation / The neural network architecture
    • backpropagation, common problems / The neural network architecture
    • backpropagation, with mini batch / The neural network architecture
    • momentum, training / The neural network architecture
    • Nesterov momentum / The neural network architecture
    • adaptive gradient (ADAGRAD) / The neural network architecture
    • resilient backpropagation (RPROP) / The neural network architecture
    • RMSPROP / The neural network architecture
    • architecture, selecting / What and how neural networks learn, Choosing the right architecture
    • implementing / Neural networks in action
    • sknn, parallelizing / Parallelization for sknn
    • hyperparameter, optimizing / Neural networks and hyperparameter optimization
    • decision boundaries / Neural networks and decision boundaries
    • on GPU, with theanets / Deep learning with theanets
    • performing, in TensorFlow / A neural network from scratch in TensorFlow
    • regularization / Neural networks and regularization
  • Neural Network Toolbox (NNT)
    • about / Other useful packages to install on your system
  • NeuroLab
    • about / Other useful packages to install on your system
  • nonlinear SVMs
    • pursuing, by subsampling / Pursuing nonlinear SVMs by subsampling
  • NumPy
    • about / Scale up with Python, NumPy
    • URL / NumPy
  • NVIDIA
    • URL / GPU computing
  • NVIDIA CUDA Toolkit
    • URL / GPU computing

O

  • one-against-all (OAA)
    • about / The covertype dataset crunched by VW
  • one-hot encoder
    • reference link / The hashing trick
  • one-vs-all (OVA) / Achieving SVM at scale with SGD, The Scikit-learn SGD implementation
  • OpenCL project
    • URL / GPU computing
  • OpenSSH
    • URL, for Windows / Using the VM
  • Ordinary least squares (OLS) / The Scikit-learn SGD implementation
  • out-of-core learning
    • about / Out-of-core learning
    • subsampling, as viable option / Subsampling as a viable option
    • instances, optimizing / Optimizing one instance at a time
    • building / Building an out-of-core learning system
  • overshooting
    • about / The neural network architecture

P

  • pandas
    • about / Scale up with Python, Pandas
    • URL / Pandas
  • pandas documentation
    • reference link / Using pandas I/O tools
  • pip
    • about / The installation of packages
    • URL / The installation of packages
  • pooling layer
    • about / The pooling layer
  • pretraining
    • about / Autoencoders and unsupervised learning
  • principal component analysis (PCA)
    • about / Machine learning on TensorFlow with SkFlow, Feature decomposition – PCA, Autoencoders
    • reference link / Machine learning on TensorFlow with SkFlow
    • randomized PCA / Randomized PCA
    • incremental PCA / Incremental PCA
    • sparse PCA / Sparse PCA
    • with H2O / PCA with H2O
  • PuTTY
    • URL / Using the VM
  • PyPy
    • about / Scale up with Python
    • URL / Scale up with Python
  • pySpark
    • about / pySpark
  • pySpark, actions
    • reduce(function) / pySpark
    • count() / pySpark
    • countByKey() / pySpark
    • collect() / pySpark
    • first() / pySpark
    • take(N) / pySpark
    • takeSample(withReplacement, N, seed) / pySpark
    • takeOrdered(N, ordering) / pySpark
    • saveAsTextFile(path) / pySpark
  • pySpark, methods
    • cache() / pySpark
    • persist(storage) / pySpark
    • unpersist() / pySpark
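The pySpark actions and methods indexed above are easiest to grasp in combination; a minimal sketch on a toy RDD, assuming a local pyspark installation:

    # A toy RDD exercising several of the indexed pySpark calls.
    from pyspark import SparkContext

    sc = SparkContext('local', 'index-sketch')
    rdd = sc.parallelize(range(10)).cache()    # method: keep the RDD in memory

    print(rdd.count())                         # action: 10
    print(rdd.first())                         # action: 0
    print(rdd.take(3))                         # action: [0, 1, 2]
    print(rdd.reduce(lambda a, b: a + b))      # action: 45
    print(rdd.takeOrdered(3, lambda x: -x))    # action: [9, 8, 7]
    sc.stop()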
  • Python
    • about / Introducing Python
    • advantages / Introducing Python
    • URL / Introducing Python, Step-by-step installation
    • scaling up, with / Scale up with Python
    • scaling out, with / Scale out with Python
    • large scale machine learning / Python for large scale machine learning
    • 2 and Python 3, selecting between / Choosing between Python 2 and Python 3
    • installing / Installing Python, Step-by-step installation
    • packages, installing / The installation of packages
    • packages, upgrading / Package upgrades
    • scientific distribution / Scientific distributions
    • Jupyter / Introducing Jupyter/IPython
    • IPython / Introducing Jupyter/IPython
    • reference link / Describing the target
    • integrating, with Vowpal Wabbit (VW) / Python integration
  • Python-Future
    • URL / Choosing between Python 2 and Python 3
  • Python 2
    • and Python 3, selecting between / Choosing between Python 2 and Python 3
  • Python 2-3 compatible code
    • URL / Choosing between Python 2 and Python 3
  • Python 3
    • and Python 2, selecting between / Choosing between Python 2 and Python 3
    • URL, for compatibility / Choosing between Python 2 and Python 3
  • Python Package Index (PyPI)
    • about / The installation of packages
    • URL / The installation of packages
  • Python packages
    • about / Python packages
    • NumPy / NumPy
    • SciPy / SciPy
    • pandas / Pandas
    • Scikit-learn / Scikit-learn
  • pyvw / Python integration

Q

  • quadratic programming
    • URL / Support Vector Machines

R

  • radial basis functions (RBF) / Support Vector Machines
  • random forest
    • about / Random forest and extremely randomized forest
  • random forest, parameters for bagging
    • n_estimators / Random forest and extremely randomized forest
    • max_features / Random forest and extremely randomized forest
    • min_samples_leaf / Random forest and extremely randomized forest
    • max_depth / Random forest and extremely randomized forest
    • criterion / Random forest and extremely randomized forest
    • min_samples_split / Random forest and extremely randomized forest
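A sketch of where these bagging parameters plug into Scikit-learn's RandomForestClassifier; the values are placeholders, not the book's recommendations:

    # Illustrative random forest setup using the parameters indexed above.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    clf = RandomForestClassifier(n_estimators=100,      # number of trees
                                 max_features='sqrt',   # features tried per split
                                 min_samples_leaf=1,    # minimum samples in a leaf
                                 max_depth=None,        # grow each tree fully
                                 criterion='gini',      # split-quality measure
                                 min_samples_split=2)   # minimum samples to split a node
    clf.fit(X, y)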
  • randomized PCA
    • about / Randomized PCA
  • randomized search
    • about / Neural networks and hyperparameter optimization, Fast parameter optimization with randomized search
    • fast parameter optimization / Fast parameter optimization with randomized search
    • extremely randomized trees / Extremely randomized trees and large datasets
    • large datasets / Extremely randomized trees and large datasets
  • random subspace method
    • about / Stochastic gradient boosting and gridsearch on H2O
  • Read-Eval-Print Loop (REPL)
    • about / Step-by-step installation
  • receiver operating characteristic (ROC) / Describing the target
  • reconstruction error
    • about / Autoencoders
  • rectified linear unit (ReLU)
    • about / The neural network architecture
  • regularization
    • feature selection / Feature selection by regularization
    • about / Neural networks and regularization
  • residual network (ResNet)
    • about / GPU Computing
  • resilient backpropagation (RPROP)
    • about / The neural network architecture
  • Resilient Distributed Dataset (RDD)
    • about / pySpark
  • RMSPROP
    • about / The neural network architecture

S

  • scalability
    • about / Explaining scalability in detail
    • large scale examples, creating / Making large scale examples
    • Python / Introducing Python
    • Python, scaling up with / Scale up with Python
    • Python, scaling out with / Scale out with Python
  • scientific distribution
    • about / Scientific distributions
  • Scikit-learn
    • about / The installation of packages, Scikit-learn, Understanding the Scikit-learn SVM implementation
    • URL / Scikit-learn
    • matplotlib package / The matplotlib package
    • Gensim / Gensim
    • H2O / H2O
    • XGBoost / XGBoost
    • Theano / Theano
    • TensorFlow / TensorFlow
    • sknn library / The sknn library
    • theanets / Theanets
    • Keras / Keras
    • useful packages, installing / Other useful packages to install on your system
    • other alternatives / Other alternatives for SVM fast learning
    • Vowpal Wabbit (VW) / Nonlinear and faster with Vowpal Wabbit
  • Scikit-learn documentation
    • reference link / Describing the target
  • scikit-neuralnetwork
    • about / The sknn library
  • SciPy
    • about / SciPy
    • URL / SciPy
  • SGD
    • used, for achieving SVM / Achieving SVM at scale with SGD
    • non-linearity, including / Including non-linearity in SGD
    • explicit high-dimensional mappings / Trying explicit high-dimensional mappings
    • linear regression / Linear regression with SGD
  • SGD function, parameters
    • lr / Keras and TensorFlow installation
    • decay / Keras and TensorFlow installation
    • momentum / Keras and TensorFlow installation
    • nesterov / Keras and TensorFlow installation
    • optimizer / Keras and TensorFlow installation
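These are arguments of the Keras SGD optimizer (argument names as in the Keras versions contemporary with the book); a minimal sketch with illustrative values:

    # Illustrative Keras SGD optimizer wired into a tiny model.
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Dense(1, input_dim=4, activation='sigmoid'))

    sgd = SGD(lr=0.01,          # learning rate
              decay=1e-6,       # learning-rate decay per update
              momentum=0.9,     # classical momentum term
              nesterov=True)    # switch to Nesterov momentum
    model.compile(optimizer=sgd, loss='binary_crossentropy')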
  • SGD Scikit-learn implementation
    • reference link / Defining SGD learning parameters
  • sigmoid
    • about / The neural network architecture
  • Silhouette
    • about / Selection of the best K
  • Singular Value Decomposition (SVD)
    • about / Feature decomposition – PCA
  • SkFlow
    • machine learning, on TensorFlow / Machine learning on TensorFlow with SkFlow
  • sklearn.svm module, parameters
    • C / Understanding the Scikit-learn SVM implementation
    • kernel / Understanding the Scikit-learn SVM implementation
    • degree / Understanding the Scikit-learn SVM implementation
    • gamma / Understanding the Scikit-learn SVM implementation
    • nu / Understanding the Scikit-learn SVM implementation
    • epsilon / Understanding the Scikit-learn SVM implementation
    • penalty / Understanding the Scikit-learn SVM implementation
    • loss / Understanding the Scikit-learn SVM implementation
    • dual / Understanding the Scikit-learn SVM implementation
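These parameters are spread across the sklearn.svm estimators; a sketch with illustrative values showing where each one belongs:

    # Where each indexed sklearn.svm parameter lives, with placeholder values.
    from sklearn.svm import SVC, NuSVC, SVR, LinearSVC

    svc = SVC(C=1.0,              # penalty for margin violations
              kernel='rbf',       # kernel type: 'linear', 'poly', 'rbf', ...
              degree=3,           # polynomial degree, used by the 'poly' kernel
              gamma=0.1)          # kernel coefficient for 'rbf'/'poly'/'sigmoid'

    nusvc = NuSVC(nu=0.5)         # upper bound on the fraction of margin errors

    svr = SVR(epsilon=0.1)        # width of the epsilon-insensitive tube

    linear = LinearSVC(penalty='l2',           # regularization norm
                       loss='squared_hinge',   # hinge-loss variant
                       dual=True)              # solve the dual formulation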
  • sklearn module
    • about / The installation of packages
  • sknn
    • parallelizing / Parallelization for sknn
  • sknn library
    • about / The sknn library
    • URL / The sknn library
  • sknn package
    • URL / Neural networks in action
  • SofiaML
    • about / Other alternatives for SVM fast learning
    • URL / Other alternatives for SVM fast learning
  • softmax
    • for classification / The neural network architecture
  • Spark
    • about / Spark, Machine learning with Spark
    • pySpark / pySpark
    • on KDD99 dataset / Spark on the KDD99 dataset
  • Spark, methods
    • map(function) / pySpark
    • flatMap(function) / pySpark
    • filter(function) / pySpark
    • sample(withReplacement, fraction, seed) / pySpark
    • distinct() / pySpark
    • coalesce(numPartitions) / pySpark
    • repartition(numPartitions) / pySpark
    • groupByKey() / pySpark
    • reduceByKey(function) / pySpark
    • sortByKey(ascending) / pySpark
    • union(otherRDD) / pySpark
    • intersection(otherRDD) / pySpark
    • join(otherRDD) / pySpark
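A companion sketch for these transformations, on toy key/value RDDs, again assuming a local pyspark installation:

    # Toy key/value RDDs exercising a few of the indexed transformations.
    from pyspark import SparkContext

    sc = SparkContext('local', 'transformations-sketch')
    sales = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
    names = sc.parallelize([('a', 'alpha'), ('b', 'beta')])

    print(sales.reduceByKey(lambda x, y: x + y).collect())   # sums per key: a -> 4, b -> 2
    print(sales.sortByKey(True).collect())                   # sorted by key, ascending
    print(sales.join(names).collect())                       # pairs values sharing a key
    print(sales.map(lambda kv: kv[0]).distinct().collect())  # unique keys: 'a', 'b'
    sc.stop()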
  • Spark DataFrames
    • working with / Working with Spark DataFrames
  • sparse autoencoder
    • URL / Autoencoders
  • sparse PCA
    • about / Sparse PCA
  • sparsity parameters
    • about / Autoencoders
  • Spyder
    • about / Scientific distributions
  • SQLite
    • reference link / Working with databases
  • stacked denoising autoencoders
    • deep learning / Autoencoders
  • standalone machine
    • big data, handling / From a standalone machine to a bunch of nodes
    • distributed framework, need for / Why do we need a distributed framework?
  • steepest descent
    • about / Gradient Boosting Machines
  • stochastic gradient boosting
    • about / Stochastic gradient boosting and gridsearch on H2O
    • URL / Stochastic gradient boosting and gridsearch on H2O
  • stochastic gradient descent (SGD) / Stochastic gradient descent
  • stochastic learning
    • about / Stochastic learning
    • batch gradient descent / Batch gradient descent
    • stochastic gradient descent (SGD) / Stochastic gradient descent
    • Scikit-learn SGD implementation / The Scikit-learn SGD implementation
    • SGD learning parameters, defining / Defining SGD learning parameters
  • stream handling
    • reference link / Datasets to try the real thing yourself
  • stride
    • about / The convolution layer
  • subsample
    • URL / Extremely randomized trees and large datasets
  • subsampling
    • nonlinear SVMs, pursuing / Pursuing nonlinear SVMs by subsampling
  • subsampling layer
    • about / The pooling layer
  • Support Vector Machines (SVMs)
    • about / Support Vector Machines, The Scikit-learn SGD implementation
    • hinge loss / Hinge loss and its variants
    • Scikit-learn / Understanding the Scikit-learn SVM implementation
    • nonlinear SVMs, pursuing / Pursuing nonlinear SVMs by subsampling
    • achieving, with SGD / Achieving SVM at scale with SGD

T

  • tanh
    • about / The neural network architecture
  • TDM-GCC x64
    • URL / Theano
  • TensorFlow
    • about / TensorFlow
    • URL / TensorFlow
    • references / TensorFlow
    • installing / TensorFlow installation, Keras and TensorFlow installation
    • operations / TensorFlow operations
    • machine learning, with SkFlow / Machine learning on TensorFlow with SkFlow
    • convolutional neural networks (CNN), through Keras / Convolutional Neural Networks in TensorFlow through Keras
  • TensorFlow, operations
    • GPU, computing / GPU computing
    • linear regression, with SGD / Linear regression with SGD
    • neural network, performing / A neural network from scratch in TensorFlow
  • theanets
    • about / Theanets, Deep learning with theanets
    • URL / Theanets, Deep learning with theanets
    • neural network, on GPU / Deep learning with theanets
  • Theano
    • about / Theano
    • URL / Theano, Theano – parallel computing on the GPU, Installing Theano
    • URL, for installing / Theano
    • used, for parallel computing on GPU / Theano – parallel computing on the GPU
    • installing / Installing Theano
  • training_classification_error (.089)
    • about / Large scale deep learning with H2O

U

  • UCI Machine Learning Repository / Datasets to try the real thing yourself
  • Uniform Resource Identifier (URI)
    • about / HDFS
  • universal approximation theorem
    • about / What and how neural networks learn
  • University of California, Irvine (UCI) / Datasets to try the real thing yourself
  • unsupervised learning
    • autoencoders / Autoencoders and unsupervised learning
  • unsupervised methods
    • about / Unsupervised methods
  • unsupervised pretraining
    • about / Deep learning and unsupervised pretraining
  • US Census dataset
    • URL / Scaling K-means – mini-batch

V

  • Vagrant
    • about / Vagrant
    • URL / Vagrant
  • validation_classification_error (.0954)
    • about / Large scale deep learning with H2O
  • vanishing gradient problem
    • about / The neural network architecture
  • variables
    • sharing, across cluster nodes / Sharing variables across cluster nodes
  • VirtualBox
    • about / VirtualBox
    • URL / VirtualBox
  • virtual machine
    • setting up / Setting up the VM for this chapter
  • virtual machines (VM)
    • setting up / Setting up the VM
    • VirtualBox / VirtualBox
    • about / VirtualBox
    • Vagrant / Vagrant
    • using / Using the VM
  • Vowpal Wabbit (VW)
    • about / Scale up with Python, Nonlinear and faster with Vowpal Wabbit
    • installing / Installing VW
    • URL, for compiling / Installing VW
    • data format / Understanding the VW data format
    • URL, for dataset / Understanding the VW data format
    • Python, integration / Python integration
    • examples / A few examples using reductions for SVM and neural nets
    • URL, for neural networks / A few examples using reductions for SVM and neural nets
    • faster bike-sharing example / Faster bike-sharing
    • covertype dataset / The covertype dataset crunched by VW
  • vowpal_porpoise / Python integration

W

  • Wabbit Wappa / Python integration
  • Whitening
    • about / Machine learning on TensorFlow with SkFlow
  • wide networks
    • about / The hidden layer
  • WinPython
    • URL / Scientific distributions
    • about / Scientific distributions
  • word2vec
    • about / Gensim

X

  • XGBoost
    • about / Scale up with Python, XGBoost
    • URL / XGBoost
    • URL, for installing / XGBoost
    • URL, for code / XGBoost

Y

  • Yet Another Resource Negotiator (YARN)
    • about / Scale out with Python, YARN

Z

  • zero-padding
    • about / The convolution layer