Large Scale Machine Learning with Python

By: Bastiaan Sjardin, Alberto Boschetti

Overview of this book

Large Python machine learning projects involve new problems associated with specialized machine learning architectures and designs that many data scientists have yet to tackle. Finding the right algorithms, and designing and building platforms that can deal with large volumes of data, is a growing need: data scientists have to manage and maintain increasingly complex data projects, and the rise of big data brings an ever-greater demand for computational and algorithmic efficiency. Large Scale Machine Learning with Python uncovers a new wave of machine learning algorithms that meet scalability demands together with high predictive accuracy. Dive into scalable machine learning and the three forms of scalability. Speed up algorithms that can be run on a desktop computer with tips on parallelization and memory allocation. Get to grips with new algorithms that are specifically designed for large projects and can handle bigger files, and learn about machine learning in big data environments. We will also cover the most effective machine learning techniques on a MapReduce framework, in Hadoop and Spark, using Python.
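Central to that scalability story is out-of-core (streaming) learning, which many of the index entries below point to. As a taste of the pattern, here is a minimal sketch built on Scikit-learn's partial_fit; the file name, chunk size, and column layout are illustrative assumptions, not taken from the book:

    # Out-of-core learning sketch: stream a large CSV in fixed-size chunks and
    # update a linear model incrementally via SGD, so memory use stays flat
    # regardless of file size. File name and layout are placeholders.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()         # linear SVM (hinge loss) fitted by SGD
    classes = np.array([0, 1])      # all class labels must be declared up front

    for chunk in pd.read_csv('large_dataset.csv', chunksize=10000):
        X = chunk.iloc[:, :-1].values   # features: every column but the last
        y = chunk.iloc[:, -1].values    # target: the last column
        model.partial_fit(X, y, classes=classes)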
Table of Contents (17 chapters)
Large Scale Machine Learning with Python
Credits
About the Authors
About the Reviewer
www.PacktPub.com
Preface
Index

Index

A

  • accumulators write-only variables
    • sharing, across cluster nodes / Accumulators write-only variables
  • AdaBoost
    • about / CART and boosting
  • Adam
    • URL / Machine learning on TensorFlow with SkFlow
    • about / Machine learning on TensorFlow with SkFlow
  • adaptive gradient (ADAGRAD)
    • about / The neural network architecture
  • additive expansion
    • about / Gradient Boosting Machines
  • AlexNet example
    • URL / CNN's with an incremental approach
  • Anaconda
    • about / Scientific distributions
    • URL / Scientific distributions
    • URL, for packages / Scientific distributions
  • architecture, neural network
    • input layer / The input layer
    • hidden layer / The hidden layer
    • output layer / The output layer
  • area under the curve (AUC) / Describing the target
  • autoencoders
    • unsupervised learning / Autoencoders and unsupervised learning
    • about / Autoencoders
    • deep learning, with stacked denoising autoencoders / Autoencoders
  • Averaged Stochastic Gradient Descent (ASGD) / Achieving SVM at scale with SGD

B

  • backpropagation
    • about / The neural network architecture
    • common problems / The neural network architecture
    • with mini batch / The neural network architecture
  • batch normalization function
    • URL / GPU Computing
  • bike sharing dataset
    • about / The bike-sharing dataset
    • URL / The bike-sharing dataset
  • BLAS
    • URL / GPU computing
  • Boltzmann machines
    • about / Autoencoders and unsupervised learning
  • boosting
    • about / CART and boosting
  • bootstrap aggregation (bagging)
    • about / Bootstrap aggregation
  • Boston datasets
    • URL / Understanding the Scikit-learn SVM implementation
  • broadcast and accumulators variables
    • sharing, across cluster nodes / Broadcast and accumulators together – an example
  • broadcast read-only variables
    • sharing, across cluster nodes / Broadcast read-only variables

C

  • CART (Classification and Regression Trees)
    • about / CART and boosting, GPU computing
    • with H2O / Out-of-core CART with H2O
  • cells / Introducing Jupyter/IPython
  • click-through rate (CTR)
    • about / Making large scale examples
  • climate
    • about / Other useful packages to install on your system
  • clustering
    • K-means / Clustering – K-means
  • cluster nodes
    • variables, sharing across / Sharing variables across cluster nodes
    • broadcast read-only variables, sharing across / Broadcast read-only variables
    • accumulators write-only variables, sharing across / Accumulators write-only variables
    • broadcast and accumulators variables, sharing across / Broadcast and accumulators together – an example
  • completeness
    • about / Selection of the best K
  • conda
    • about / Scientific distributions
  • ConvNets
    • about / Convolutional Neural Networks in TensorFlow through Keras
  • Convolutional Neural Networks (CNN)
    • about / Convolutional Neural Networks in TensorFlow through Keras
    • in TensorFlow, through Keras / Convolutional Neural Networks in TensorFlow through Keras
    • convolution layer / The convolution layer
    • pooling layer / The pooling layer
    • fully connected layer / The fully connected layer
    • applying / CNN's with an incremental approach
    • computing, with GPU / GPU Computing
  • convolution layer
    • about / The convolution layer
  • covertype dataset
    • about / The covertype dataset
    • URL / The covertype dataset
  • CUDA
    • about / Scale up with Python, Theano
    • reference link / Theano
  • CUDA Toolkit
    • URL / Theano
  • Cygwin OpenSSH
    • URL / Using the VM

D

  • data
    • streaming, from resources / Streaming data from sources
  • data, streaming from resources
    • about / Streaming data from sources
    • datasets, experimenting with / Datasets to try the real thing yourself
    • bike-sharing dataset, streaming / The first example – streaming the bike-sharing dataset
    • pandas I/O tools, using / Using pandas I/O tools
    • databases, working with / Working with databases
    • ordering of instances, warning / Paying attention to the ordering of instances
  • data preprocessing, in Spark
    • about / Data preprocessing in Spark
    • JSON files, importing / JSON files and Spark DataFrames
    • Spark DataFrames / JSON files and Spark DataFrames
    • dealing, with missing data / Dealing with missing data
    • tables in-memory, creating / Grouping and creating tables in-memory
    • tables in-memory, grouping / Grouping and creating tables in-memory
    • preprocessed DataFrame, writing to disk / Writing the preprocessed DataFrame or RDD to disk
    • RDD, writing to disk / Writing the preprocessed DataFrame or RDD to disk
  • datasets
    • reference link / Datasets to try the real thing yourself
    • Buzz in social media dataset, reference link / Datasets to try the real thing yourself
    • Census-Income (KDD) dataset, reference link / Datasets to try the real thing yourself
    • KDD Cup 1999 dataset, reference link / Datasets to try the real thing yourself
    • Bike-sharing dataset, reference link / Datasets to try the real thing yourself
    • BlogFeedback dataset, reference link / Datasets to try the real thing yourself
    • Covertype dataset, reference link / Datasets to try the real thing yourself
    • using / Datasets to experiment with on your own
    • bike sharing dataset / The bike-sharing dataset
    • covertype dataset / The covertype dataset
  • data streams
    • used, for feature management / Feature management with data streams
  • decision boundaries
    • about / Neural networks and decision boundaries
  • Deep Belief Networks (DBN)
    • about / Autoencoders and unsupervised learning
  • deep learning
    • scaling, with H2O / Deep learning at scale with H2O
    • unsupervised pretraining / Deep learning and unsupervised pretraining
  • denoising autoencoders
    • about / Autoencoders and unsupervised learning, Autoencoders
  • Directed Acyclic Graph (DAG)
    • about / pySpark, Working with Spark DataFrames
  • distributed filesystem (dfs)
    • about / HDFS
  • distributed framework
    • need for / Why do we need a distributed framework?

E

  • Elastic Compute Cloud (EC2)
    • about / Out-of-core learning
  • Elbow method
    • about / Selection of the best K
  • error correcting tournament (ECT)
    • about / The covertype dataset crunched by VW
  • expansion
    • about / The hidden layer
  • expectation (E)
    • about / Clustering – K-means
  • expectation-maximization (EM) algorithm
    • about / Clustering – K-means
  • explicit high-dimensional mappings
    • about / Trying explicit high-dimensional mappings
  • Exploratory Data Analysis (EDA)
    • about / Unsupervised methods
  • extreme gradient boosting (XGBoost)
    • about / CART and boosting, XGBoost
    • reference link / XGBoost
    • regression / XGBoost regression
    • variable importance, plotting / XGBoost and variable importance
    • large datasets, streaming / XGBoost streaming large datasets
    • model persistence / XGBoost model persistence
  • extreme gradient boosting (XGBoost), parameters
    • eta / XGBoost
    • min_child_weight / XGBoost
    • max_depth / XGBoost
    • subsample / XGBoost
    • colsample_bytree / XGBoost
    • lambda / XGBoost
    • seed / XGBoost
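These parameters are passed to XGBoost's native training API as a plain dictionary; a minimal sketch with illustrative values (not the book's settings):

    # Illustrative XGBoost parameter dictionary using the keys indexed above.
    import numpy as np
    import xgboost as xgb

    X = np.random.rand(100, 5)              # toy data standing in for a real set
    y = (X[:, 0] > 0.5).astype(int)
    dtrain = xgb.DMatrix(X, label=y)

    params = {'objective': 'binary:logistic',
              'eta': 0.3,                   # shrinkage applied to each new tree
              'min_child_weight': 1,        # minimum summed instance weight in a child
              'max_depth': 6,               # maximum depth of each tree
              'subsample': 0.8,             # fraction of rows sampled per tree
              'colsample_bytree': 0.8,      # fraction of columns sampled per tree
              'lambda': 1,                  # L2 regularization on leaf weights
              'seed': 0}                    # random seed, for reproducibility
    booster = xgb.train(params, dtrain, num_boost_round=10)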
  • extremely randomized forest
    • about / Random forest and extremely randomized forest
    • URL / Random forest and extremely randomized forest
  • extremely randomized trees
    • for randomized search / Extremely randomized trees and large datasets

F

  • fast parameter optimization
    • with randomized search / Fast parameter optimization with randomized search
  • feature decomposition
    • principal component analysis (PCA) / Feature decomposition – PCA
  • feature management, with data streams
    • about / Feature management with data streams
    • target, describing / Describing the target
    • hashing trick / The hashing trick
    • basis transformations / Other basic transformations
    • testing / Testing and validation in a stream
    • validation / Testing and validation in a stream
    • SGD, using / Trying SGD in action
  • feature selection
    • by regularization / Feature selection by regularization
  • feedforward neural network
    • about / The neural network architecture
  • forward propagation
    • about / The neural network architecture
  • fully connected layer
    • about / The fully connected layer

G

  • Gensim
    • about / Gensim, Scaling LDA – memory, CPUs, and machines
    • URL / Gensim
  • Gensim package
    • URL / LDA
  • get-pip.py script
    • URL / The installation of packages
    • URL, for setup tool / The installation of packages
  • Git for Windows
    • URL / XGBoost
    • about / XGBoost
  • Gource
    • URL / Nonlinear and faster with Vowpal Wabbit
  • GPU
    • neural network, with theanets / Deep learning with theanets
    • computing / GPU computing, GPU computing
    • using, for convolutional neural networks (CNN) / GPU Computing
    • reference link, for computing / GPU computing
    • parallel computing, with Theano / Theano – parallel computing on the GPU
  • gradient boosting machine (GBM)
    • about / CART and boosting, Gradient Boosting Machines
    • max_depth / max_depth
    • learning_rate / learning_rate
    • subsample / Subsample
    • warm_start / Faster GBM with warm_start
    • speeding up, with warm_start / Speeding up GBM with warm_start
    • GBM models, training / Training and storing GBM models
    • GBM models, storing / Training and storing GBM models
  • graphical user interface (GUI)
    • about / VirtualBox
  • gridsearch
    • on H2O / Gridsearch on H2O, Stochastic gradient boosting and gridsearch on H2O
  • guests
    • about / VirtualBox

H

  • H2O
    • about / Scale up with Python, H2O
    • URL / H2O
    • deep learning, scaling / Deep learning at scale with H2O
    • large scale deep learning / Large scale deep learning with H2O
    • gridsearch / Gridsearch on H2O, Random forest and gridsearch on H2O, Stochastic gradient boosting and gridsearch on H2O
    • CART / Out-of-core CART with H2O
    • random forest / Random forest and gridsearch on H2O
    • stochastic gradient boosting / Stochastic gradient boosting and gridsearch on H2O
    • principal component analysis (PCA) / PCA with H2O
    • K-means / K-means with H2O
  • Hadoop
    • ecosystem / The Hadoop ecosystem
    • architecture / Architecture
    • Distributed File System (HDFS) / HDFS
    • MapReduce / MapReduce
    • Yet Another Resource Negotiator (YARN) / YARN
  • Hadoop Distributed File System (HDFS)
    • about / Explaining scalability in detail, HDFS
  • hard-margin classifiers
    • about / Support Vector Machines
  • Hierarchical Dirichlet Process (HDP)
    • about / Scaling LDA – memory, CPUs, and machines
  • hinge loss
    • about / Hinge loss and its variants
    • variants / Hinge loss and its variants
  • homogeneity
    • about / Selection of the best K
  • host
    • about / VirtualBox
  • hyperparameter optimization
    • about / Neural networks and hyperparameter optimization
  • hyperparameters
    • tuning / Hyperparameter tuning

I

  • identity function
    • about / Autoencoders
  • ImportError
    • about / The installation of packages
  • incremental learning
    • about / Deep learning with large files – incremental learning
  • incremental PCA
    • about / Incremental PCA
  • independent and identically distributed (i.i.d) / Stochastic gradient descent
  • input validator
    • URL / Understanding the VW data format
  • IPython
    • about / Introducing Jupyter/IPython
    • URL / Introducing Jupyter/IPython
  • Iris datasets
    • URL / Understanding the Scikit-learn SVM implementation

J

  • Java Development Kit (JDK)
    • about / H2O
  • Jupyter
    • about / Introducing Jupyter/IPython
    • URL, for example / Introducing Jupyter/IPython
    • URL, for installing / Introducing Jupyter/IPython
    • URL / Introducing Jupyter/IPython
  • Jupyter Notebook Viewer
    • URL / Introducing Jupyter/IPython
  • Just-in-Time (JIT) compiler
    • about / Scale up with Python

K

  • K-means
    • about / Clustering – K-means
    • initialization methods / Initialization methods
    • assumptions / K-means assumptions
    • selecting / Selection of the best K
    • scaling / Scaling K-means – mini-batch
    • with H2O / K-means with H2O
  • Kaggle
    • URL / XGBoost
  • KDD99 challenge
    • reference / Spark on the KDD99 dataset
  • Keras
    • about / Keras
    • URL / Keras, Keras and TensorFlow installation
    • installing / Keras and TensorFlow installation
    • convolutional neural networks (CNN), in TensorFlow / Convolutional Neural Networks in TensorFlow through Keras
  • KSVM
    • about / A few examples using reductions for SVM and neural nets
    • reference link / A few examples using reductions for SVM and neural nets

L

  • large scale cloud services
    • URL / Deep learning with large files – incremental learning
  • large scale deep learning
    • with H2O / Large scale deep learning with H2O
  • large scale machine learning
    • Python / Python for large scale machine learning
  • LaSVM
    • URL / Other alternatives for SVM fast learning
    • about / Other alternatives for SVM fast learning
  • Latent Dirichlet Allocation (LDA)
    • about / Gensim, LDA, Nonlinear and faster with Vowpal Wabbit
    • scaling / Scaling LDA – memory, CPUs, and machines
  • Latent Semantic Analysis (LSA)
    • about / Gensim
  • liblinear-cdblock library
    • URL / Other alternatives for SVM fast learning
  • Library for Support Vector Machines (LIBSVM)
    • about / Scale up with Python, Understanding the Scikit-learn SVM implementation
    • URL / Understanding the Scikit-learn SVM implementation
  • linear regression
    • with SGD / Linear regression with SGD
  • Linux precompiled binaries
    • URL / Installing VW

M

  • machine learning
    • on TensorFlow, with SkFlow / Machine learning on TensorFlow with SkFlow
    • incremental learning / Deep learning with large files – incremental learning
  • machine learning, with Spark
    • about / Machine learning with Spark
    • dataset, reading / Reading the dataset
    • feature engineering / Feature engineering
    • training, giving to learner / Training a learner
    • learner's performance, evaluating / Evaluating a learner's performance
    • ML pipeline, power / The power of the ML pipeline
    • manual tuning / Manual tuning
    • cross-validation / Cross-validation, Final cleanup
  • MapReduce
    • about / Explaining scalability in detail, MapReduce
    • data chunker / MapReduce
    • mapper / MapReduce
    • shuffler / MapReduce
    • reducer / MapReduce
    • output writer / MapReduce
  • matplotlib package
    • about / The matplotlib package
    • URL / The matplotlib package
  • maximization (M)
    • about / Clustering – K-means
  • memory profiler
    • about / Other useful packages to install on your system
  • mini batches
    • about / The neural network architecture
  • Minimalist GNU for Windows (MinGW) compiler
    • about / XGBoost
    • URL / XGBoost
  • MLlib / Machine learning with Spark
  • momentum
    • training / The neural network architecture
  • MrJob
    • about / MapReduce
  • music recommendation engine
    • references / GPU Computing

N

  • Nesterov momentum
    • about / The neural network architecture
  • neural network
    • architecture / The neural network architecture
    • softmax, for classification / The neural network architecture
    • forward propagation / The neural network architecture
    • backpropagation / The neural network architecture
    • backpropagation, common problems / The neural network architecture
    • backpropagation, with mini batch / The neural network architecture
    • momentum, training / The neural network architecture
    • Nesterov momentum / The neural network architecture
    • adaptive gradient (ADAGRAD) / The neural network architecture
    • resilient backpropagation (RPROP) / The neural network architecture
    • RMSPROP / The neural network architecture
    • architecture, selecting / What and how neural networks learn, Choosing the right architecture
    • implementing / Neural networks in action
    • sknn, parallelizing / Parallelization for sknn
    • hyperparameter, optimizing / Neural networks and hyperparameter optimization
    • decision boundaries / Neural networks and decision boundaries
    • on GPU, with theanets / Deep learning with theanets
    • performing, in TensorFlow / A neural network from scratch in TensorFlow
    • regularization / Neural networks and regularization
  • Neural Network Toolbox (NNT)
    • about / Other useful packages to install on your system
  • NeuroLab
    • about / Other useful packages to install on your system
  • nonlinear SVMs
    • pursuing, by subsampling / Pursuing nonlinear SVMs by subsampling
  • NumPy
    • about / Scale up with Python, NumPy
    • URL / NumPy
  • NVIDIA
    • URL / GPU computing
  • NVIDIA CUDA Toolkit
    • URL / GPU computing

O

  • one-against-all (OAA)
    • about / The covertype dataset crunched by VW
  • one-hot encoder
    • reference link / The hashing trick
  • one-vs-all (OVA) / Achieving SVM at scale with SGD, The Scikit-learn SGD implementation
  • OpenCL project
    • URL / GPU computing
  • OpenSSH
    • URL, for Windows / Using the VM
  • Ordinary least squares (OLS) / The Scikit-learn SGD implementation
  • out-of-core learning
    • about / Out-of-core learning
    • subsampling, as viable option / Subsampling as a viable option
    • instances, optimizing / Optimizing one instance at a time
    • building / Building an out-of-core learning system
  • overshooting
    • about / The neural network architecture

P

  • pandas
    • about / Scale up with Python, Pandas
    • URL / Pandas
  • pandas documentation
    • reference link / Using pandas I/O tools
  • pip
    • about / The installation of packages
    • URL / The installation of packages
  • pooling layer
    • about / The pooling layer
  • pretraining
    • about / Autoencoders and unsupervised learning
  • principal component analysis (PCA)
    • about / Machine learning on TensorFlow with SkFlow, Feature decomposition – PCA, Autoencoders
    • reference link / Machine learning on TensorFlow with SkFlow
    • randomized PCA / Randomized PCA
    • incremental PCA / Incremental PCA
    • sparse PCA / Sparse PCA
    • with H2O / PCA with H2O
  • PuTTY
    • URL / Using the VM
  • PyPy
    • about / Scale up with Python
    • URL / Scale up with Python
  • pySpark
    • about / pySpark
  • pySpark, actions
    • reduce(function) / pySpark
    • count() / pySpark
    • countByKey() / pySpark
    • collect() / pySpark
    • first() / pySpark
    • take(N) / pySpark
    • takeSample(withReplacement, N, seed) / pySpark
    • takeOrdered(N, ordering) / pySpark
    • saveAsTextFile(path) / pySpark
  • pySpark, methods
    • cache() / pySpark
    • persist(storage) / pySpark
    • unpersist() / pySpark
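The pySpark actions and methods indexed above are easiest to grasp in combination; a minimal sketch on a toy RDD, assuming a local pyspark installation:

    # A toy RDD exercising several of the indexed pySpark calls.
    from pyspark import SparkContext

    sc = SparkContext('local', 'index-sketch')
    rdd = sc.parallelize(range(10)).cache()    # method: keep the RDD in memory

    print(rdd.count())                         # action: 10
    print(rdd.first())                         # action: 0
    print(rdd.take(3))                         # action: [0, 1, 2]
    print(rdd.reduce(lambda a, b: a + b))      # action: 45
    print(rdd.takeOrdered(3, lambda x: -x))    # action: [9, 8, 7]
    sc.stop()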
  • Python
    • about / Introducing Python
    • advantages / Introducing Python
    • URL / Introducing Python, Step-by-step installation
    • scaling up, with / Scale up with Python
    • scaling out, with / Scale out with Python
    • large scale machine learning / Python for large scale machine learning
    • 2 and Python 3, selecting between / Choosing between Python 2 and Python 3
    • installing / Installing Python, Step-by-step installation
    • packages, installing / The installation of packages
    • packages, upgrading / Package upgrades
    • scientific distribution / Scientific distributions
    • Jupyter / Introducing Jupyter/IPython
    • IPython / Introducing Jupyter/IPython
    • reference link / Describing the target
    • integrating, with Vowpal Wabbit (VW) / Python integration
  • Python-Future
    • URL / Choosing between Python 2 and Python 3
  • Python 2
    • and Python 3, selecting between / Choosing between Python 2 and Python 3
  • Python 2-3 compatible code
    • URL / Choosing between Python 2 and Python 3
  • Python 3
    • and Python 2, selecting between / Choosing between Python 2 and Python 3
    • URL, for compatibility / Choosing between Python 2 and Python 3
  • Python Package Index (PyPI)
    • about / The installation of packages
    • URL / The installation of packages
  • Python packages
    • about / Python packages
    • NumPy / NumPy
    • SciPy / SciPy
    • pandas / Pandas
    • Scikit-learn / Scikit-learn
  • pyvw / Python integration

Q

  • quadratic programming
    • URL / Support Vector Machines

R

  • radial basis functions (RBF) / Support Vector Machines
  • random forest
    • about / Random forest and extremely randomized forest
  • random forest, parameters for bagging
    • n_estimators / Random forest and extremely randomized forest
    • max_features / Random forest and extremely randomized forest
    • min_samples_leaf / Random forest and extremely randomized forest
    • max_depth / Random forest and extremely randomized forest
    • criterion / Random forest and extremely randomized forest
    • min_samples_split / Random forest and extremely randomized forest
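A sketch of where these bagging parameters plug into Scikit-learn's RandomForestClassifier; the values are placeholders, not the book's recommendations:

    # Illustrative random forest setup using the parameters indexed above.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    clf = RandomForestClassifier(n_estimators=100,      # number of trees
                                 max_features='sqrt',   # features tried per split
                                 min_samples_leaf=1,    # minimum samples in a leaf
                                 max_depth=None,        # grow each tree fully
                                 criterion='gini',      # split-quality measure
                                 min_samples_split=2)   # minimum samples to split a node
    clf.fit(X, y)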
  • randomized PCA
    • about / Randomized PCA
  • randomized search
    • about / Neural networks and hyperparameter optimization, Fast parameter optimization with randomized search
    • fast parameter optimization / Fast parameter optimization with randomized search
    • extremely randomized trees / Extremely randomized trees and large datasets
    • large datasets / Extremely randomized trees and large datasets
  • random subspace method
    • about / Stochastic gradient boosting and gridsearch on H2O
  • Read-Eval-Print Loop (REPL)
    • about / Step-by-step installation
  • receiver operating characteristic (ROC) / Describing the target
  • reconstruction error
    • about / Autoencoders
  • rectified linear unit (ReLU)
    • about / The neural network architecture
  • regularization
    • feature selection / Feature selection by regularization
    • about / Neural networks and regularization
  • residual network (ResNet)
    • about / GPU Computing
  • resilient backpropagation (RPROP)
    • about / The neural network architecture
  • Resilient Distributed Dataset (RDD)
    • about / pySpark
  • RMSPROP
    • about / The neural network architecture

S

  • scalability
    • about / Explaining scalability in detail
    • large scale examples, creating / Making large scale examples
    • Python / Introducing Python
    • Python, scaling up with / Scale up with Python
    • Python, scaling out with / Scale out with Python
  • scientific distribution
    • about / Scientific distributions
  • Scikit-learn
    • about / The installation of packages, Scikit-learn, Understanding the Scikit-learn SVM implementation
    • URL / Scikit-learn
    • matplotlib package / The matplotlib package
    • Gensim / Gensim
    • H2O / H2O
    • XGBoost / XGBoost
    • Theano / Theano
    • TensorFlow / TensorFlow
    • sknn library / The sknn library
    • theanets / Theanets
    • Keras / Keras
    • useful packages, installing / Other useful packages to install on your system
    • other alternatives / Other alternatives for SVM fast learning
    • Vowpal Wabbit (VW) / Nonlinear and faster with Vowpal Wabbit
  • Scikit-learn documentation
    • reference link / Describing the target
  • scikit-neuralnetwork
    • about / The sknn library
  • SciPy
    • about / SciPy
    • URL / SciPy
  • SGD
    • used, for achieving SVM / Achieving SVM at scale with SGD
    • non-linearity, including / Including non-linearity in SGD
    • explicit high-dimensional mappings / Trying explicit high-dimensional mappings
    • linear regression / Linear regression with SGD
  • SGD function, parameters
    • lr / Keras and TensorFlow installation
    • decay / Keras and TensorFlow installation
    • momentum / Keras and TensorFlow installation
    • nesterov / Keras and TensorFlow installation
    • optimizer / Keras and TensorFlow installation
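These are arguments of the Keras SGD optimizer (argument names as in the Keras versions contemporary with the book); a minimal sketch with illustrative values:

    # Illustrative Keras SGD optimizer wired into a tiny model.
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Dense(1, input_dim=4, activation='sigmoid'))

    sgd = SGD(lr=0.01,          # learning rate
              decay=1e-6,       # learning-rate decay per update
              momentum=0.9,     # classical momentum term
              nesterov=True)    # switch to Nesterov momentum
    model.compile(optimizer=sgd, loss='binary_crossentropy')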
  • SGD Scikit-learn implementation
    • reference link / Defining SGD learning parameters
  • sigmoid
    • about / The neural network architecture
  • Silhouette
    • about / Selection of the best K
  • Singular Value Decomposition (SVD)
    • about / Feature decomposition – PCA
  • SkFlow
    • machine learning, on TensorFlow / Machine learning on TensorFlow with SkFlow
  • sklearn.svm module, parameters
    • C / Understanding the Scikit-learn SVM implementation
    • kernel / Understanding the Scikit-learn SVM implementation
    • degree / Understanding the Scikit-learn SVM implementation
    • gamma / Understanding the Scikit-learn SVM implementation
    • nu / Understanding the Scikit-learn SVM implementation
    • epsilon / Understanding the Scikit-learn SVM implementation
    • penalty / Understanding the Scikit-learn SVM implementation
    • loss / Understanding the Scikit-learn SVM implementation
    • dual / Understanding the Scikit-learn SVM implementation
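These parameters are spread across the sklearn.svm estimators; a sketch with illustrative values showing where each one belongs:

    # Where each indexed sklearn.svm parameter lives, with placeholder values.
    from sklearn.svm import SVC, NuSVC, SVR, LinearSVC

    svc = SVC(C=1.0,              # penalty for margin violations
              kernel='rbf',       # kernel type: 'linear', 'poly', 'rbf', ...
              degree=3,           # polynomial degree, used by the 'poly' kernel
              gamma=0.1)          # kernel coefficient for 'rbf'/'poly'/'sigmoid'

    nusvc = NuSVC(nu=0.5)         # upper bound on the fraction of margin errors

    svr = SVR(epsilon=0.1)        # width of the epsilon-insensitive tube

    linear = LinearSVC(penalty='l2',           # regularization norm
                       loss='squared_hinge',   # hinge-loss variant
                       dual=True)              # solve the dual formulation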
  • sklearn module
    • about / The installation of packages
  • sknn
    • parallelizing / Parallelization for sknn
  • sknn library
    • about / The sknn library
    • URL / The sknn library
  • sknn package
    • URL / Neural networks in action
  • SofiaML
    • about / Other alternatives for SVM fast learning
    • URL / Other alternatives for SVM fast learning
  • softmax
    • for classification / The neural network architecture
  • Spark
    • about / Spark, Machine learning with Spark
    • pySpark / pySpark
    • on KDD99 dataset / Spark on the KDD99 dataset
  • Spark, methods
    • map(function) / pySpark
    • flatMap(function) / pySpark
    • filter(function) / pySpark
    • sample(withReplacement, fraction, seed) / pySpark
    • distinct() / pySpark
    • coalesce(numPartitions) / pySpark
    • repartition(numPartitions) / pySpark
    • groupByKey() / pySpark
    • reduceByKey(function) / pySpark
    • sortByKey(ascending) / pySpark
    • union(otherRDD) / pySpark
    • intersection(otherRDD) / pySpark
    • join(otherRDD) / pySpark
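A companion sketch for these transformations, on toy key/value RDDs, again assuming a local pyspark installation:

    # Toy key/value RDDs exercising a few of the indexed transformations.
    from pyspark import SparkContext

    sc = SparkContext('local', 'transformations-sketch')
    sales = sc.parallelize([('a', 1), ('b', 2), ('a', 3)])
    names = sc.parallelize([('a', 'alpha'), ('b', 'beta')])

    print(sales.reduceByKey(lambda x, y: x + y).collect())   # sums per key: a -> 4, b -> 2
    print(sales.sortByKey(True).collect())                   # sorted by key, ascending
    print(sales.join(names).collect())                       # pairs values sharing a key
    print(sales.map(lambda kv: kv[0]).distinct().collect())  # unique keys: 'a', 'b'
    sc.stop()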
  • Spark DataFrames
    • working with / Working with Spark DataFrames
  • sparse autoencoder
    • URL / Autoencoders
  • sparse PCA
    • about / Sparse PCA
  • sparsity parameters
    • about / Autoencoders
  • Spyder
    • about / Scientific distributions
  • SQLite
    • reference link / Working with databases
  • stacked denoising autoencoders
    • deep learning / Autoencoders
  • standalone machine
    • big data, handling / From a standalone machine to a bunch of nodes
    • distributed framework, need for / Why do we need a distributed framework?
  • steepest descent
    • about / Gradient Boosting Machines
  • stochastic gradient boosting
    • about / Stochastic gradient boosting and gridsearch on H2O
    • URL / Stochastic gradient boosting and gridsearch on H2O
  • stochastic gradient descent (SGD) / Stochastic gradient descent
  • stochastic learning
    • about / Stochastic learning
    • batch gradient descent / Batch gradient descent
    • stochastic gradient descent (SGD) / Stochastic gradient descent
    • Scikit-learn SGD implementation / The Scikit-learn SGD implementation
    • SGD learning parameters, defining / Defining SGD learning parameters
  • stream handling
    • reference link / Datasets to try the real thing yourself
  • stride
    • about / The convolution layer
  • subsample
    • URL / Extremely randomized trees and large datasets
  • subsampling
    • nonlinear SVMs, pursuing / Pursuing nonlinear SVMs by subsampling
  • subsampling layer
    • about / The pooling layer
  • Support Vector Machines (SVMs)
    • about / Support Vector Machines, The Scikit-learn SGD implementation
    • hinge loss / Hinge loss and its variants
    • Scikit-learn / Understanding the Scikit-learn SVM implementation
    • nonlinear SVMs, pursuing / Pursuing nonlinear SVMs by subsampling
    • achieving, with SGD / Achieving SVM at scale with SGD

T

  • tanh
    • about / The neural network architecture
  • TDM-GCC x64
    • URL / Theano
  • TensorFlow
    • about / TensorFlow
    • URL / TensorFlow
    • references / TensorFlow
    • installing / TensorFlow installation, Keras and TensorFlow installation
    • operations / TensorFlow operations
    • machine learning, with SkFlow / Machine learning on TensorFlow with SkFlow
    • convolutional neural networks (CNN), through Keras / Convolutional Neural Networks in TensorFlow through Keras
  • TensorFlow, operations
    • GPU, computing / GPU computing
    • linear regression, with SGD / Linear regression with SGD
    • neural network, performing / A neural network from scratch in TensorFlow
  • theanets
    • about / Theanets, Deep learning with theanets
    • URL / Theanets, Deep learning with theanets
    • neural network, on GPU / Deep learning with theanets
  • Theano
    • about / Theano
    • URL / Theano, Theano – parallel computing on the GPU, Installing Theano
    • URL, for installing / Theano
    • used, for parallel computing on GPU / Theano – parallel computing on the GPU
    • installing / Installing Theano
  • training_classification_error (.089)
    • about / Large scale deep learning with H2O

U

  • UCI Machine Learning Repository / Datasets to try the real thing yourself
  • Uniform Resource Identifier (URI)
    • about / HDFS
  • universal approximation theorem
    • about / What and how neural networks learn
  • University of California, Irvine (UCI) / Datasets to try the real thing yourself
  • unsupervised learning
    • autoencoders / Autoencoders and unsupervised learning
  • unsupervised methods
    • about / Unsupervised methods
  • unsupervised pretraining
    • about / Deep learning and unsupervised pretraining
  • US Census dataset
    • URL / Scaling K-means – mini-batch

V

  • Vagrant
    • about / Vagrant
    • URL / Vagrant
  • validation_classification_error (.0954)
    • about / Large scale deep learning with H2O
  • vanishing gradient problem
    • about / The neural network architecture
  • variables
    • sharing, across cluster nodes / Sharing variables across cluster nodes
  • VirtualBox
    • about / VirtualBox
    • URL / VirtualBox
  • virtual machine
    • setting up / Setting up the VM for this chapter
  • virtual machines (VM)
    • setting up / Setting up the VM
    • VirtualBox / VirtualBox
    • about / VirtualBox
    • Vagrant / Vagrant
    • using / Using the VM
  • Vowpal Wabbit (VW)
    • about / Scale up with Python, Nonlinear and faster with Vowpal Wabbit
    • installing / Installing VW
    • URL, for compiling / Installing VW
    • data format / Understanding the VW data format
    • URL, for dataset / Understanding the VW data format
    • Python, integration / Python integration
    • examples / A few examples using reductions for SVM and neural nets
    • URL, for neural networks / A few examples using reductions for SVM and neural nets
    • faster bike-sharing example / Faster bike-sharing
    • covertype dataset / The covertype dataset crunched by VW
  • vowpal_porpoise / Python integration

W

  • Wabbit Wappa / Python integration
  • Whitening
    • about / Machine learning on TensorFlow with SkFlow
  • wide networks
    • about / The hidden layer
  • WinPython
    • URL / Scientific distributions
    • about / Scientific distributions
  • word2vec
    • about / Gensim

X

  • XGBoost
    • about / Scale up with Python, XGBoost
    • URL / XGBoost
    • URL, for installing / XGBoost
    • URL, for code / XGBoost

Y

  • Yet Another Resource Negotiator (YARN)
    • about / Scale out with Python, YARN

Z

  • zero-padding
    • about / The convolution layer