Index
A
- AcceptedAnswerId / Preselection and processing of attributes
- access key
- about / Using Amazon Web Services
- add-one smoothing / Accounting for unseen words and other oddities
- additive smoothing / Accounting for unseen words and other oddities
- Amazon
- Amazon Web Services (AWS)
- about / Using Amazon Web Services
- using / Using Amazon Web Services
- accessing / Using Amazon Web Services
- virtual machines, creating / Creating your first virtual machines
- cluster generation, automating with StarCluster / Automating the generation of clusters with StarCluster
- Anaconda Python distribution
- reference link / Installing Python
- area under curve (AUC) / Looking behind accuracy – precision and recall
- argmax
- about / Using Naïve Bayes to classify
- Associated Press (AP) / Building a topic model
- association rules
- about / Association rule mining
- Auditory Filterbank Temporal Envelope (AFTE) / Improving classification performance with Mel Frequency Cepstral Coefficients
- Automatic Music Genre Classification (AMGC) / Improving classification performance with Mel Frequency Cepstral Coefficients
- AvgSentLen
- about / Designing more features
- AvgWordLen
- about / Designing more features
B
- bag-of-words approach
- raw text, converting into bag of words / Converting raw text into a bag of words
- words, counting / Counting words
- word count vectors, normalizing / Normalizing word count vectors
- less important words, removing / Removing less important words
- stemming / Stemming
- stop words, on steroids / Stop words on steroids
- drawbacks / Our achievements and goals
- bag of words model / Local feature representations
- BaseEstimator
- about / Our first estimator
- basket analysis
- about / Basket analysis
- useful predictions, obtaining / Obtaining useful predictions
- supermarket shopping baskets, analyzing / Analyzing supermarket shopping baskets
- association rule mining / Association rule mining
- advanced basket analysis / More advanced basket analysis
- BernoulliNB
- big data
- about / Learning about big data
- pipeline, breaking into tasks with jug / Using jug to break up your pipeline into tasks
- tasks, introducing in jug / An introduction to tasks in jug
- functioning, of jug / Looking under the hood
- jug, using for data analysis / Using jug for data analysis
- partial results, reusing / Reusing partial results
- binary classification
- blogs, machine learning
- reference links / Blogs
- Body attribute / Preselection and processing of attributes
C
- classes / Learning to classify classy answers
- classification model
- building / Building our first classification model
- data, holding / Evaluation – holding out data and cross-validation
- cross-validation / Evaluation – holding out data and cross-validation
- structure / Building more complex classifiers
- search procedure / Building more complex classifiers
- gain or loss function / Building more complex classifiers
- classifier / Tuning the classifier
- roadmap, sketching / Sketching our roadmap
- classy answers, classifying / Learning to classify classy answers
- data instance, tuning / Tuning the instance
- tuning / Tuning the classifier
- data, fetching / Fetching the data
- creating / Creating our first classifier
- kNN, starting with / Starting with kNN
- features, engineering / Engineering the features
- training / Training the classifier
- performance, measuring / Measuring the classifier's performance
- features, designing / Designing more features
- logistic regression, using / Using logistic regression
- precision, measuring / Looking behind accuracy – precision and recall
- recall, measuring / Looking behind accuracy – precision and recall
- slimming / Slimming the classifier
- serializing / Ship it!
- building, with FFT / Using FFT to build our first classifier
- experimentation agility, increasing / Increasing experimentation agility
- logistic regression classifier, using / Training the classifier
- confusion matrix, using / Using a confusion matrix to measure accuracy in multiclass problems
- performance, measuring with Receiver-Operator Characteristic (ROC) / An alternative way to measure classifier performance using receiver-operator characteristics
- performance, improving with Mel Frequency Cepstrum (MFC) / Improving classification performance with Mel Frequency Cepstral Coefficients
- clustering
- about / Clustering
- hierarchical clustering / Clustering
- k-means / K-means
- testing / Getting test data to evaluate our ideas on
- posts / Clustering posts
- clustering approaches
- reference link / Clustering
- coefficient of determination
- CommentCount / Preselection and processing of attributes
- compactness / Features and feature engineering
- complex classifier
- nearest neighbor classifier / Nearest neighbor classification
- complex classifiers
- building / Building more complex classifiers
- complex dataset
- about / A more complex dataset and a more complex classifier
- Seeds dataset / Learning about the Seeds dataset
- feature engineering / Features and feature engineering
- computer vision
- image processing / Introducing image processing
- local feature representations / Local feature representations
- Coursera
- URL / Online courses
- CreationDate / Preselection and processing of attributes
- cross-validation / Evaluation – holding out data and cross-validation
- cross-validation schedule / Evaluation – holding out data and cross-validation
- Cross Validated
D
- data, classifier
- fetching / Fetching the data
- slimming, to chewable chunks / Slimming the data down to chewable chunks
- attributes, preselecting / Preselection and processing of attributes
- training data, creating / Defining what is a good answer
- data sources, machine learning
- about / Data sources
- dimensionality reduction / Comparing documents by topics
- roadmap, sketching / Sketching our roadmap
- features, selecting / Selecting features
- feature extraction / Feature extraction
- multidimensional scaling / Multidimensional scaling
- documents
- comparing by topics / Comparing documents by topics
E
- Elastic Compute Cloud (EC2) service
- about / Using Amazon Web Services
- ElasticNet model / L1 and L2 penalties
- English-language Wikipedia model
- building / Modeling the whole of Wikipedia
- ensemble learning / Combining multiple methods
- Enthought Canopy
- reference link / Installing Python
F
- F-measure / Tuning the classifier's parameters
- feature engineering / Features and feature engineering
- feature extraction
- about / Feature extraction
- principal component analysis (PCA) / About principal component analysis
- PCA, sketching / Sketching PCA
- PCA, applying / Applying PCA
- PCA, limitations / Limitations of PCA and how LDA can help
- linear discriminant analysis (LDA) / Limitations of PCA and how LDA can help
- features
- about / The Iris dataset
- feature selection / Features and feature engineering
- feature selection
- about / Selecting features
- redundant features, detecting with filters / Detecting redundant features using filters
- correlation / Correlation
- mutual information / Mutual information
- model, features asking for / Asking the model about the features using wrappers
- methods / Other feature selection methods
- FFT
- used, for building classifier / Using FFT to build our first classifier
- first tiny application, machine learning
- about / Our first (tiny) application of machine learning
- data, reading in / Reading in the data
- data, preprocessing / Preprocessing and cleaning the data
- data, cleaning / Preprocessing and cleaning the data
- model, selecting / Choosing the right model and learning algorithm, Before building our first model…
- learning algorithm, selecting / Choosing the right model and learning algorithm
- fit(document, y=None) method
- about / Our first estimator
- free tier
- about / Using Amazon Web Services
G
- GaussianNB
- get_feature_names() method
- about / Our first estimator
- Grid Engine / Using jug to break up your pipeline into tasks
- GridSearchCV
H
- hierarchical clustering
- about / Clustering
- hierarchical Dirichlet process (HDP) / Choosing the number of topics
- house prices, predicting with regression
- about / Predicting house prices with regression
- multidimensional regression / Multidimensional regression
- cross-validation, for regression / Cross-validation for regression
I
- image processing
- about / Introducing image processing
- images, loading / Loading and displaying images
- images, displaying / Loading and displaying images
- thresholding / Thresholding
- Gaussian blurring / Gaussian blurring
- center, putting in focus / Putting the center in focus
- basic image classification / Basic image classification
- features, computing from images / Computing features from images
- custom features, writing / Writing your own features
- features, used for finding similar images / Using features to find similar images
- harder dataset, classifying / Classifying a harder dataset
- improvement, classifier
- steps / Deciding how to improve
- bias-variance / Bias-variance and their tradeoff
- high bias, fixing / Fixing high bias
- high variance, fixing / Fixing high variance
- high bias / High bias or low bias
- high variance problem, hinting / High bias or low bias
- initial challenge
- solving / Solving our initial challenge
- impression of noise example / Another look at noise
- instance / Creating your first virtual machines
- International Society for Music Information Retrieval (ISMIR) / Improving classification performance with Mel Frequency Cepstral Coefficients
- inverse document frequency (TF-IDF) / Stop words on steroids
- Iris dataset
- about / The Iris dataset
- features / The Iris dataset
- visualization / Visualization is a good first step
- classification model, building / Building our first classification model
J
- jug
- working / Looking under the hood
- using, for data analysis / Using jug for data analysis
- online documentation / Reusing partial results
- running, on cloud machine / Running jug on our cloud machine
- jug cleanup
- about / Reusing partial results
- jug invalidate
- about / Reusing partial results
- jug status --cache
- about / Reusing partial results
K
- k-means
- about / K-means
- Kaggle
L
- labels / Learning to classify classy answers
- Laplace smoothing / Accounting for unseen words and other oddities
- Lasso / L1 and L2 penalties
- latent Dirichlet allocation (LDA)
- about / Latent Dirichlet allocation
- Wikipedia URL / Latent Dirichlet allocation
- topic model, building / Building a topic model
- lift
- about / Association rule mining
- linear discriminant analysis (LDA) / Sketching our roadmap
- local feature representations
- about / Local feature representations
- logistic regression
- about / Using logistic regression
- using / Using logistic regression
- example / A bit of math with a small example
- applying, to post classification problem / Applying logistic regression to our post classification problem
- LSF (Load Sharing Facility) / Using jug to break up your pipeline into tasks
M
- machine learning
- about / Machine learning and Python – a dream team
- first tiny application / Our first (tiny) application of machine learning
- machine learning algorithm
- Machine Learning Toolkit (Milk)
- URL / All that was left out
- matplotlib
- matshow() function / Using a confusion matrix to measure accuracy in multiclass problems
- MDP toolkit
- URL / All that was left out
- Mel Frequency Cepstrum (MFC)
- used, for improving classification performance / Improving classification performance with Mel Frequency Cepstral Coefficients
- MetaOptimize
- MLComp
- model, first tiny application
- selecting / Before building our first model…
- straight line model / Starting with a simple straight line
- complex model / Towards some advanced stuff
- data, viewing / Stepping back to go forward – another look at our data
- training / Training and testing
- testing / Training and testing
- model function, calculating / Answering our initial question
- mpmath
- multiclass classification
- multidimensional regression
- about / Multidimensional regression
- using / Multidimensional regression
- multidimensional scaling (MDS) / Sketching our roadmap
- about / Multidimensional scaling
- MultinomialNB
- MultinomialNB classifier / Tuning the classifier's parameters
- music
- analyzing / Looking at music
- decomposing, into sine wave components / Decomposing music into sine wave components
- music data
- fetching / Fetching the music data
- WAV format, converting into / Converting into a WAV format
N
- Natural Language Toolkit (NLTK) / Stemming
- installing / Installing and using NLTK
- URL / Installing and using NLTK
- vectorizer, extending with / Extending the vectorizer with NLTK's stemmer
- Naïve Bayes
- about / Sketching our roadmap
- Naïve Bayes classifier
- about / Introducing the Naïve Bayes classifier
- Bayes' theorem / Getting to know the Bayes' theorem
- working / Being naïve
- using, to classify / Using Naïve Bayes to classify
- unseen words, accounting for / Accounting for unseen words and other oddities
- arithmetic underflows, accounting for / Accounting for arithmetic underflows
- GaussianNB / Creating our first classifier and tuning it
- MultinomialNB / Creating our first classifier and tuning it
- BernoulliNB / Creating our first classifier and tuning it
- problem, solving / Solving an easy problem first
- classes, using / Using all classes
- parameters, tuning / Tuning the classifier's parameters
- nearest neighbor classifier
- about / Nearest neighbor classification
- neighborhood approach, recommendations
- NumAllCaps
- about / Designing more features
- NumExclams
- about / Designing more features
- NumPy
- about / Introduction to NumPy, SciPy, and matplotlib
- examples / Chewing data efficiently with NumPy and intelligently with SciPy
- reference link, for examples / Chewing data efficiently with NumPy and intelligently with SciPy
- learning / Learning NumPy
- indexing / Indexing
- nonexisting values, handling / Handling nonexisting values
- runtime, comparing / Comparing the runtime
O
- one-dimensional regression
- online course, machine learning
- URL / Online courses
- Otsu / Thresholding
- overfitting
- about / Towards some advanced stuff
- OwnerUserId / Preselection and processing of attributes
P
- parameters, clustering
- tweaking / Tweaking the parameters
- Part Of Speech (POS) / Sketching our roadmap
- Pattern
- URL / All that was left out
- PBS (Portable Batch System) / Using jug to break up your pipeline into tasks
- penalized regression
- about / Penalized or regularized regression
- L1 penalties / L1 and L2 penalties
- L2 penalties / L1 and L2 penalties
- Lasso, using in scikit-learn / Using Lasso or ElasticNet in scikit-learn
- ElasticNet, using in scikit-learn / Using Lasso or ElasticNet in scikit-learn
- Lasso path, visualizing / Visualizing the Lasso path
- P-greater-than-N scenarios / P-greater-than-N scenarios
- example, text documents / An example based on text documents
- hyperparameters, setting in principled way / Setting hyperparameters in a principled way
- Penn Treebank Project
- POS column
- POS tag abbreviations / Determining the word types
- PostTypeId attribute / Preselection and processing of attributes
- pre-processing phase
- achievements / Our achievements and goals
- goals / Our achievements and goals
- precision-recall (P/R) / An alternative way to measure classifier performance using receiver-operator characteristics
- precision_recall_curve() function / Looking behind accuracy – precision and recall
- predictions, rating with regression
- about / Rating predictions and recommendations
- dataset, splitting into training and testing / Splitting into training and testing
- training data, normalizing / Normalizing the training data
- preprocessing
- principal component analysis (PCA) / Sketching our roadmap
- about / About principal component analysis
- properties / About principal component analysis
- sketching / Sketching PCA
- applying / Applying PCA
- limitations / Limitations of PCA and how LDA can help
- PyBrain
- URL / All that was left out
- Python
- installing / Installing Python
- reference link / Installing Python
- Python packages
- installing, on Amazon Linux / Installing Python packages on Amazon Linux
Q
- Q&A sites
- MetaOptimize / What to do when you are stuck
- Cross Validated / What to do when you are stuck
- Stack Overflow / What to do when you are stuck
- TwoToReal / What to do when you are stuck
- Kaggle / What to do when you are stuck
R
- Receiver-Operator Characteristic (ROC)
- used, for measuring classifier performance / An alternative way to measure classifier performance using receiver-operator characteristics
- about / An alternative way to measure classifier performance using receiver-operator characteristics
- recommendations
- neighborhood approach / A neighborhood approach to recommendations
- regression approach / A regression approach to recommendations
- multiple methods, combining / Combining multiple methods
- regression
- cross-validation / Cross-validation for regression
- about / L1 and L2 penalties
- regression approach, recommendations
- resources, machine learning
- online courses / Online courses
- books / Books
- question and answer sites / Question and answer sites
- blogs / Blogs
- data sources / Data sources
- competition / Getting competitive
- Ridge Regression / L1 and L2 penalties
- roadmap
- sketching / Sketching our roadmap
- root mean square error (RMSE)
- about / Predicting house prices with regression
- advantage / Predicting house prices with regression
- roundness / Features and feature engineering
- running status / Creating your first virtual machines
S
- save() function / Increasing experimentation agility
- scikit-learn classification
- about / Classifying with scikit-learn
- decision boundaries, examining / Looking at the decision boundaries
- scikit-learn module
- about / Classifying with scikit-learn
- SciPy
- about / Introduction to NumPy, SciPy, and matplotlib
- URL / Introduction to NumPy, SciPy, and matplotlib
- learning / Learning SciPy
- toolboxes / Learning SciPy
- secret key
- about / Using Amazon Web Services
- Securities and Exchange Commission (SEC) / An example based on text documents
- Seeds dataset
- about / Learning about the Seeds dataset
- features / Learning about the Seeds dataset
- sentiment analysis
- roadmap, sketching / Sketching our roadmap
- Twitter data, fetching / Fetching the Twitter data
- Naïve Bayes classifier / Introducing the Naïve Bayes classifier
- first classifier, creating / Creating our first classifier and tuning it
- tweets, cleaning / Cleaning tweets
- SentiWordNet
- similarity measuring
- about / Measuring the relatedness of posts
- bag-of-words approach / How to do it
- SoX
- sparse
- about / L1 and L2 penalties
- sparsity / Building a topic model
- specgram function / Looking at music
- Speeded Up Robust Features (SURF)
- about / Local feature representations
- stacked learning / Combining multiple methods
- Stack Overflow
- StarCluster
- used, for automating cluster generation / Automating the generation of clusters with StarCluster
- about / Automating the generation of clusters with StarCluster
- URL / Automating the generation of clusters with StarCluster
- stemming
- about / Stemming
T
- Talkbox SciKit
- task
- about / An introduction to tasks in jug
- testing accuracy / Evaluation – holding out data and cross-validation
- TfidfVectorizer parameter / Tuning the classifier's parameters
- thresholding
- about / Thresholding
- TimeToAnswer / Engineering the features
- Title attribute / Preselection and processing of attributes
- toolboxes, SciPy
- cluster / Learning SciPy
- constants / Learning SciPy
- fftpack / Learning SciPy
- integrate / Learning SciPy
- interpolate / Learning SciPy
- io / Learning SciPy
- linalg / Learning SciPy
- ndimage / Learning SciPy
- odr / Learning SciPy
- optimize / Learning SciPy
- signal / Learning SciPy
- sparse / Learning SciPy
- spatial / Learning SciPy
- special / Learning SciPy
- stats / Learning SciPy
- topics
- documents comparing by / Comparing documents by topics
- number of topics, selecting / Choosing the number of topics
- training accuracy / Evaluation – holding out data and cross-validation
- train_model() function
- about / Solving an easy problem first
- transform(documents) method
- about / Our first estimator
- tweets
- cleaning / Cleaning tweets
- Twitter data
- fetching / Fetching the Twitter data
- two levels of cross-validation / Setting hyperparameters in a principled way
- TwoToReal
U
- underfitting
V
- ViewCount / Preselection and processing of attributes
- virtual machines, Amazon Web Services (AWS)
- creating / Creating your first virtual machines
- Python packages, installing on Amazon Linux / Installing Python packages on Amazon Linux
- jug, running on cloud machine / Running jug on our cloud machine
- visual words / Local feature representations
W
- Wikipedia dump
- word types
- about / Taking the word types into account
- determining / Determining the word types
- estimator / Our first estimator
- implementing / Putting everything together