Index
A
- access keys
- accuracy
- improving, dictionary used / Improving accuracy using a dictionary
- activation function
- about / Artificial neural networks
- Adult dataset
- Advertisements dataset
- URL / Feature creation
- affinity analysis
- example / A simple affinity analysis example
- defining / What is affinity analysis?
- product recommendations / Product recommendations
- dataset, loading with NumPy / Loading the dataset with NumPy
- ranking of rules, implementing / Implementing a simple ranking of rules
- ranking, to find best rules / Ranking to find the best rules
- about / Affinity analysis
- algorithms / Algorithms for affinity analysis
- parameters, selecting / Choosing parameters
- Amazon S3 console
- API endpoint
- application
- defining / Application
- word counts, extracting / Extracting word counts
- dictionaries, converting to matrix / Converting dictionaries to a matrix
- Naive Bayes classifier, training / Training the Naive Bayes classifier
- about / Application, Application
- data, obtaining / Getting the data, Getting the data
- neural network, creating / Creating the neural network
- neural network, training with training dataset / Putting it all together
- Naive Bayes algorithm / Naive Bayes prediction
- apps, Twitter account
- Apriori algorithm / The Apriori algorithm
- Apriori implementation
- about / The Apriori implementation
- Apriori algorithm / The Apriori algorithm
- defining / Implementation
- arbitrary websites
- text, extracting from / Extracting text from arbitrary websites, Putting it all together
- stories, finding / Finding the stories in arbitrary websites
- data mining, using / Putting it all together
- nodes, ignoring / Putting it all together
- HTML file, parsing / Putting it all together
- Artificial Neural Networks
- about / Artificial neural networks
- association rules
- extracting / Extracting association rules
- evaluating / Evaluation
- authorship analysis
- defining / Attributing documents to authors
- applications / Applications and use cases
- use cases / Applications and use cases
- about / Applications and use cases
- authorship attribution / Attributing authorship
- data, obtaining / Getting the data
- authorship analysis, problems
- authorship profiling / Attributing documents to authors
- authorship verification / Attributing documents to authors
- authorship clustering / Attributing documents to authors
- authorship attribution / Attributing authorship
- AWS CLI
- installing / Training on Amazon's EMR infrastructure
- AWS console
B
- back propagation (backprop) algorithm / Back propagation
- bagging
- about / Random forests
- BatchIterator instance
- creating / Creating the neural network
- Bayes' theorem / Bayes' theorem
- about / Bayes' theorem
- equation / Bayes' theorem
- bias
- about / How do ensembles work?
- big data
- about / Big data
- use cases / Application scenario and goals
- Bleeding Edge code
- installing / Scalability with the nearest neighbor
- URL / Scalability with the nearest neighbor
- blog posts
- extracting / Extracting the blog posts
- blogs dataset
- about / Blogs dataset
C
- CAPTCHA
- creating / Drawing basic CAPTCHAs
- CAPTCHAs
- references / Better (worse?) CAPTCHAs
- defining / Better (worse?) CAPTCHAs
- CART (Classification and Regression Trees)
- about / Decision trees
- character n-grams
- about / Character n-grams
- extracting / Extracting character n-grams
- CIFAR-10
- class
- about / A simple classification example
- classification
- example / A simple classification example
- about / What is classification?
- examples / What is classification?
- dataset, loading / Loading and preparing the dataset
- dataset, preparing / Loading and preparing the dataset
- OneR algorithm, implementing / Implementing the OneR algorithm
- algorithm, testing / Testing the algorithm
- classifiers
- comparing / Comparing classifiers
- closed problem
- about / Attributing authorship
- cluster evaluation
- URL / Evaluating the results
- clustering
- about / Grouping news articles
- coassociation matrix
- defining / Evidence accumulation
- complex algorithms
- references / More complex algorithms
- complex features
- references / More complex features
- confidence
- about / Implementing a simple ranking of rules
- computing / Implementing a simple ranking of rules
- connected components
- about / Connected components
- Cosine distance
- about / Distance metrics
- Coursera
- about / More resources
- references / More resources
- Coval font, Open Font Library
- URL / Drawing basic CAPTCHAs
- CPU
- defining / When to use GPUs for computation
- cross-fold validation framework
- defining / Running the algorithm
- CSV (Comma Separated Values)
- about / Collecting the data
D
- data, blogging
- URL / Getting the data
- data, Corpus
- URL / Getting the data
- Dataframe
- about / Using pandas to load the dataset
- data mining
- defining / Introducing data mining
- dataset
- loading / Loading the dataset, Loading the dataset, An introduction to Lasagne
- data, collecting / Collecting the data
- URL / Collecting the data
- loading, pandas used / Using pandas to load the dataset
- cleaning up / Cleaning up the dataset
- new features, extracting / Extracting new features
- classifying, with existing model / Classifying with an existing model
- follower information, obtaining from Twitter / Getting follower information from Twitter
- network, building / Building the network
- graph, creating / Creating a graph
- Similarity graph, creating / Creating a similarity graph
- creating / Creating the dataset
- CAPTCHAs, drawing / Drawing basic CAPTCHAs
- image, splitting into individual letters / Splitting the image into individual letters
- training dataset, creating / Creating a training dataset
- training dataset, adjusting to methodology / Adjusting our training dataset to our methodology
- datasets
- about / Introducing data mining
- samples / Introducing data mining
- features / Introducing data mining
- example / Introducing data mining
- URL / Obtaining the dataset, Extending the IPython Notebook
- references / New datasets
- decision tree implementation
- min_samples_split / Parameters in decision trees
- min_samples_leaf / Parameters in decision trees
- decision trees
- about / Decision trees
- parameters / Parameters in decision trees
- Gini impurity / Parameters in decision trees
- Information gain / Parameters in decision trees
- using / Using decision trees
- dictionary
- used, for improving accuracy / Improving accuracy using a dictionary
- ranking mechanisms, for words / Ranking mechanisms for words
- improved prediction function, testing / Putting it all together
- DictVectorizer class
- disambiguation
- about / Disambiguation
- data, downloading from social network / Downloading data from a social network
- dataset, loading / Loading and classifying the dataset
- dataset, classifying / Loading and classifying the dataset
- replicable dataset, creating from Twitter / Creating a replicable dataset from Twitter
- discretization
- about / Common feature patterns
- discretization algorithm
- defining / Loading and preparing the dataset
- documents
- attributing, to authors / Attributing documents to authors
E
- EC2 service console
- Eclat algorithm
- about / Algorithms for affinity analysis
- URL / The Eclat algorithm
- implementing / The Eclat algorithm
- Elastic Map Reduce (EMR)
- Enron dataset
- using / Using the Enron dataset
- accessing / Accessing the Enron dataset
- URL / Accessing the Enron dataset
- dataset loader, creating / Creating a dataset loader
- existing parameter space, using / Putting it all together
- classifier, using / Putting it all together
- evaluation / Evaluation
- ensembles
- clustering / Clustering ensembles
- evidence accumulation / Evidence accumulation
- working / How it works
- implementing / Implementation
- environment
- setting up / Setting up the environment
- epochs
- about / Back propagation
- Euclidean distance
- about / Distance metrics
- evaluation, of clustering algorithms
- references / Evaluation
- Evidence Accumulation Clustering (EAC)
- about / Evidence accumulation
- defining / Evidence accumulation
- Excel, pandas
- URL / More on pandas
F
- F1-score
- about / Evaluation using the F1-score
- computing / Evaluation using the F1-score
- using / Evaluation using the F1-score
- feature-based normalization
- about / Standard preprocessing
- feature creation
- about / Feature creation
- Principal Component Analysis (PCA) / Principal Component Analysis
- feature extraction
- about / Feature extraction
- reality, representing in models / Representing reality in models
- common feature patterns / Common feature patterns
- good features, creating / Creating good features
- features, dataset
- URL / More complex pipelines
- feature selection
- about / Feature selection
- best individual features, selecting / Selecting the best individual features
- feed-forward neural network
- filename, data
- Blogger ID / Getting the data
- Gender / Getting the data
- Age / Getting the data
- Industry / Getting the data
- Star Sign / Getting the data
- FP-growth algorithm
- about / Algorithms for affinity analysis
- frequent itemsets
- about / Algorithms for affinity analysis
- functions, transformer
- fit() / The transformer API
- transform() / The transformer API
- function words
- about / Function words
- counting / Counting function words
- classifying with / Classifying with function words
G
- GPU
- using, for computation / When to use GPUs for computation
- benefits / When to use GPUs for computation
- avenues, defining / When to use GPUs for computation
- code, running on / Running our code on a GPU
- GPU optimization
- about / GPU optimization
- graph
- creating / Creating a graph
- gzip
- about / Accessing the Enron dataset
H
- Hadoop
- about / Hadoop MapReduce
- Distributed File System (HDFS) / Hadoop MapReduce
- YARN / Hadoop MapReduce
- Pig / Hadoop MapReduce
- Hive / Hadoop MapReduce
- HBase / Hadoop MapReduce
- courses / Courses on Hadoop
- Hadoop MapReduce
- about / Hadoop MapReduce
- hash function
- hidden layer
- about / An introduction to neural networks
- creating / An introduction to Lasagne
- hierarchical clustering
- about / Evidence accumulation
I
- image
- extracting / Application scenario and goals
- image datasets
- URL / Mahotas
- input layer
- installation instructions, scikit-learn
- URL / Installing scikit-learn
- instructions, AWS CLI
- intra-cluster distance
- about / Optimizing criteria
- Ionosphere
- about / Loading the dataset
- URL / Loading the dataset
- Ionosphere Nearest Neighbor
- about / Loading the dataset
- IPython
- installing / Installing IPython
- URL / Installing IPython
- IPython Notebook
- creating / Downloading data from a social network
- URL / Extending the IPython Notebook
- IPython notebook
- Iris Setosa / Loading and preparing the dataset
- Iris Versicolour / Loading and preparing the dataset
- Iris Virginica / Loading and preparing the dataset
J
- Jaccard Similarity
- about / Creating a similarity graph
- jQuery library
- JSON
- about / Loading and classifying the dataset
- and dataset, comparing / Loading and classifying the dataset
K
- k-means algorithm
- about / The k-means algorithm
- assignment phase / The k-means algorithm
- updating phase / The k-means algorithm
- Kaggle
- URL / More resources
- about / More resources
- karma
- about / Reddit as a data source
- Keras
- URL / Keras and Pylearn2
- kernel
- kernel parameter
- about / Kernels
- kernels / Kernels
L
- Lasagne
- about / An introduction to Lasagne
- URL / An introduction to Lasagne
- Levenshtein edit distance
- about / Ranking mechanisms for words
- computing / Ranking mechanisms for words
- Locality-Sensitive Hashing (LSH)
- local n-grams
- references / Local n-grams
- about / Local n-grams
- local optima
- about / Back propagation
- log probabilities
- using / Putting it all together
M
- machine-learning workflow
- training / Testing the algorithm
- testing / Testing the algorithm
- Mahotas
- Manhattan distance
- about / Distance metrics
- MapReduce
- about / MapReduce
- defining / Intuition
- WordCount example / A word count example
- Hadoop MapReduce / Hadoop MapReduce
- matplotlib
- URL / scikit-learn estimators
- MD5 algorithm
- metadata
- about / Disambiguation
- MiniBatchKMeans
- about / Implementation
- Minimum Spanning Tree (MST)
- about / Evidence accumulation
- computing / Evidence accumulation
- movie recommendation problem
- about / The movie recommendation problem
- dataset, obtaining / Obtaining the dataset
- loading, with pandas / Loading with pandas
- sparse data formats / Sparse data formats
- mrjob
- mrjob package / The mrjob package
- multiple SVMs
- creating / Classifying with SVMs
N
- n-gram
- about / Character n-grams
- n-grams
- Naive Bayes
- about / Naive Bayes
- Bayes' theorem / Bayes' theorem
- algorithm / Naive Bayes algorithm
- working / How it works
- Naive Bayes algorithm
- mrjob package / The mrjob package
- blog posts, extracting / Extracting the blog posts
- Naive Bayes model, training / Training Naive Bayes
- classifier, running / Putting it all together
- Amazon's EMR infrastructure, training / Training on Amazon's EMR infrastructure
- Naive Bayes model
- training / Training Naive Bayes
- NaN (Not a Number)
- about / Feature creation
- National Basketball Association (NBA)
- about / Loading the dataset
- URL / Collecting the data
- Natural Language ToolKit (NLTK)
- about / Bag-of-words
- nearest neighbor
- about / scikit-learn estimators
- nearest neighbor algorithm
- Nearest neighbors
- about / Nearest neighbors
- network
- building / Building the network
- networks
- defining / Deeper networks
- NetworkX
- URL / Creating a similarity graph, NetworkX
- defining / NetworkX
- NetworkX package
- about / Creating a graph
- neural network
- training / Training and classifying
- classifying / Training and classifying
- back propagation (backprop) algorithm / Back propagation
- words, predicting / Predicting words
- neural network layers, Lasagne
- network-in-network layers / An introduction to Lasagne
- dropout layers / An introduction to Lasagne
- noise layers / An introduction to Lasagne
- Neural networks
- neural networks
- about / scikit-learn estimators, Artificial neural networks, Deep neural networks
- training / Deep neural networks
- defining / Intuition
- implementing / Implementation
- Theano, defining / An introduction to Theano
- Lasagne, defining / An introduction to Lasagne
- implementing, with nolearn / Implementing neural networks with nolearn
- URL / More resources
- neurons
- about / Artificial neural networks
- news articles
- obtaining / Obtaining news articles
- web API used, for obtaining data / Using a Web API to get data
- Reddit, as data source / Reddit as a data source
- data, obtaining / Getting the data
- clustering / Grouping news articles
- k-means algorithm / The k-means algorithm
- results, evaluating / Evaluating the results
- topic information, extracting from clusters / Extracting topic information from clusters
- clustering algorithms, using as transformers / Using clustering algorithms as transformers
- NLTK
- NLTK installation instructions
- URL / Application
- noise
- adding / Adding noise
- nolearn package
- neural networks, implementing with / Implementing neural networks with nolearn
- nonprogrammers, for Python language
- URL / Installing Python
- n_neighbors
- about / Setting parameters
O
- object classification
- about / Object classification
- one-versus-all classifier
- creating / Classifying with SVMs
- OneR
- about / Implementing the OneR algorithm
- online learning
- about / Online learning
- defining / An introduction to online learning
- implementing / Implementation
- ordinal
- about / Common feature patterns
- output layer
- overfitting
- about / Testing the algorithm
P
- pagination
- pandas
- URL / Collecting the data, More on pandas
- references / More on pandas
- pandas (Python Data Analysis)
- about / Collecting the data
- pandas.read_csv function
- about / Cleaning up the dataset
- pandas documentation
- URL / Engineering new features
- parameters, ensemble process
- n_estimators / Parameters in Random forests
- oob_score / Parameters in Random forests
- n_jobs / Parameters in Random forests
- petal length / Loading and preparing the dataset
- petal width / Loading and preparing the dataset
- pip
- about / Installing Python, Creating a graph
- Pipeline
- creating / Putting it all together
- pipeline
- creating / Application
- NLTKBOW transformer / Putting it all together
- DictVectorizer transformer / Putting it all together
- BernoulliNB classifier / Putting it all together
- pipelines
- about / Pipelines
- Pipelines
- URL / More complex pipelines
- precision
- about / Evaluation using the F1-score
- preprocessing, using pipelines
- about / Preprocessing using pipelines
- features / Preprocessing using pipelines
- features, of animal / Preprocessing using pipelines
- example / An example
- standard preprocessing / Standard preprocessing
- workflow, creating / Putting it all together
- pricing alerts
- Principal Component Analysis (PCA)
- about / Principal Component Analysis
- prior belief
- about / Bayes' theorem
- probabilistic graphical models
- URL / More resources
- probabilities
- computing / Putting it all together
- programmers, for Python language
- URL / Installing Python
- Project Gutenberg
- URL / Getting the data
- Pydoop
- Pylearn2
- about / Keras and Pylearn2
- URL / Keras and Pylearn2
- Python
- using / Using Python and the IPython Notebook
- installing / Installing Python
- URL / Installing Python
- defining / Disambiguation
- Python 3.4
- about / Installing Python
Q
- quotequail package
- about / Creating a dataset loader
R
- RandomForestClassifier
- about / Parameters in Random forests
- random forests
- about / scikit-learn estimators
- defining / Random forests
- ensembles, working / How do ensembles work?
- parameters / Parameters in Random forests
- applying / Applying Random forests
- new features, engineering / Engineering new features
- README
- about / Extracting association rules
- real-time clusterings
- about / Real-time clusterings
- reasons, feature selection
- complexity, reducing / Feature selection
- noise, reducing / Feature selection
- readable models, creating / Feature selection
- recall
- about / Evaluation using the F1-score
- recommendation engine
- building / Recommendation engine
- URL / Recommendation engine
- reddit
- about / Obtaining news articles, Using a Web API to get data
- references / Using a Web API to get data
- Reddit
- about / Reddit as a data source
- URL / Reddit as a data source
- regularization
- reinforcement learning
- URL / Reinforcement learning
- RESTful interface (Representational State Transfer)
- about / Using a Web API to get data
- rules
- support / Implementing a simple ranking of rules
- confidence / Implementing a simple ranking of rules
- finding / Ranking to find the best rules
S
- sample size
- increasing / Increasing the sample size
- scikit-learn
- installing / Installing scikit-learn
- URL / Installing scikit-learn
- scikit-learn estimators
- defining / scikit-learn estimators
- fit() / scikit-learn estimators
- predict() / scikit-learn estimators
- Nearest neighbors / Nearest neighbors
- distance metrics / Distance metrics
- dataset, loading / Loading the dataset
- standard workflow, defining / Moving towards a standard workflow
- fit() function / Moving towards a standard workflow
- predict() function / Moving towards a standard workflow
- algorithm, running / Running the algorithm
- parameters, setting / Setting parameters
- scikit-learn package
- references / Evaluation
- Scikit-learn tutorials
- URL / Scikit-learn tutorials
- self-posts
- about / Reddit as a data source
- sepal length / Loading and preparing the dataset
- sepal width / Loading and preparing the dataset
- shapes adding, CAPTCHAs
- URL / Better (worse?) CAPTCHAs
- Silhouette Coefficient
- about / Optimizing criteria
- computing / Optimizing criteria
- parameters / Optimizing criteria
- Similarity graph
- creating / Creating a similarity graph
- SNAP
- URL / NetworkX
- softmax nonlinearity
- about / An introduction to Lasagne
- Spam detection
- references / Spam detection
- spam filter
- about / Evaluation using the F1-score
- sparse matrix
- about / Distance metrics
- sparse matrix format
- about / Sparse data formats
- sports outcome prediction
- about / Sports outcome prediction
- features / Sports outcome prediction
- stacking
- about / Putting it all together
- Stack Overflow question
- URL / More on pandas
- standings
- loading / Putting it all together
- standings data
- obtaining / Putting it all together
- URL / Putting it all together
- Stratified K Fold
- about / Running the algorithm
- style sheets
- stylometry
- about / Attributing documents to authors
- subgraphs
- finding / Finding subgraphs
- connected components / Connected components
- criteria, optimizing / Optimizing criteria
- subreddits
- support / Implementing a simple ranking of rules
- support vector machines (SVM)
- about / scikit-learn estimators
- SVMs
- about / Support vector machines
- URL / Support vector machines
- classifying with / Classifying with SVMs
- kernels / Kernels
- system
- building, for taking image as input / Application scenario and goals
T
- temporal analysis
- about / Temporal analysis
- text
- about / Disambiguation
- extracting, from arbitrary websites / Extracting text from arbitrary websites
- text transformers
- defining / Text transformers
- word, counting in dataset / Bag-of-words
- bag-of-words model / Bag-of-words
- n-grams / N-grams
- features / Other features
- tf-idf
- about / Bag-of-words
- Theano
- about / An introduction to Theano
- using / An introduction to Theano
- URL / Running our code on a GPU
- Torch
- URL / Keras and Pylearn2
- train_feature_value() function
- about / Implementing the OneR algorithm
- transformer
- creating / Creating your own transformer
- API / The transformer API
- implementing / Implementation details
- unit testing / Unit testing
- tutorial, Google
- URL / Courses on Hadoop
- tutorial, Yahoo
- URL / Courses on Hadoop
- tweet
- about / Disambiguation
- tweets
- loading / Putting it all together
- F1-score, used for evaluation / Evaluation using the F1-score
- features, obtaining from models / Getting useful features from models
- Twitter
- follower information, obtaining from / Getting follower information from Twitter
- Twitter account
- Twitter documentation
U
- UCI Machine Learning data repository
- URL / Loading the dataset
- univariate feature
- unstructured format
- about / Disambiguation
- use cases, computer vision
- about / Use cases
V
- V's, big data
- variance
- virtualenv
- vocabulary
- about / Counting function words
- Vowpal Wabbit
- about / Vowpal Wabbit
- URL / Vowpal Wabbit
W
- web-based API, considerations
- authorization methods / Using a Web API to get data
- rate limiting / Using a Web API to get data
- API Endpoints / Using a Web API to get data
- weight
- weighted edge
- about / Creating a similarity graph
Z
- 7-zip