Chapter 2 – Classifying with scikit-learn Estimators


Scalability with the nearest neighbor

https://github.com/jnothman/scikit-learn/tree/pr2532

A naïve implementation of the nearest neighbor algorithm is quite slow: it checks all pairs of points to find those that are close together. Faster implementations exist; for instance, scikit-learn already includes a kd-tree implementation that speeds up the neighbor search considerably.
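
For example, the kd-tree implementation can be requested directly through the algorithm parameter of KNeighborsClassifier. A minimal sketch, using a synthetic dataset purely for illustration:

    from sklearn.datasets import make_classification
    from sklearn.neighbors import KNeighborsClassifier

    # Synthetic dataset, purely for illustration
    X, y = make_classification(n_samples=1000, n_features=10, random_state=14)

    # algorithm='kd_tree' builds a kd-tree instead of comparing all pairs;
    # the default, 'auto', picks a suitable structure based on the data
    estimator = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
    estimator.fit(X, y)
    print(estimator.predict(X[:5]))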

Another way to speed up this search is to use Locality-Sensitive Hashing (LSH). This is a proposed improvement for scikit-learn that hasn't made it into the package at the time of writing. The above link points to a development branch of scikit-learn that allows you to test LSH on a dataset; read through the documentation attached to that branch for details on doing this.

To install it, clone the repository and follow the instructions for installing the bleeding-edge code, available at: http://scikit-learn.org/stable/install.html.

Remember to use the above repository's code rather than the official source. I recommend installing it in a virtualenv or a virtual machine rather than directly on your computer. A great guide to virtualenv can be found at: http://docs.python-guide.org/en/latest/dev/virtualenvs/.
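
Once the branch is installed, you can try LSH out. The following is a rough sketch, assuming the branch exposes an LSHForest estimator in sklearn.neighbors as the pull request proposes; check the branch's documentation for the exact class and parameter names:

    import numpy as np
    from sklearn.neighbors import LSHForest  # only in the development branch

    # Random points, purely for illustration
    X = np.random.random((1000, 10))

    # Build the LSH index, then query for approximate nearest neighbors
    lshf = LSHForest(n_estimators=10, random_state=14)
    lshf.fit(X)
    distances, indices = lshf.kneighbors(X[:5], n_neighbors=3)
    print(indices)

Because LSH returns approximate neighbors, expect a trade-off: queries are much faster on large datasets, but the neighbors found are not guaranteed to be the exact closest points.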

More complex pipelines

http://scikit-learn.org/stable/modules/pipeline.html#featureunion-composite-feature-spaces

The Pipelines we have used in this book follow a single stream: the output of one step is the input of the next.

Pipelines follow the transformer and estimator interfaces as well, which allows us to embed Pipelines within Pipelines. This is a useful construct for very complex models, and it becomes very powerful when combined with Feature Unions, as shown in the above link.

This allows us to extract multiple types of features at a time and then combine them to form a single dataset. For more details, see the example at: http://scikit-learn.org/stable/auto_examples/feature_stacker.html.
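
As a brief sketch of the idea, the following builds a Feature Union that extracts two sets of features in parallel and feeds the combined result into a classifier (the dataset and parameter choices are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.svm import SVC

    dataset = load_iris()
    X, y = dataset.data, dataset.target

    # Extract two types of features in parallel and stack them side by side
    combined_features = FeatureUnion([
        ("pca", PCA(n_components=2)),
        ("univ_select", SelectKBest(k=1)),
    ])

    # A FeatureUnion is itself a transformer, so it slots into a Pipeline
    pipeline = Pipeline([
        ("features", combined_features),
        ("svm", SVC(kernel="linear")),
    ])
    pipeline.fit(X, y)
    print(pipeline.score(X, y))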

Comparing classifiers

There are lots of classifiers in scikit-learn that are ready to use, and the one you choose for a particular task will depend on a variety of factors. You can compare the F1 scores to see which method performs better, and you can investigate the deviation of those scores to check whether the difference is statistically significant.

An important requirement is that the classifiers are trained and tested on the same data; that is, the test set for one classifier is the test set for all classifiers. Our use of random states ensures that this is the case, which is also important for replicating experiments.
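
A minimal sketch of such a comparison, using cross-validation with a fixed random state so that every classifier is evaluated on exactly the same splits (the dataset and classifiers chosen here are illustrative; older scikit-learn releases expose the same functionality in the cross_validation module rather than model_selection):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=14)

    # A fixed random state means each classifier sees identical splits
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=14)

    for clf in [KNeighborsClassifier(), DecisionTreeClassifier(random_state=14)]:
        scores = cross_val_score(clf, X, y, scoring='f1', cv=cv)
        # The mean shows which method scores higher; the standard deviation
        # gives a sense of whether the difference is meaningful
        print("{}: {:.3f} +/- {:.3f}".format(
            type(clf).__name__, scores.mean(), scores.std()))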