Machine learning is a subfield of artificial intelligence that explores how machines can learn from data to analyze structures, help with decisions, and make predictions. In 1959, Arthur Samuel defined machine learning as the "field of study that gives computers the ability to learn without being explicitly programmed."
Each sample might consist of a single value or multiple values. In the context of machine learning, the properties of the data are called features.
Machine learning tasks can be categorized by the nature of the input data:
In supervised learning, the input data (typically denoted with x) is associated with a target label (y), whereas in unsupervised learning, we only have unlabeled input data.
Supervised learning can be further broken down into the following problems:
Classification problems have a fixed set of target labels, classes, or categories, while regression problems have one or more continuous output variables. Classifying e-mail messages as spam or not spam is a classification task with two target labels. Predicting house prices—given the data about houses, such as size, age, and nitric oxides concentration—is a regression task, since the price is continuous.
The scikit-learn library is organized into submodules. Each submodule contains algorithms and helper methods for a certain class of machine learning models and approaches.
Here is a sample of those submodules, including some example models:
Submodule | Description | Example models
---|---|---
cluster | Unsupervised clustering | KMeans and Ward
decomposition | Dimensionality reduction | PCA and NMF
ensemble | Ensemble-based methods | AdaBoostClassifier, AdaBoostRegressor, RandomForestClassifier, RandomForestRegressor
lda | Linear discriminant analysis | LDA
linear_model | Generalized linear models | LinearRegression, LogisticRegression, Lasso, and Perceptron
mixture | Mixture models | GMM and VBGMM
naive_bayes | Supervised learning based on Bayes' theorem | BaseNB, BernoulliNB, and GaussianNB
neighbors | k-nearest neighbors | KNeighborsClassifier, KNeighborsRegressor, LSHForest
neural_network | Models based on neural networks | BernoulliRBM
tree | Decision trees | DecisionTreeClassifier, DecisionTreeRegressor
While these approaches are diverse, the scikit-learn library abstracts away many of the differences by exposing a regular interface to most of these algorithms. All of the example algorithms listed in the table implement a fit method, and most of them implement predict as well. These methods represent the two phases of machine learning: first, the model is trained on the existing data with the fit method; once trained, the model can be used to predict the class or value of unseen data with predict. We will see both methods at work in the next sections.
In contrast to the heterogeneous domains and applications of machine learning, the data representation in scikit-learn is less diverse, and the basic format that many algorithms expect is straightforward—a matrix of samples and features.
There is something like a Hello World in the world of machine learning datasets as well: the Iris dataset, whose origins date back to 1936. With the standard installation of scikit-learn, you already have access to a couple of datasets, including Iris, which consists of 150 samples, each with four measurements taken from one of three different Iris flower species:
The dataset is packaged as a bunch, which is only a thin wrapper around a dictionary:
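A minimal sketch of loading the dataset and verifying that the bunch behaves like a dictionary (load_iris is scikit-learn's loader for the bundled data):

```python
# Load the bundled Iris dataset; the returned Bunch is a dict subclass
# that additionally allows attribute access to its keys.
from sklearn.datasets import load_iris

iris = load_iris()
print("data" in iris)             # dictionary-style membership test works
print(iris["data"] is iris.data)  # key access and attribute access agree
```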
Under the data key, we can find the matrix of samples and features, and can confirm its shape:
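For example, a quick sketch:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# 150 samples (rows) with 4 features (columns) each
print(iris.data.shape)  # (150, 4)
```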
The target labels are encoded as integers. We can look up the corresponding names in the target_names attribute:
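A short sketch of both attributes:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.target[:5])    # encoded labels: integers 0, 1, or 2
print(iris.target_names)  # the species name behind each integer
```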
This is the basic anatomy of many datasets: example data, target values, and target names.
What are the features of a single entry in this dataset?
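We can inspect the first sample and the names of its features; a sketch:

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data[0])        # the four measurements of the first sample
print(iris.feature_names)  # what each of the four values means
```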
The four features are the measurements taken of real flowers: their sepal length and width, and petal length and width. Three different species have been examined: the Iris-Setosa, Iris-Versicolour, and Iris-Virginica.
Machine learning tries to answer the following question: can we predict the species of the flower, given only the measurements of its sepal and petal length?
In the next section, we will see how to answer this question with scikit-learn.
If you work with your own datasets, you will have to bring them into the shape that scikit-learn expects, which can be a task of its own. Tools such as Pandas make this task much easier, and Pandas DataFrames can be exported to numpy.ndarray easily with the as_matrix() method on DataFrame.
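A small sketch of this conversion (the column names here are made up; note that as_matrix() has since been removed from pandas, and current releases spell it DataFrame.to_numpy()):

```python
import pandas as pd

# A toy DataFrame standing in for your own dataset.
df = pd.DataFrame({"size": [50.0, 80.0, 120.0],
                   "age": [30, 10, 5]})

# Older pandas: X = df.as_matrix(); current pandas uses to_numpy().
X = df.to_numpy()
print(X.shape)  # (3, 2): one row per sample, one column per feature
```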
In this section, we will show short examples for both classification and regression.
Classification problems are pervasive: document categorization, fraud detection, market segmentation in business intelligence, and protein function prediction in bioinformatics.
We see that the petal length (the third feature) exhibits the biggest variance, which could indicate the importance of this feature during classification. It is also insightful to plot the data points in two dimensions, using one feature for each axis. Indeed, this reinforces our earlier observation that the petal length might be a good indicator for telling the various species apart. The Iris setosa also seems to be more easily separable than the other two species:
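The per-feature variances can be computed directly from the data matrix; a quick sketch:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
variances = iris.data.var(axis=0)  # variance of each of the four features
print(variances)
# The largest variance belongs to the third feature, petal length.
print(iris.feature_names[np.argmax(variances)])
```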
From the visualizations, we get an intuition of the solution to our problem. We will use a supervised method called a Support Vector Machine (SVM) to learn about a classifier for the Iris data. The API separates models and data, therefore, the first step is to instantiate the model. In this case, we pass an optional keyword parameter to be able to query the model for probabilities later:
The next step is to fit the model according to our training data:
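Put together, instantiating and fitting the classifier might look like this sketch (training on the full Iris data for simplicity):

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

# probability=True is the optional keyword that lets us query
# class probabilities later via predict_proba.
clf = SVC(probability=True)
clf.fit(X, y)
print(clf.score(X, y))  # accuracy on the training data
```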
With this one line, we have trained our first machine learning model on a dataset. This model can now be used to predict the species of unknown data. If given some measurement that we have never seen before, we can use the predict method on the model:
In fact, the classifier is relatively sure about this label, which we can inquire into by using the predict_proba method on the classifier:
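A sketch of both calls; the measurement below is invented purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
clf = SVC(probability=True).fit(iris.data, iris.target)

# A made-up, previously unseen flower:
# sepal length, sepal width, petal length, petal width.
unseen = [[6.0, 2.0, 3.0, 2.0]]
label = clf.predict(unseen)[0]
print(iris.target_names[label])    # predicted species
print(clf.predict_proba(unseen))   # one probability per class, summing to 1
```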
Our example consisted of four features, but many problems deal with higher-dimensional datasets and many algorithms work fine on these datasets as well.
We first create a sample dataset as follows:
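One way to build such an artificial dataset (the exact numbers here are illustrative, not from the original):

```python
import numpy as np

rng = np.random.RandomState(0)
X = np.arange(0, 10, 0.5).reshape(-1, 1)        # a single input feature
noise = rng.normal(scale=1.0, size=X.shape[0])
y = 3.0 * X.ravel() + 2.0 + noise               # a noisy linear relationship
print(X.shape, y.shape)
```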
>>> from sklearn.linear_model import LinearRegression
>>> clf = LinearRegression()
>>> clf.fit(X, y)
We can plot the prediction over our data as well:
The output of the plot is as follows:
The above graph is a simple example with artificial data, but linear regression has a wide range of applications. Given the characteristics of real estate objects, we can learn to predict prices. Given the features of galaxies, such as size, color, or brightness, it is possible to predict their distance. Given data about household income and the education level of parents, we can say something about the grades of their children.
A lot of existing data is not labeled. It is still possible to learn from data without labels with unsupervised models. A typical task during exploratory data analysis is to find related items or clusters. We can imagine the Iris dataset, but without the labels:
While the task seems much harder without labels, one group of measurements (in the lower-left) seems to stand apart. The goal of clustering algorithms is to identify these groups.
For example, we instantiate the KMeans model with n_clusters equal to 3:
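A sketch of this step, clustering the unlabeled Iris measurements (random_state is fixed here only for reproducibility):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
km = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = km.fit_predict(iris.data)  # one cluster index per sample
print(clusters[:10])
```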
We can already compare the result of these algorithms with our known target labels:
We quickly relabel the result to simplify the prediction error calculation:
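One simple relabeling scheme (an assumption on our part; the original does not spell it out) is to map each cluster to the most frequent true label among its members, then count the disagreements:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# Assign each cluster the majority true label of its members.
relabeled = np.empty_like(pred)
for cluster in np.unique(pred):
    mask = pred == cluster
    relabeled[mask] = np.bincount(iris.target[mask]).argmax()

error = np.mean(relabeled != iris.target)
print(error)  # fraction of samples placed in the "wrong" cluster
```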
As another example of an unsupervised algorithm, we will take a look at Principal Component Analysis (PCA). PCA aims to find the directions of maximum variance in high-dimensional data. One goal is to reduce the number of dimensions by projecting a higher-dimensional space onto a lower-dimensional subspace while keeping most of the information.
The problem appears in various fields. You have collected many samples and each sample consists of hundreds or thousands of features. Not all the properties of the phenomenon at hand will be equally important. In our Iris dataset, we saw that the petal length alone seemed to be a good discriminator of the various species. PCA aims to find principal components that explain most of the variation in the data. If we sort our components accordingly (technically, we sort the eigenvectors of the covariance matrix by eigenvalue), we can keep the ones that explain most of the data and ignore the remaining ones, thereby reducing the dimensionality of the data.
The process is similar to the ones we have implemented so far. First, we instantiate our model; this time, the PCA from the decomposition submodule. We also import a standardization method, called StandardScaler, that will remove the mean from our data and scale it to unit variance. This step is a common requirement for many machine learning algorithms:
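A sketch of both steps on the Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()

# Standardize to zero mean and unit variance, then keep two components.
X_scaled = StandardScaler().fit_transform(iris.data)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
```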
Dimensionality reduction is just one way to deal with high-dimensional datasets, which are sometimes affected by the so-called curse of dimensionality.
We have already seen that the machine learning process consists of the following steps:
- Model selection: We first select a suitable model for our data. Do we have labels? How many samples are available? Is the data separable? How many dimensions do we have? As this step is nontrivial, the choice will depend on the actual problem. As of Fall 2015, the scikit-learn documentation contains a much appreciated flowchart called choosing the right estimator. It is short, but very informative and worth taking a closer look at.
- Training: We have to bring the model and data together, and this usually happens in the fit methods of the models in scikit-learn.
- Application: Once we have trained our model, we are able to make predictions about the unseen data.
Whether or not a model generalizes well can also be tested. However, it is important that the training and the test input are separate. The situation where a model performs well on a training input but fails on an unseen test input is called overfitting, and this is not uncommon.
Cross-validation (CV) is a technique that does not need a separate validation set, but still counteracts overfitting. The dataset is split into k parts (called folds). For each fold, the model is trained on the other k-1 folds and tested on the remaining fold. The accuracy is taken as the average over the folds.
We will show a five-fold cross-validation on the Iris dataset, using SVC again:
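A sketch of the procedure with cross_val_score (at the time of writing, this helper lived in sklearn.cross_validation; current releases import it from sklearn.model_selection):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = load_iris()
scores = cross_val_score(SVC(), iris.data, iris.target, cv=5)
print(scores)         # one accuracy value per fold
print(scores.mean())  # the averaged accuracy
```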
There are various strategies implemented by different classes to split the dataset for cross-validation: KFold, StratifiedKFold, LeaveOneOut, LeavePOut, LeaveOneLabelOut, LeavePLabelOut, ShuffleSplit, StratifiedShuffleSplit, and PredefinedSplit.
The most interesting applications will be found in your own field. However, if you would like to get some inspiration, we recommend that you look at the www.kaggle.com website that runs predictive modeling and analytics competitions, which are both fun and insightful.
Practice exercises
Are the following problems supervised or unsupervised? Are they regression or classification problems?
- Recognizing coins inside a vending machine
- Recognizing handwritten digits
- If given a number of facts about people and the economy, we want to estimate consumer spending
- If given data about geography, politics, and historical events, we want to predict when and where a human rights violation will eventually take place
- If given the sounds of whales and their species, we want to label yet unlabeled whale sound recordings
Look up one of the first machine learning models and algorithms: the perceptron. Try the perceptron on the Iris dataset and estimate the accuracy of the model. How does the perceptron compare to the SVC from this chapter?