Why use F#?
F# is an open source, functionalfirst, general purpose programming language and is particularly suitable for developing mathematical models that are an integral part of machine learning algorithm development.
Code written in F# is generally very expressive and is close to its actual algorithm description. That's why you shall see more and more mathematically inclined domains adopting F#.
At every stage of a machine learning activity, F# has a feature or an API to help. Following are the major steps in a machine learning activity:
Major step in machine learning activity 
How F# can help 

Data Acquisition 
F# type providers are great at it. (Refer to http://blogs.msdn.com/b/dsyme/archive/2013/01/30/twelvetypeprovidersinpictures.aspx) F# can help you get the data from the following resources using F# type providers:

Data Scrubbing/Data Cleansing 
F# list comprehensions are perfect for this task. Deedle (http://bluemountaincapital.github.io/Deedle/) is an API written in F#, primarily for exploratory data analysis. This framework also has lot of features that can help in the data cleansing phase. 
Learning the Model 
WekaSharp is an F# wrapper on top of Weka to help with machine learning tasks such as regression, clustering, and so on. Accord.NET is a massive framework for performing a very diverse set of machine learning. 
Data Visualization 
F# charts are very interactive and intuitive to easily generate high quality charts. Also, there are several APIs, such as FsPlot, that take the pain of conforming to standards when it comes to plugging data to visualization. 
F# has a way to name a variable the way you want if you wrap it with double back quotes like—"my variable". This feature can make the code much more readable.
Supervised machine learning
Supervised machine learning algorithms are mostly broadly classified into two major categories: classification and regression.
Supervised machine learning algorithms work with labeled datasets. This means that the algorithm takes a lot of labeled data sets, where the data represents the instance and the label represents the class of the data. Sometimes these labels are finite in number and sometimes they are continuous numbers. When the labels belong to a finite set, then the problem of identifying the label of an unknown/new instance is known as a classification problem. On the other hand, if the label is a continuous number, then the problem of finding the continuous value for a new instance is known as a regression problem. Given a set of records for cancer patients, with test results and labels (B for benign and M for malignant) predicting whether a new patient's case is B or M, is a classification problem. On the other hand, predicting the price of a house, given the area in square feet and the number of bedrooms in the house, is an example of a regression problem.
I found the following analogy to geometry very useful when thinking about these algorithms.
Let's say you have two points in 2D. You can calculate the Euclidean distance between those two and if that distance is small, you can conclude that those points are close to each other. In other words, if those two points represent two cities in a country, you might conclude that they are in the same district.
Now if you extrapolate this theory to the N dimension, you can immediately see that any measurement can be represented as a point with the N dimension or as a vector of size N and a label can be associated with it. Then an algorithm can be deployed to learn the associativity or the pattern, and thus it learns to predict the label for an unseen/unknown/new instance represented in the similar format.
Training and test dataset/corpus
The phase when an algorithm runs over a labeled data set is known as training, and the labeled data is known as training dataset. Sometimes it is loosely referred to as training corpus. Later in the process, the algorithm must be tested with similar unlabeled datasets or for which the label is hidden from the algorithm. This dataset is known as test dataset or test corpus. Typically, an 8020 split is used to generate the training and test set from a randomly shuffled labeled data set. This means that 80% of the randomly shuffled labeled data is generally treated as training data and the remaining 20% as test data.
Some motivating real life examples of supervised learning
Supervised learning algorithms have several applications. Following are some of those. This is by no means a comprehensive list, but it is indicative.
Classification
Spam filtering in your mailbox
Cancer prediction from the previous patient records
Identifying objects in images/videos
Identifying flowers from measurements
Identifying hand written digits on cheques
Predicting whether there will be a traffic jam in a city
Making recommendations to the users based on their and similar user's preferences
Regression
Predicting the price of houses based on several features, such as the number of bedrooms in the house
Finding causeeffect relationships between several variables
Supervised learning algorithms
Nearest Neighbor algorithm
Support Vector Machine
Decision Tree
Linear Regression
Logistic Regression
Naïve Bayes Classifier
Neural Networks
Recommender systems algorithms
Anomaly Detection
Sentiment Analysis
In the next few sections, I will walk you through the overview of a few of these algorithms and their mathematical basis. However, we will get away with as minimal math as possible, since the objective of the book is to help you use machine learning in the real settings.
Nearest Neighbour algorithm (a.k.a kNN algorithm)
As the name suggests, kNearest Neighbor is an algorithm that works on the distance between two projected points. It relies on the distance of knearest neighbors (thus the name) to determine the class/category of the unknown/new test data.
As the name suggests the nearest neighbor algorithm relies on the distance of two data points projected in NDimensional space. Let's take a popular example where the kNN can be used to determine the class. The dataset https://archive.ics.uci.edu/ml/machinelearningdatabases/breastcancerwisconsin/wdbc.data stores data about several patients who were either unfortunate and diagnosed as "Malignant" cases (which are represented as M in the dataset), or were fortunate and diagnosed as "Benign" (nonharmful/noncancerous) cases (which are represented as B in the dataset). If you want to understand what all the other fields mean, take a look at https://archive.ics.uci.edu/ml/machinelearningdatabases/breastcancerwisconsin/wdbc.names.
Now the question is, given a new entry with all the other records except the tag M or B, can we predict that? In ML terminology, this value "M" or "B" is sometimes referred to as "class tag" or just "class". The task of a classification algorithm is to determine this class for a new data point. KNN does this in the following way: it measures the distance from the given data to all the training data and then takes into consideration the classes for only the knearest neighbors to determine the class of the new entry. So for the current case, if more than 50% of the knearest neighbors is of class "B", then kNN will conclude that the new entry is of type "B".
Distance metrics
The distance metric used is generally Euclidean, that you learnt in high school. For example, given two points in 3D.
In this preceding example, and denote their values in the X axis, and denote their values in the Y axis, and and denote their values in the axis.
Extrapolating this, we get the following formula for calculating the distance in N dimension:
Thus, after calculating the distance from all the training set data, we can create a list of tuples with the distance and the class, as follows. This list is made for the sake of demonstration. This is not calculated from the actual data.
Distance from test/new data 
Class/Tag/Category 

0.34235 
B 
0.45343 
B 
1.34233 
B 
6.23433 
M 
66.3435 
M 
Let's assume that k is set to be 4. Now for each k, we take into consideration the class. So for the first three entries, we found that the class is B and for the last one, it is M. Since the number of B's is more than the number of M's, kNN will conclude that the new patient's data is of type B.
Decision tree algorithms
Have you ever played the game where you had to guess about a thing that your friend had been thinking about by asking questions? And you were allowed to guess only a certain number of times and had to get back to your friend with your answer about what he/she could probably be thinking about.
The strategy to guess the correct answer is to start asking questions that segregate the possible answer space as evenly as possible. For example, if your friend told you that he/she had imagined about something, then probably the first question you would like to ask him/her is that whether he/she is thinking about an animal or a thing. That would broadly classify the answer space and then later you can ask more direct/specific questions based on the answers previously provided by your friend.
Decision tree is a set of classification algorithm that uses this approach to determine the class of an unknown entry. As the name suggests, a decision tree is a tree where the nodes depict the questions asked and the edges represent the decisions (yes/no). Leaf nodes determine the final class of the unknown entry. Following is a classic textbook example of a decision tree:
The preceding figure depicts the decision whether we can play lawn tennis or not, based on several attributes such as Outlook, Humidity, and Wind. Now the question that you may have is why outlook was chosen as the root node of the tree. The reason was that by choosing outlook as the first feature/attribute to split the dataset, the outcomes were split more evenly than if the split had been done with other attributes such as "humidity" or "wind".
The process of finding the attribute that can split the dataset more evenly than others is guided by entropy. Lesser the entropy, better the parameter. Entropy is known as the measure of information gain. It is calculated by the following formula:
Here stands for the probability of and denotes the information gain.
Let's take the example of tennis dataset from Weka. Following is the file in the CSV format:
outlook,temperature,humidity,wind,playTennis
sunny, hot, high, weak, no
sunny, hot, high, strong, no
overcast, hot, high, weak, yes
rain, mild, high, weak, yes
rain,cool, normal, weak, yes
rain, cool, normal, strong, no
overcast, cool, normal, strong, yes
sunny, mild, high, weak, no
sunny, cool, normal, weak, yes
rain, mild, normal, weak, yes
sunny, mild, normal, strong, yes
overcast, mild, high, strong, yes
overcast, hot, normal, weak, yes
rain, mild, high, strong, no
You can see from the dataset that out of 14 instances (there are 14 rows in the file), 5 instances had the value no
for playTennis
and 9 instances had the value yes
. Thus, the overall information is given by the following formula:
This roughly evaluates to 0.94. Now from the next steps, we must pick the attribute that maximizes the information gain. Information gain is denoted as the difference between the total entropy and the entropy calculated for each possible split.
Let's go with one example. For the outlook attribute, there are three possible values: rain
, sunny
, and overcast
, and for each of these values, the value of the attribute playTennis
is either no
or yes
.
For rain
, out of 5
instances, 3
instances have the value yes
for the attribute playTennis
; thus, the entropy is as follows:
This is equal to 0.97.
For overcast
, every instance has the value yes
:
This is equal to 0.0.
For sunny
, out of 5
instances, only 2
have the value yes
:
So the expected new entropy is given by the following formula:
This is roughly equal to 0.69. If you follow these steps for the other attributes, you will find that the new entropies are like as follows:
Attribute 
Entropy 
Information gain 

outlook 
0.69 
0.94 – 0.69 => 0.25 
temperature 
0.91 
0.94 – 0.91 => 0.03 
humidity 
0.724 
0.94 – 0.725 => 0.215 
windy 
0.87 
0.94 – 0.87 => 0.07 
So the highest information gain is attained if we split the dataset based on the outlook attribute.
Sometimes multiple trees are constructed by generating a random subset of all the available features. This technique is known as random forest.
Linear regression
Regression is used to predict the target value of the real valued variable. For example, let's say we have data about the number of bedrooms and the total area of many houses in a locality. We also have their prices listed as follows:
Number of Bedrooms 
Total Area in square feet 
Price 

2 
1150 
2300000 
3 
2500 
5600000 
3 
1780 
4571030 
4 
3000 
9000000 
Now let's say we have this data in a real estate site's database and we want to create a feature to predict the price of a new house with three bedrooms and total area of 1650 square feet.
Linear regression is used to solve these types of problems. As you can see, these types of problems are pretty common.
In linear regression, you start with a model where you represent the target variable—the variable for which you want to predict the value. A polynomial model is selected that minimizes the least square error (this will be explained later in the chapter). Let me walk you through this example.
Each row of the available data can be represented as a tuple where the first few elements represent the value of the known/input parameters and the last parameter shows the value of the price (the target variable). So taking inspiration from mathematics, we can represent the unknown with and known as . Thus, each row can be represented as where to represent the parameters (the total area and the number of bedrooms) and represents the target value (the price of the house). Linear regression works on a model where y is represented with the x values.
The hypothesis is represented by an equation as the following. Here and theta denotes the input parameters (the number of bedrooms and the total area in square feet) and represents the predicted value of the new house.
Note that this hypothesis is still a polynomial model and we are just using two features: the number of bedrooms and the total area represented by and .
So the square error is calculated by the following formula:
The task of linear regression is to choose a set of values for the coefficients which minimizes this error. The algorithm that minimizes this error is called gradient descent or batch gradient descent. You will learn more about it in Chapter 2, Linear Regression.
Logistic regression
Unlike linear regression, logistic regression predicts a Boolean value indicating the class/tag/category of the target variable. Logistic regression is one of the most popular binary classifiers and is modelled by the equation that follows. and stands for the independent input variables and their classes/tags respectively. Logistic regression is discussed at length in Chapter 3, Classification Techniques.
Recommender systems
Whenever you buy something from the web (say Amazon), it recommends you stuff that you might find interesting and might eventually buy as well. This is the result of recommender system. Let's take the following example of a movie rating:
Movie 
Bob 
Lucy 
Jane 
Jennifer 
Jacob 

Paper Towns 
1 
3 
4 
2 
1 
Focus 
2 
5 
? 
3 
2 
Cinderella 
2 
? 
4 
2 
3 
Jurrasic World 
3 
1 
4 
5 
? 
Die Hard 
5 
? 
4 
5 
5 
So in this toy example, we have 5 users and they have rated 5 movies. But not all the users have rated all the movies. For example, Jane hasn't rated "Focus" and Jacob hasn't rated "Jurassic World". The task of a recommender system is to initially guess what would be the ratings for the movies that aren't rated by the user and then recommend movies that have a guessed rating which is beyond a threshold (say 3).
There are several algorithms to solve this problem. One popular algorithm is known as collaborative filtering where the algorithm takes clues from the other user ratings. You will learn more about this in Chapter 5, Collaborative Filtering.