-
Book Overview & Buying
-
Table Of Contents
Practical Machine Learning with R
By :
The machine learning process is a sequence of activities performed to deploy a successful model for prediction. A few steps here are iterative and can be repeated based on the outcomes of the previous and the following steps. To train a good model, we must clean our data and understand it well from the business perspective. Using our understanding, we can generate appropriate features. The performance of the model depends on the goodness of the features.
A sample model building process follows these steps:
The model building process consists of obtaining the right data, processing it to derive features, and then choosing the right model. Based on the type of output to be predicted, we need to choose a model. For example, if the output is categorical, a decision tree can be chosen; if the output is numerical, a neural network can be chosen. However, decision trees also can be used for regression and neural networks can be used for classification. Choosing the right data means using data related to the output of our problem. For example, if we are predicting the value of a house, then information such as the location and size are all data that is highly correlated with the predicted value, hence it is of high value. Gathering and deriving good features can make a huge difference.
The raw data refers to the unprocessed data that is relevant for the machine learning problem. For instance, if the problem statement is to predict the value of stocks, then the data constituting the characteristics of a stock and the company profile data may be relevant to the prediction of the stock value; therefore, this data is known as the raw data.
The pre-processing step involves:
The data cleaning step refers to activities such as handling missing values in the data, handling incorrect values, and so on. During the feature selection process, features that correlate with the output or otherwise are termed important are selected for the purpose of modeling. Additional meaningful features that have high correlation with the output to be predicted can also be derived; this helps to improve the model performance.
The data can be split into an 80%-20% ratio, where 80% of the data is used to train the model and 20% of the data is used to test the model. The accuracy of the model on the test data is used as an indicator of the performance of the model.
Therefore, data is split as follows:
The training phase involves the following:
A machine learning model is selected for the purpose of training. The performance of the model is measured in the form of precision, recall, a confusion matrix, and so on. The performance is analyzed, and the parameters of the model are tuned to improve the performance.
The following are some of the metrics used for the evaluation of a machine learning model:
Accuracy is defined as the % of correct predictions made.
Accuracy = Number of correct predictions/Total number of predictions
Let's consider a classification problem that has an output value of either positive or negative. The confusion matrix provides four types of statistics in the form of a matrix for this classification problem.
Let's take an example of patients diagnosed with diabetes. Here, the model has been trained to predict whether a patient has diabetes or not. The actual class means the actual lab result for the person. The predicted class is the predicted result from the trained model.
Precision = True positives/(True positives + False positives)
Recall = True positives/(True positives + False negatives)
The difference between the original value and the predicted value is known as error. The average of the absolute square of the errors is known as the mean squared error.
The difference between the original value and the predicted value is known as error. The average of the errors is known as the mean absolute error.
Root Mean Squared Error (RMSE) is the square root of the mean squared difference between the model predictions and the actual output values.
The Receiver Operating Characteristic (ROC) curve is a visual measure of performance for a classification problem. The curve plots true positive rate to the false positive rate.
This measures the amount of variation in the output variable, which can be explained by the input variables. The greater the value, the more is the variation of output variable by the input variable.
This is used for multiple variable regression problems. It is similar to R-squared, but if the input variable added does does not improve the model's predictions, then the value of adjusted R-squared decreases.
This is also the prediction phase. This is the stage where the model with optimal performance is chosen for deployment. This model will then be used on real data. The performance of the model has to be monitored and the model has to be retrained at regular intervals if required so that the prediction can be made with better accuracy for the future data.
Imagine that you have to create a machine learning model to predict the value/price of the house by training a machine learning model. The process to do this is as follows:
The process is now clear. Let's move on to R programming.
Change the font size
Change margin width
Change background colour