In this chapter, a complete ML pipeline was implemented: collecting historical data, transforming it into a format suitable for testing hypotheses, training ML models, and running predictions on live data, with the option to evaluate many different models and select the best one.
The test results showed that, as in the original dataset, about 600,000 minutes out of 2.4 million (roughly 25%) can be classified as increasing price (the close price was higher than the open price), so the dataset can be considered imbalanced. Although random forests usually perform well on imbalanced datasets, an area under the ROC curve of 0.74 is not ideal. Because we need fewer false positives (fewer cases where we trigger a purchase and the price then drops), we might consider penalizing the model more strictly for such errors.
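One simple way to penalize false positives, without retraining the model, is to raise the decision threshold until precision reaches an acceptable level. The sketch below is only an illustration of that idea, not the approach used in this chapter: the function names, the toy scores, and the toy labels are all hypothetical, and in practice the scores would be the classifier's predicted probabilities of a price increase on a held-out set.

```python
def precision_at_threshold(scores, labels, threshold):
    """Precision of the rule 'predict a price increase when score >= threshold'.

    Returns None when no sample is predicted positive at this threshold.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return tp / (tp + fp) if (tp + fp) else None


def pick_threshold(scores, labels, min_precision):
    """Return the lowest observed score that, used as a threshold,
    gives precision >= min_precision, or None if no threshold qualifies."""
    for t in sorted(set(scores)):
        p = precision_at_threshold(scores, labels, t)
        if p is not None and p >= min_precision:
            return t
    return None


# Hypothetical toy data: predicted probabilities and true labels
# (1 = close price higher than open price, 0 = otherwise).
toy_scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1]

# Share of positive minutes reported in the text: 600,000 of 2.4 million.
positive_share = 600_000 / 2_400_000

chosen = pick_threshold(toy_scores, toy_labels, min_precision=0.9)
print(f"positive share: {positive_share:.2f}, chosen threshold: {chosen}")
```

A stricter threshold trades recall for precision: fewer purchases are triggered, but a larger share of them are followed by a price increase. Class weighting or a cost-sensitive loss during training are alternative ways to express the same preference.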
Although the results achieved by the classifiers can't be used for profitable trading, they provide a foundation on which new approaches can be tested relatively quickly. Here...