I covered the basics of the MLlib library in the previous chapter, but MLlib, at least at the time of writing this book, is more like a fast-moving target that is gaining the lead rather than a well-structured implementation that everyone uses in production or even has a consistent and tested documentation. In this situation, as people say, rather than giving you the fish, I will try to focus on well-established concepts behind the libraries and teach the process of fishing in this book in order to avoid the need to drastically modify the chapters with each new MLlib release. For better or worse, this increasingly seems to be a skill that a data scientist needs to possess.
Statistics and machine learning inherently deal with uncertainty, due to one or another reason we covered in Chapter 2, Data Pipelines and Modeling. While some datasets might be completely random, the goal here is to find trends, structure, and patterns beyond what a random...