In this section, we turn our focus to feature extraction, that is, developing new features or variables from the features or information already available in our working datasets. We will also discuss some of Apache Spark's special capabilities for feature extraction, as well as some related feature solutions made easy with Spark.
After this section, we will be able to develop and organize features for various machine learning projects.
In most big data machine learning projects, we cannot use the large datasets we collect immediately. For example, web log data is often very messy when we take it in, frequently arriving as a collection of seemingly random text from which we need to extract useful information and draw out features ready for machine learning. For instance, we may need to extract the number of clicks and the number of impressions from web log data, a task for which many text mining tools and algorithms are readily available.
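To make the web log example concrete, here is a minimal sketch in plain Python of turning raw log text into click and impression counts per user. The log format, the field names (`user`, `event`, `ad`), and the function `extract_features` are assumptions made purely for illustration; real web logs vary widely, and the parsing pattern would need to match the actual layout.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption for this sketch):
#   "<timestamp> user=<id> event=<click|impression> ad=<id>"
LOG_LINES = [
    "2015-03-01T10:00:01 user=u1 event=impression ad=a7",
    "2015-03-01T10:00:05 user=u1 event=click ad=a7",
    "2015-03-01T10:01:12 user=u2 event=impression ad=a7",
    "garbled line with no fields",
    "2015-03-01T10:02:40 user=u1 event=impression ad=a9",
]

# Capture the user id and the event type; anything else in the line is ignored.
LINE_PATTERN = re.compile(
    r"user=(?P<user>\S+)\s+event=(?P<event>click|impression)\b"
)


def extract_features(lines):
    """Return {user: {"clicks": n, "impressions": m}} from raw log lines."""
    counts = {}
    for line in lines:
        match = LINE_PATTERN.search(line)
        if match is None:
            # Skip lines that do not fit the assumed format.
            continue
        user_counts = counts.setdefault(match.group("user"), Counter())
        user_counts[match.group("event")] += 1
    # A Counter returns 0 for event types a user never produced.
    return {
        user: {"clicks": c["click"], "impressions": c["impression"]}
        for user, c in counts.items()
    }


features = extract_features(LOG_LINES)
for user, row in sorted(features.items()):
    print(user, row)
```

The resulting per-user counts are the kind of structured features a machine learning algorithm can consume, whereas the raw text lines are not. On a large dataset, the same parse-then-count logic would typically be distributed across a cluster rather than run in a single loop.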
With any...