Summary
Hadoop is a very useful tool for big data transformation and processing, and it can come in handy at almost every stage of the data analytics workflow. Data analytics is less about the algorithms and more about the data: larger datasets can yield almost two-fold improvements in prediction. A data scientist should worry more about cleansing, transformation, feature engineering, and validation of results than about the actual algorithm used for the analysis. This does not mean that the choice of analysis algorithm is unimportant; rather, other factors are equally important and vital for sound decision making.
The key takeaways from this chapter are as follows:
Hadoop is generally used for analytics on data sizes of 1 TB and above. However, the ease of use brought about by functional programming concepts in Hadoop tempts people to use it for smaller data sizes. There is nothing wrong with this approach as long as they are cognizant of the fact...