In this chapter, we will present the main features of a data processing architecture and the Cloudera platform distribution. Then, we will explore how to use a distributed file system and how to manage files from the terminal and through a web interface. Finally, we will describe the use of Apache Spark, an open source big data processing framework built with the goal of being fast and easy to use. Apache Spark provides a unified framework for big data processing requirements such as data streaming, machine learning, and analytics.
In this chapter, we will cover these topics:
Understanding data processing
Platform for data processing
An introduction to the distributed file system
An introduction to Apache Spark
Understanding data processing
Since the first edition of this book in 2013, there have been big changes in the data-driven scene. With the emergence of buzzwords such as big data, data science, and deep learning...