If you have worked in big data, you most likely already know what Apache Spark is and can skip this section. If you don't, don't worry; we'll go through the basics.
Spark is a powerful, fast, and scalable real-time data analytics engine for large-scale data processing. It's an open source framework that was initially developed at UC Berkeley's AMPLab around 2009. Around 2013, AMPLab contributed Spark to the Apache Software Foundation, and the Apache Spark community released Spark 1.0 in 2014.
The community continues to make regular releases and bring new features into the project. At the time of writing, Apache Spark 2.4.0 has been released, and the project has an active community on GitHub. Spark allows you to distribute programs across a cluster of machines.
The beauty of Spark lies in the fact that it's scalable: it runs on top of a cluster manager, allowing you to use scripts written in Python...