Apache Spark 2 for Beginners

By: Rajanarayanan Thottuvaikkatumana

Overview of this book

Spark is one of the most widely used large-scale data processing engines and runs extremely fast. It is a framework with tools that are equally useful to application developers and data scientists.

This book starts with the fundamentals of Spark 2 and covers the core data processing framework and API, installation, and application development setup. Then the Spark programming model is introduced through real-world examples, followed by Spark SQL programming with DataFrames. An introduction to SparkR is covered next. Later, we cover the charting and plotting features of Python in conjunction with Spark data processing. After that, we take a look at Spark's stream processing, machine learning, and graph processing libraries. The last chapter combines all the skills you learned from the preceding chapters to develop a real-world Spark application.

By the end of this book, you will have all the knowledge you need to develop efficient large-scale applications using Apache Spark.
Table of Contents (15 chapters)
Apache Spark 2 for Beginners
Credits
About the Author
About the Reviewer
www.PacktPub.com
Preface

Creating RDDs from files


So far, the discussion has focused on RDD functionality and programming with RDDs. In all the preceding use cases, the RDDs were created from collection objects. In real-world use cases, however, the data will come from files stored in the local filesystem or in HDFS. Quite often, the data will also come from NoSQL data stores such as Cassandra. It is possible to create RDDs by reading the contents from these data sources. Once an RDD is created, all the operations are uniform, as shown in the preceding use cases. The data files coming from the filesystems may be fixed-width, comma-separated, or in any other format, but the common pattern for reading such files is to read the data line by line and split each line to obtain the necessary separation of data items. For data coming from other sources, the appropriate Spark connector program and its corresponding API for reading data are to be used.
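The following is a minimal sketch of this line-by-line reading pattern in the Spark Scala shell; the file path, the comma separator, and the variable names are assumptions made purely for illustration:

// Read a text file into an RDD, one element per line (path is hypothetical)
val acTransList = sc.textFile("/data/trans.csv")
// Split each comma-separated line into its constituent fields
val acTransRDD = acTransList.map(line => line.split(","))
// Bring the records to the driver and print the fields of each record
acTransRDD.collect().foreach(fields => println(fields.mkString(" | ")))

The same pattern applies when the path points to an HDFS location; only the URI passed to textFile changes, while the subsequent transformations remain identical.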

Many third-party libraries are...