Apache Spark Quick Start Guide

By: Shrey Mehrotra, Grade

Overview of this book

Apache Spark is a flexible framework for processing batch and real-time data, and its unified engine has made it one of the most popular big data processing frameworks. This book will help you get started with Apache Spark 2.0 and write big data applications for a variety of use cases. Although it is intended as a quick start, it also explains the core concepts. This practical guide introduces the Spark 2.0 architecture and its components, and teaches you how to set up Spark on your local machine. You will then be introduced to resilient distributed datasets (RDDs) and the DataFrame API, along with their corresponding transformations and actions. Next, the book covers the life cycle of a Spark application and the techniques used to debug slow-running applications. You will also go through Spark's built-in modules for SQL, streaming, machine learning, and graph analysis. Finally, the book lays out the best practices and optimization techniques that are key to writing efficient Spark applications. By the end of this book, you will have a sound fundamental understanding of the Apache Spark framework and be able to write and optimize Spark applications.

DataFrames

As we already mentioned, DataFrame APIs are abstractions of RDD APIs. DataFrames are distributed collections of data that are organized in the form of rows and columns. In other words, DataFrames provide APIs to efficiently process structured data that's available in different sources. The sources could be an RDD, different types of files in a filesystem, any RDBMS, or Hive tables.
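To make the "distributed collection of rows and columns" idea concrete, here is a minimal single-process sketch in plain Python. It is not Spark's implementation or API: `ToyFrame`, `where`, `select`, and `collect` are hypothetical names chosen for illustration, and each inner list merely stands in for the slice of data that one worker would hold.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Row = Tuple[Any, ...]


@dataclass
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: named columns over partitioned rows."""

    columns: List[str]
    partitions: List[List[Row]]  # each inner list models the rows held by one worker

    def where(self, column: str, predicate: Callable[[Any], bool]) -> "ToyFrame":
        idx = self.columns.index(column)
        # The filter is applied to every partition independently,
        # which is what lets the work be spread across workers.
        kept = [[row for row in part if predicate(row[idx])] for part in self.partitions]
        return ToyFrame(self.columns, kept)

    def select(self, *names: str) -> "ToyFrame":
        idxs = [self.columns.index(name) for name in names]
        projected = [
            [tuple(row[i] for i in idxs) for row in part] for part in self.partitions
        ]
        return ToyFrame(list(names), projected)

    def collect(self) -> List[Row]:
        # Gather the rows of all partitions into one local list.
        return [row for part in self.partitions for row in part]


people = ToyFrame(
    columns=["name", "age"],
    partitions=[[("alice", 34), ("bob", 45)], [("carol", 29)]],
)
over_30 = people.where("age", lambda age: age > 30).select("name").collect()
print(over_30)
```

Because every row shares the same named columns, operations can be phrased in terms of column names rather than positions in a tuple, which is the main convenience a DataFrame adds over a plain collection of records.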

The features of DataFrames are as follows:

  • DataFrames can process data in different formats, such as CSV, Avro, and JSON, and from different storage systems, such as Hive, HDFS, and relational databases
  • DataFrames can process data volumes ranging from kilobytes to petabytes
  • DataFrames use the Spark SQL query optimizer to process data in a distributed and optimized manner
  • DataFrames offer APIs in multiple languages, including Java, Scala, Python, and R