Fast Data Processing with Spark

By: Holden Karau

Overview of this book

Spark is a framework for writing fast, distributed programs. It solves problems similar to those addressed by Hadoop MapReduce, but with a fast in-memory approach and a clean, functional-style API. Spark integrates with Hadoop and includes built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), so it can be used interactively to process and query big data sets quickly. Fast Data Processing with Spark covers how to write distributed MapReduce-style programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to deploying your job to the cluster and tuning it for your purposes. It begins with setting up your Spark cluster in a variety of situations (standalone, EC2, and so on) and using the interactive shell to prototype distributed programs and explore the Spark API. From there, it moves on to writing and deploying distributed jobs in Java, Scala, and Python. It also looks at how to use Hive with Spark through Shark's SQL-like query syntax, as well as how to manipulate resilient distributed datasets (RDDs).

Fast Data Processing with Spark

Credits

About the Author

About the Reviewers

www.PacktPub.com

Preface

Installing Spark and Setting Up Your Cluster

Running Spark on a single machine

Running Spark on EC2

Deploying Spark on Elastic MapReduce

Deploying Spark with Chef (opscode)

Deploying Spark on Mesos

Deploying Spark on YARN

Deploying set of machines over SSH

Links and references

Summary

Using the Spark Shell

Loading a simple text file

Using the Spark shell to run logistic regression

Interactively loading data from S3

Summary

Building and Running a Spark Application

Building your Spark project with sbt

Building your Spark job with Maven

Building your Spark job with something else

Summary

Creating a SparkContext

Scala

Java

Shared Java and Scala APIs

Python

Links and references

Summary

Loading and Saving Data in Spark

RDDs

Loading data into an RDD

Saving your data

Links and references

Summary

Manipulating Your RDD

Manipulating your RDD in Scala and Java

Manipulating your RDD in Python

Links and references

Summary

Shark – Using Spark with Hive

Using Hive queries in a Spark program

Links and references

Summary

Testing

Testing in Java and Scala

Testing in Python

Links and references

Summary

Tips and Tricks

Where to find logs?

Concurrency limitations

Memory usage and garbage collection

Serialization

IDE integration

Using Spark with other languages

A quick note on security

Mailing lists

Links and references

Summary

Index

Preface

As programmers, we are frequently asked to solve problems or work with data sets that are too large for a single machine to handle in practice. Many frameworks exist to make writing web applications easier, but few exist to make writing distributed programs easier. The Spark project, which this book covers, makes it easy for you to write distributed applications in the language of your choice: Scala, Java, or Python.

What this book covers

Chapter 1, Installing Spark and Setting Up Your Cluster, covers how to install Spark on a variety of machines and set up a cluster, ranging from a local single-node deployment suitable for development work, to a large cluster administered by Chef, to an EC2 cluster.

Chapter 2, Using the Spark Shell, gets you started running your first Spark jobs in an interactive mode. The Spark shell is a useful debugging and rapid development tool and is especially handy when you are just getting started with Spark.

Chapter 3, Building and Running a Spark Application, covers how to build standalone jobs suitable for production use on a Spark cluster. While the Spark shell is a great tool for rapid prototyping, you will likely do most of your work with Spark by building standalone jobs.

Chapter 4, Creating a SparkContext, covers how to create a connection to a Spark cluster. SparkContext is the entry point into the Spark cluster for your program.

Chapter 5, Loading and Saving Your Data, covers how to create and save RDDs (Resilient Distributed Datasets). Spark supports loading RDDs from any Hadoop data source.

Chapter 6, Manipulating Your RDD, covers how to do distributed work on your data with Spark. This chapter is the fun part.

Chapter 7, Using Spark with Hive, talks about how to set up Shark—a HiveQL-compatible system with Spark—and integrate Hive queries into your Spark jobs.

Chapter 8, Testing, looks at how to test your Spark jobs. Distributed tasks can be especially tricky to debug, which makes testing them all the more important.

Chapter 9, Tips and Tricks, looks at how to improve your Spark jobs, with topics such as where to find logs, memory usage and garbage collection, and serialization.
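
As a rough illustration of the map-and-reduce style of work these chapters build toward, here is a minimal sketch in plain Python. It does not use the Spark API and runs on a single machine; the function names map_phase, reduce_by_key, and word_count are hypothetical and chosen for this example only. It counts words by first emitting (word, 1) pairs and then combining the values for each key, which is the kind of computation Spark spreads across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_by_key(pairs, combine):
    """Group values by key, then fold each group with `combine`."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(combine, values) for key, values in groups.items()}


def word_count(lines):
    """Count word occurrences using the two phases above."""
    return reduce_by_key(map_phase(lines), lambda a, b: a + b)


if __name__ == "__main__":
    sample = ["spark is fast", "spark is fun"]
    print(word_count(sample))
```

The two phases are kept as separate functions so that the per-record step (map_phase) and the per-key combining step (reduce_by_key) can be read independently; this separation is what lets a distributed framework run each step on many machines at once.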

What you need for this book

To get the most out of this book, you need some familiarity with Linux/Unix and knowledge of at least one of these programming languages: C++, Java, or Python. Access to more than one machine, or to EC2, helps you take full advantage of the distributed nature of Spark; however, it is certainly not required, as Spark has an excellent standalone mode.

Who this book is for

This book is for any developer who wants to learn how to write effective distributed programs using the Spark project.

Conventions

In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The tarball file contains a bin directory that needs to be added to your path and SCALA_HOME should be set to the path where the tarball is extracted."

Any command-line input or output is written as follows:

./run spark.examples.GroupByTest local[4]

New terms and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "by selecting Key Pairs under Network & Security".

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

To send us general feedback, simply send an e-mail to <[email protected]>, and mention the book title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

All of the example code from this book is hosted in three separate GitHub repos:

Disclaimer

The opinions in this book are those of the author and not necessarily those of any of my employers, past or present. The author has taken reasonable steps to ensure the example code is safe for use. You should verify the code yourself before using it with important data. The author does not give any warranty, express or implied, or make any representation that the contents will be complete, accurate, or up to date. The author shall not be liable for any loss, actions, claims, proceedings, demands, costs, or damages whatsoever or howsoever caused arising directly or indirectly in connection with or arising out of the use of this material.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at <[email protected]> if you are having a problem with any aspect of the book, and we will do our best to address it.
