Fast Data Processing with Spark

Fast Data Processing with Spark - Second Edition

By: Krishna Sankar, Holden Karau

Overview of this book

Spark is a framework for writing fast, distributed programs. It solves problems similar to those Hadoop MapReduce addresses, but with a fast in-memory approach and a clean functional-style API. Because it integrates with Hadoop and ships with built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), it can be used interactively to process and query big datasets quickly.

Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.

Fast Data Processing with Spark Second Edition

Credits

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Installing Spark and Setting up your Cluster

Directory organization and convention

Installing prebuilt distribution

Building Spark from source

Spark topology

A single machine

Running Spark on EC2

Deploying Spark with Chef (Opscode)

Deploying Spark on Mesos

Spark on YARN

Spark Standalone mode

Summary

Using the Spark Shell

Loading a simple text file

Using the Spark shell to run logistic regression

Interactively loading data from S3

Summary

Building and Running a Spark Application

Building your Spark project with sbt

Building your Spark job with Maven

Building your Spark job with something else

Summary

Creating a SparkContext

Scala

Java

SparkContext – metadata

Shared Java and Scala APIs

Python

Summary

Loading and Saving Data in Spark

RDDs

Loading data into an RDD

Saving your data

Summary

Manipulating your RDD

Manipulating your RDD in Scala and Java

Manipulating your RDD in Python

Summary

Spark SQL

The Spark SQL architecture

Summary

Spark with Big Data

Parquet – an efficient and interoperable big data format

Querying Parquet files with Impala

HBase

Summary

Machine Learning Using Spark MLlib

The Spark machine learning algorithm table

Spark MLlib examples

Summary

Testing

Testing in Java and Scala

Testing in Python

Summary

Tips and Tricks

Where to find logs

Concurrency limitations

Using Spark with other languages

A quick note on security

Community developed packages

Mailing lists

Summary

Index

About the Reviewers

Robin East has served in a wide range of roles covering operations research, finance, IT system development, and data science. In the 1980s, he was developing credit scoring models using data science and big data before anyone (including himself) had even heard of those terms! In the last 15 years, he has worked with numerous large organizations, implementing enterprise content search applications, content intelligence systems, and big data processing systems. He has created numerous solutions, ranging from swaps and derivatives in the banking sector to fashion analytics in the retail sector.

Robin became interested in Apache Spark after realizing the limitations of the traditional MapReduce model with respect to running iterative machine learning models. His focus is now on trying to further extend the Spark machine learning libraries, and also on teaching how Spark can be used in data science and data analytics through his blog, Machine Learning at Speed (http://mlspeed.wordpress.com).

Before NoSQL databases became the rage, he was an expert on tuning Oracle databases and extracting maximum performance from EMC Documentum systems. This work took him to clients around the world and led him to create the open source profiling tool called DFCprof that is used by hundreds of EMC users to track down performance problems. For many years, he maintained the popular Documentum internals and tuning blog, Inside Documentum (http://robineast.wordpress.com), and contributed hundreds of posts to EMC support forums. These community efforts bore fruit in the form of the award of EMC MVP and acceptance into the EMC Elect program.

Toni Verbeiren graduated with a PhD in theoretical physics in 2003. He used to work on models of artificial neural networks, entailing mathematics, statistics, simulations, (lots of) data, and numerical computations. Since then, he has been active in the industry in diverse domains and roles: infrastructure management and deployment, service management, IT management, ICT/business alignment, and enterprise architecture. Around 2010, Toni started picking up his earlier passion, which was by then named data science. The combination of data and common sense can be a very powerful basis to make decisions and analyze risk.

Toni is active as an owner and consultant at Data Intuitive (http://www.data-intuitive.com/) in everything related to big data science and its applications to decision and risk management. He is currently involved in Exascience Life Lab (http://www.exascience.com/) and the Visual Data Analysis Lab (http://vda-lab.be/), which is concerned with scaling up visual analysis of biological and chemical data.

Lijie Xu is a PhD student at the Institute of Software, Chinese Academy of Sciences. His research interests focus on distributed systems and large-scale data analysis. He has both academic and industrial experience at Microsoft Research Asia, Alibaba Taobao, and Tencent. As an open source software enthusiast, he has contributed to Apache Spark and written a popular technical report in Chinese, named Spark Internals, available at https://github.com/JerryLead/SparkInternals/tree/master/markdown.
