Fast Data Processing with Spark - Second Edition

By: Krishna Sankar, Holden Karau

Overview of this book

Spark is a framework for writing fast, distributed programs. It solves problems similar to those addressed by Hadoop MapReduce, but with a fast in-memory approach and a clean, functional-style API. With its ability to integrate with Hadoop and its built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), it can be used interactively to process and query big datasets quickly.

Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book guides you through every step required to write effective distributed programs, from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes.
Table of Contents (18 chapters)
Fast Data Processing with Spark Second Edition
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

About the Reviewers

Robin East has served in a wide range of roles covering operations research, finance, IT system development, and data science. In the 1980s, he was developing credit scoring models using data science and big data before anyone (including himself) had even heard of those terms! In the last 15 years, he has worked with numerous large organizations, implementing enterprise content search applications, content intelligence systems, and big data processing systems. He has created numerous solutions, ranging from swaps and derivatives in the banking sector to fashion analytics in the retail sector.

Robin became interested in Apache Spark after realizing the limitations of the traditional MapReduce model with respect to running iterative machine learning models. His focus is now on extending the Spark machine learning libraries and on teaching how Spark can be used in data science and data analytics through his blog, Machine Learning at Speed (http://mlspeed.wordpress.com).

Before NoSQL databases became the rage, he was an expert on tuning Oracle databases and extracting maximum performance from EMC Documentum systems. This work took him to clients around the world and led him to create the open source profiling tool DFCprof, which is used by hundreds of EMC users to track down performance problems. For many years, he maintained the popular Documentum internals and tuning blog, Inside Documentum (http://robineast.wordpress.com), and contributed hundreds of posts to EMC support forums. These community efforts earned him the EMC MVP award and acceptance into the EMC Elect program.

Toni Verbeiren earned a PhD in theoretical physics in 2003. His research was on models of artificial neural networks, entailing mathematics, statistics, simulations, (lots of) data, and numerical computations. Since then, he has been active in the industry in diverse domains and roles: infrastructure management and deployment, service management, IT management, ICT/business alignment, and enterprise architecture. Around 2010, Toni returned to his earlier passion, which by then had come to be called data science. The combination of data and common sense can be a very powerful basis for making decisions and analyzing risk.

Toni is active as an owner and consultant at Data Intuitive (http://www.data-intuitive.com/) in everything related to big data science and its applications to decision and risk management. He is currently involved in the Exascience Life Lab (http://www.exascience.com/) and the Visual Data Analysis Lab (http://vda-lab.be/), which is concerned with scaling up visual analysis of biological and chemical data.

Lijie Xu is a PhD student at the Institute of Software, Chinese Academy of Sciences. His research interests focus on distributed systems and large-scale data analysis. He has both academic and industrial experience at Microsoft Research Asia, Alibaba Taobao, and Tencent. As an open source software enthusiast, he has contributed to Apache Spark and written a popular technical report in Chinese, Spark Internals, available at https://github.com/JerryLead/SparkInternals/tree/master/markdown.