Spark for Data Science

By: Srinivas Duvvuri, Bikramaditya Singhal

Overview of this book

This is the era of Big Data. The term ‘Big Data’ implies big innovation and enables a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, so it is equipped with the necessary algorithms and supports multiple programming languages. Whether you are a technologist, a data scientist, or a beginner in Big Data analytics, this book will provide you with the skills necessary to perform statistical data analysis, data visualization, and predictive modeling, and to build scalable data products or solutions using Python, Scala, and R. With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

Challenges with big data analytics


Broadly, the analysis of big data poses two formidable challenges. The first is the need for a massive computation platform; once that is in place, the second is to analyze and make sense of huge volumes of data at scale.

Computational challenges

As data volumes increased, the storage requirements for big data grew with them, and data management became a cumbersome task. The latency of disk access, dominated by seek time, became the major bottleneck even though processor speeds and RAM frequencies were adequate.

Fetching structured and unstructured data from across the gamut of business applications and data silos, consolidating it, and processing it to find useful business insights was challenging. Only a few applications were available, and each addressed just one or a few areas of the diverse business requirements. Integrating those applications to address most of the business requirements in a unified way only increased the complexity.

To address these challenges, people turned to distributed computing frameworks with distributed file systems, for example, Hadoop and the Hadoop Distributed File System (HDFS). This could greatly reduce the latency due to disk I/O, as the data could be read in parallel across the cluster of machines.
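The effect of reading in parallel can be illustrated with some rough arithmetic. The sketch below is an idealized model, not a benchmark: the dataset size (1 TB), the per-disk throughput (100 MB/s), and the cluster size (100 disks) are hypothetical figures chosen only for illustration, and the model ignores network transfer, data skew, and scheduling overhead.

```python
def scan_seconds(total_bytes, bytes_per_sec_per_disk, num_disks=1):
    """Idealized time to read total_bytes split evenly across num_disks
    disks that are all read at the same time."""
    per_disk_bytes = total_bytes / num_disks
    return per_disk_bytes / bytes_per_sec_per_disk


TB = 10**12  # bytes in a terabyte (decimal)
MB = 10**6   # bytes in a megabyte (decimal)

# Hypothetical figures: 1 TB of data, 100 MB/s sustained read per disk.
single_disk = scan_seconds(1 * TB, 100 * MB)               # one machine, one disk
cluster = scan_seconds(1 * TB, 100 * MB, num_disks=100)    # 100 disks in parallel

print(f"single disk: {single_disk:.0f} s, 100 disks: {cluster:.0f} s")
```

Under these assumptions the scan time drops in proportion to the number of disks read in parallel, which is the intuition behind spreading data across a cluster.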

Distributed computing technologies had existed for decades, but they gained prominence only after the industry realized the importance of big data. As a result, technology platforms such as Hadoop with HDFS or Amazon S3 became the industry standard. On top of Hadoop, many other solutions such as Pig, Hive, and Sqoop were developed to address different kinds of industry requirements such as storage, Extract, Transform, and Load (ETL), and data integration, making Hadoop a unified platform.

Analytical challenges

Analyzing data to find hidden insights has always been challenging, and dealing with huge datasets adds further intricacies. The traditional BI and OLAP solutions could not address most of the challenges that arose due to big data. As an example, if a dataset had many dimensions, say 100, it became very difficult to compare these variables with one another to draw a conclusion because there would be C(100, 2) = 4,950 pairwise combinations to examine. Such cases required statistical techniques such as correlation to find the hidden patterns.
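The pair count and the role of correlation can be sketched in a few lines of Python. This is a minimal illustration using only the standard library: the three-column toy dataset and the `pearson` helper are hypothetical names introduced here, and a real analysis would typically use a statistics library on far larger data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Number of distinct variable pairs among 100 dimensions.
num_pairs = math.comb(100, 2)
print(num_pairs)  # 4950

# Toy dataset (hypothetical): each key is a variable, each list its observations.
columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
}

# One correlation per unordered pair of variables.
pairwise = {
    (p, q): pearson(columns[p], columns[q])
    for p, q in combinations(columns, 2)
}
for pair, r in pairwise.items():
    print(pair, round(r, 3))
```

A single pass over all pairs replaces thousands of manual comparisons with one number per pair, which can then be sorted or thresholded to surface the strongest relationships.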

Though there were statistical solutions to many problems, it became very difficult for data scientists or analytics professionals to slice and dice the data to find intelligent insights unless they loaded the entire dataset into a DataFrame in memory. The major roadblock was that most of the general-purpose algorithms for statistical analysis and machine learning were single-threaded and written at a time when datasets were usually small enough to fit in the RAM of a single computer. Because they assumed in-memory computation on one machine, those algorithms, written in R or Python, could not be deployed in their native form in a distributed computing environment.

To address this challenge, statisticians and computer scientists had to work together to rewrite most of the algorithms so that they would work well in a distributed computing environment. Consequently, a library called Mahout for machine learning algorithms was developed on Hadoop for parallel processing. It included most of the algorithms commonly used in the industry. Similar initiatives were taken for other distributed computing frameworks.
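To give a sense of what rewriting an algorithm for a distributed environment can involve, the sketch below computes a mean and variance from per-partition partial results that are then merged. It is an illustration only and does not use Mahout or Hadoop: the function names are hypothetical, the partitions are small in-memory lists, and a thread pool stands in for the machines of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partial_stats(chunk):
    """Per-partition step: return (count, sum, sum of squares) for one chunk."""
    return (len(chunk), sum(chunk), sum(v * v for v in chunk))


def merge(a, b):
    """Combine two partial results; the order of merging does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def distributed_mean_variance(partitions):
    """Mean and population variance computed without gathering all rows in one place."""
    with ThreadPoolExecutor() as pool:
        # Each worker sees only its own partition.
        partials = list(pool.map(partial_stats, partitions))
    count, total, total_sq = reduce(merge, partials)
    mean = total / count
    # Population variance via E[x^2] - (E[x])^2; simple, but it can lose
    # precision on large-magnitude values.
    variance = total_sq / count - mean * mean
    return mean, variance


# Hypothetical data split across three "nodes".
partitions = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]
mean, variance = distributed_mean_variance(partitions)
print(mean, variance)
```

The key design point is that only three numbers per partition travel to the combining step, so no single machine ever needs the whole dataset in memory.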