3. Introduction to DataFrames | Spark for Data Science

Spark for Data Science

By: Duvvuri, Singhal

Overview of this book

This is the era of Big Data. The term 'Big Data' implies big innovation and a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, so it is equipped with the necessary algorithms and supports multiple programming languages. Whether you are a technologist, a data scientist, or a beginner in Big Data analytics, this book will give you the skills needed to perform statistical data analysis, data visualization, and predictive modeling, and to build scalable data products or solutions using Python, Scala, and R. With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

Creating DataFrames

Creating a Spark DataFrame is similar to creating an RDD. To access the DataFrame API, you need an SQLContext or a HiveContext as the entry point (from Spark 2.0 onward, SparkSession serves as the unified entry point and subsumes both). This section demonstrates how to create DataFrames from various data sources, starting with basic code examples that use in-memory collections:

Creating DataFrames from RDDs

The following code creates an RDD from a list of colors, then maps it to a second RDD of tuples, each containing a color name and its length. Finally, it converts that RDD into a DataFrame with the toDF method, which takes a list of column labels as an optional argument:

Python:

>>> # Assumes the PySpark shell, where sc (the SparkContext) is predefined
>>> # Create a list of colours
>>> colors = ['white', 'green', 'yellow', 'red', 'brown', 'pink']
>>> # Distribute the local collection to form an RDD
>>> # Apply map on that RDD to get another RDD of (colour, length) tuples
>>> color_df = sc.parallelize(colors) \
...     .map(lambda x: (x, len(x))).toDF(["color", "length"])

>>>...
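
The snippet above relies on the interactive shell to provide sc. As a rough standalone sketch of the same pipeline, the script below assumes a local PySpark installation (Spark 2.0 or later) and builds its own SparkSession. The names COLORS, color_length_pairs, build_color_df, and main are hypothetical helpers introduced here for illustration and do not come from the book. The pure-Python helper shows the tuples that the map step yields, without needing Spark at all.

```python
# Illustrative sketch only: helper names are assumptions, not the book's code.

COLORS = ['white', 'green', 'yellow', 'red', 'brown', 'pink']


def color_length_pairs(colors):
    """Plain-Python equivalent of the map step: one (color, length) tuple per color."""
    return [(c, len(c)) for c in colors]


def build_color_df(spark, colors):
    """Run the chapter's pipeline against an existing SparkSession.

    parallelize distributes the local list as an RDD, map turns each
    element into a (color, length) tuple, and toDF attaches column labels.
    """
    rdd = spark.sparkContext.parallelize(colors)
    return rdd.map(lambda x: (x, len(x))).toDF(["color", "length"])


def main():
    # Imported inside main so the helpers above load even without PySpark.
    from pyspark.sql import SparkSession

    # Assumption: a local master is enough for this small demo.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("color-dataframe-demo")
             .getOrCreate())
    try:
        color_df = build_color_df(spark, COLORS)
        color_df.printSchema()  # column names and inferred types
        color_df.show()         # the six (color, length) rows
    finally:
        spark.stop()


# Runs without Spark: preview the tuples the map step produces.
print(color_length_pairs(COLORS))
```

Calling main() starts a local Spark session and prints the schema and rows. Passing the session into build_color_df, rather than relying on a global sc, keeps the pipeline reusable from both scripts and the shell.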
