Jupyter for Data Science

By Dan Toomey
Overview of this book

Jupyter Notebook is a web-based environment that enables interactive computing in notebook documents. It allows you to create documents that contain live code, equations, and visualizations. This book is a comprehensive guide to getting started with data science using the popular Jupyter Notebook. If you are familiar with Jupyter Notebook and want to learn how to use its capabilities to perform various data science tasks, this is the book for you! From data exploration to visualization, this book will walk you through every step of implementing an effective data science pipeline using Jupyter. You will also see how you can use Jupyter's features to share your documents and code with your colleagues. The book also explains how Python 3, R, and Julia can be integrated with Jupyter for various data science tasks. By the end of this book, you will be able to comfortably leverage the power of Jupyter to perform a wide range of data science tasks.

Loading JSON into Spark


Spark can also access JSON data for manipulation. Here we have an example that:

  • Loads a JSON file into a Spark data frame
  • Examines the contents of the data frame and displays the apparent schema
  • Moves the data frame into the context for direct access by the Spark session, as with the preceding data frames
  • Shows an example of accessing the data frame in the Spark context

The listing is as follows:

Our standard imports for Spark:

from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

Read in the JSON and display what we found:

# using data from https://gist.github.com/marktyers/678711152b8dd33f6346
df = spark.read.json("people.json")
df.show()
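The remaining steps from the list above can be sketched as follows; this is an illustrative sketch, and the view name people is an assumption, not taken from the original listing:

# register the data frame as a temporary view in the Spark context
# (the view name "people" is an assumed name for this example)
df.createOrReplaceTempView("people")

# access the data frame through the Spark session using SQL
spark.sql("SELECT * FROM people").show()

createOrReplaceTempView scopes the view to the current Spark session, so the SQL query sees exactly the data frame we just loaded.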

I had a difficult time getting standard JSON to load into Spark. Spark appears to expect one data record per line of the JSON file, whereas most JSON I have seen formats each record across multiple lines with indentation and the like.
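One workaround, not from the original text, for loading a conventional, indented JSON document is the multiLine reader option, available in Spark 2.2 and later. A minimal sketch, reusing the people.json file name from the example above:

# Spark's default JSON reader expects JSON Lines: one record per line, e.g.
#   {"name": "Alice", "age": 30}
#   {"name": "Bob", "age": 25}

# for a standard, pretty-printed JSON file, enable multi-line parsing
# (Spark 2.2+) so the whole file is treated as one JSON document
df = spark.read.option("multiLine", True).json("people.json")
df.show()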

Note

Notice the use...