Exploratory Data Analysis (EDA)

Exploratory data analysis (EDA) is an approach to analyzing datasets that summarizes their main characteristics, often with visual methods, in order to draw useful conclusions.

The purpose of EDA is to:

Discover patterns within a dataset
Spot anomalies
Form hypotheses regarding the behavior of data
Validate assumptions

Everything from basic summary statistics to complex visualizations helps us build an intuitive understanding of the data, which is essential for forming new hypotheses and uncovering which parameters affect the target variable. Often, seeing how the target variable varies across a single feature indicates how important that feature might be, and seeing how it varies across a combination of several features suggests ideas for new, informative features to engineer.

Most explorations and visualizations are intended to understand the relationship between the features and the target variable. This is because we want to find out what relationships exist (or don't exist) between the data we have and the values we want to predict.
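As a minimal sketch of the single-feature check described above, the snippet below groups a small made-up table by one feature and computes the share of positive targets in each group. The column names `depth_band` and `target`, and all of the values, are illustrative assumptions rather than rows from any real dataset.

```python
import pandas as pd

# Toy data: the columns and values are made up for illustration only.
df = pd.DataFrame({
    "depth_band": ["shallow", "shallow", "shallow", "deep", "deep", "deep"],
    "target": [1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 target within a group is the share of positive cases there.
target_rate = df.groupby("depth_band")["target"].mean()
print(target_rate)
```

A large gap between the group rates hints that the feature may be informative. On a sample this small it proves nothing, so treat it as a prompt for a closer look.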

EDA can tell us about:

Features that are unclean, have missing values, or have outliers
Features that are informative and are a good indicator of the target
The kind of relationships features have with the target
Additional features that the data might need but that we don't already have
Edge cases we might need to account for separately
Filters we might need to apply to the dataset
The presence of incorrect or fake data points
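The first item in this list can be checked with a few lines of pandas. The sketch below measures the share of missing values per column and flags outliers with the 1.5 × IQR rule, which is one common heuristic among several. The column names `magnitude` and `depth_km` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with gaps and one implausible value (all values are made up).
df = pd.DataFrame({
    "magnitude": [5.1, 6.3, np.nan, 7.0, 5.8, 6.1, 25.0],
    "depth_km": [10.0, 33.0, 15.0, np.nan, np.nan, 20.0, 12.0],
})

# Share of missing values in each column.
missing_share = df.isna().mean()
print(missing_share)

# Flag values outside 1.5 * IQR of the quartiles (missing values are ignored).
q1 = df["magnitude"].quantile(0.25)
q3 = df["magnitude"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["magnitude"] < q1 - 1.5 * iqr) | (df["magnitude"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]
print(outliers)
```

Whether a flagged value is an error or a genuine extreme is a judgment call that depends on the domain. The rule only points to rows worth inspecting.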

Now that we've looked at why EDA is important and what it can tell us, let's talk about what exactly EDA involves. EDA can involve anything from looking at basic summary statistics to visualizing complex trends over multiple variables. Even simple statistics and plots can be powerful tools, as they may reveal important facts about the data that could change our modeling perspective. Trends and patterns are much easier to detect in a plot than in raw data and numbers. Visualizations also prompt us to ask questions such as "How?" and "Why?", and to form hypotheses about the dataset that can be validated by further visualizations. This is an iterative process that leads to a deeper understanding of the data.
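To give a sense of how little code the "basic summary statistics" end of that range requires, here is a minimal sketch using pandas' `describe()` and `value_counts()`. The DataFrame is made up for illustration, and its column names are assumptions.

```python
import pandas as pd

# Toy data (values are made up for illustration only).
df = pd.DataFrame({
    "magnitude": [5.1, 6.3, 7.0, 5.8],
    "country": ["JAPAN", "CHILE", "JAPAN", "JAPAN"],
})

# Count, mean, standard deviation, min, quartiles, and max of a numeric column.
summary = df["magnitude"].describe()
print(summary)

# Frequency of each category in a categorical column.
counts = df["country"].value_counts()
print(counts)
```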

The dataset that we will use for our exploratory analysis and visualizations is taken from the Significant Earthquake Database from NOAA, available as a public dataset on Google BigQuery (table ID: 'bigquery-public-data.noaa_significant_earthquakes.earthquakes'). We will use a subset of the available columns, the metadata for which is available at https://console.cloud.google.com/bigquery?project=packt-data&folder&organizationId&p=bigquery-public-data&d=noaa_significant_earthquakes&t=earthquakes&page=table, and will load it into a pandas DataFrame to perform the exploration. We'll use Matplotlib for most of our visualizations, along with the Seaborn and Missingno libraries for some. Note, however, that Seaborn is built on top of Matplotlib, so anything that is plotted using Seaborn can also be plotted using Matplotlib directly. We'll try to keep things interesting by using visualizations from both libraries.

The exploration and analysis will be conducted keeping in mind a sample problem statement: Given the data we have, we want to predict whether an earthquake caused a tsunami. This will be a classification problem (more on this in Chapter 5, Classification Techniques) where the target variable is the flag_tsunami column.

Before we begin, let's first import the required libraries, which we will be using for most of our data manipulations and visualizations.

In a Jupyter notebook, import the following libraries:

import json
import pandas as pd
import numpy as np
import missingno as msno
from sklearn.impute import SimpleImputer
import matplotlib.pyplot as plt
import seaborn as sns

We can also read in the metadata containing the data type of each column, which is stored as a JSON file. The following command opens the file in read mode and uses the json library to load its contents into a dictionary:

# A forward slash avoids the invalid '\d' escape sequence and works on all platforms
with open('../dtypes.json', 'r') as jsonfile:
    dtyp = json.load(jsonfile)

Note

The output of the preceding command can be found here: https://packt.live/3a4Zjhm
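One common use of a column-to-dtype dictionary like this is to pass it to `pandas.read_csv` through its `dtype` parameter, so that each column is parsed as the intended type. The sketch below assumes that usage. The mapping, the column names, and the in-memory CSV are hypothetical stand-ins, not the actual contents of dtypes.json or of the earthquake data.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the dictionary loaded from dtypes.json.
dtyp = json.loads('{"id": "int64", "country": "object", "magnitude": "float64"}')

# Hypothetical CSV content, held in memory so the example is self-contained.
csv_text = "id,country,magnitude\n1,JAPAN,7.1\n2,CHILE,\n"

# dtype= tells pandas how to parse each named column.
data = pd.read_csv(io.StringIO(csv_text), dtype=dtyp)
print(data.dtypes)
```

Declaring types up front avoids relying on pandas' type inference, which can guess differently from one file to the next when columns contain missing or mixed values.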

The Supervised Learning Workshop - Second Edition

By : Blaine Bateman, Ashish Ranjan Jha, Benjamin Johnston, Ishita Mathur, Tiffany Ford, Sukanya Mandal, Ashish Pratik Patil
