
Understanding data literacy

Data literacy is defined by Rahul Bhargava and Catherine D'Ignazio as the ability to read, work with, analyze, and argue with data. Throughout this chapter, I have pointed out how data comes in all shapes and sizes, so creating a common framework for communicating about data between different audiences becomes an important skill to master.

Data literacy becomes a common denominator for answering data questions between two or more people with different skills or experience. For example, if a sales manager wants to verify the data behind a chart in a quarterly report, fluency in the language of data saves time: the manager can ask the engineering team direct questions about data types and data attributes instead of searching for those details aimlessly.

Let's break down the concepts of data literacy to identify how they can be applied to your personal and professional life.

Reading data

What does it mean to read data? Reading data is consuming information, and that information can be in any format including a chart, a table, code, or the body of an email.

Reading data does not necessarily provide the consumer with all of the answers to their questions. Domain expertise may be required to understand how, when, and why a dataset was created before the consumer can fully interpret it.

For example, suppose you are a data analyst and a colleague sends a file attachment to your email with the subject line FYI and no additional information in the body of the message. We know from the What makes a good data analyst? section that we should start asking questions about the file attachment:

  • What methods were used to create the file (human or machine)?
  • What system(s) and workflow were used to create the file?
  • Who created the file and when was it created?
  • How often does this file refresh and is it manual or automated?

Asking these questions helps you to understand the concept of data lineage, which identifies the process by which a dataset was created. Knowing the lineage ensures that reading the data results in understanding it well enough to make decisions from it confidently.

Working with data

I define working with data as the act of creating a dataset, whether by a person or a system, using any technology. The technologies used to create data vary widely, from someone typing rows and columns into a spreadsheet to a software developer using loops and functions in Python code to create a pipe-delimited file, as sketched below.
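For instance, here is a minimal sketch of the second case, using only the Python standard library; the file name, field names, and sample values are hypothetical, not from the book's datasets:

    import csv

    def write_pipe_delimited(path, header, rows):
        # Write a pipe-delimited file using the csv module's delimiter option
        with open(path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="|")
            writer.writerow(header)
            for row in rows:  # loop over each record and write it out
                writer.writerow(row)

    # Hypothetical sample data
    sales = [("Product 1", "Customer A", 1000.00),
             ("Product 1", "Customer B", 1000.00)]
    write_pipe_delimited("sales.txt", ["Product", "Customer", "Amount"], sales)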

Since writing data varies by expertise and job function, a key takeaway from a data literacy perspective is that the producer of data should be conscious of how it will be consumed. Ideally, the producer should document the details of how, when, and where the data was created, including how often it is refreshed. Publishing this information democratizes the metadata (data about the data) and improves communication between anyone reading and working with the data.

For example, if you have a timestamp field in your dataset, is it using UTC (Coordinated Universal Time) or EST (Eastern Standard Time)? By documenting such assumptions and the reasons why the data is stored in a specific format, the person or team working with the data becomes a good data citizen and improves communication for analysis.
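To make the point concrete, here is a small sketch showing how a timezone-aware timestamp makes the zone explicit; the dates, field name, and fixed EST offset (which ignores daylight saving) are my own illustrative assumptions:

    from datetime import datetime, timedelta, timezone

    UTC = timezone.utc
    EST = timezone(timedelta(hours=-5), "EST")  # fixed offset; ignores daylight saving

    # Storing the timestamp in UTC and documenting that choice removes ambiguity
    created_at = datetime(2020, 1, 15, 14, 30, tzinfo=UTC)
    print(created_at.isoformat())                  # 2020-01-15T14:30:00+00:00
    print(created_at.astimezone(EST).isoformat())  # 2020-01-15T09:30:00-05:00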

Analyzing data

Analyzing data begins with modeling and structuring it to answer business questions. Data modeling is a vast topic, but for data literacy purposes it can be boiled down to dimensions and measures. Dimensions are distinct nouns, such as a person, place, or thing, and measures are verbs based on actions, which are then aggregated (sum, count, min, max, and average).
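As a quick illustration, here is a hedged sketch of grouping by a dimension and aggregating a measure with pandas; the DataFrame and its values are made up for this example:

    import pandas as pd

    # Product and Customer are dimensions; Amount is a measure
    df = pd.DataFrame({
        "Product":  ["Product 1", "Product 1", "Product 2"],
        "Customer": ["Customer A", "Customer B", "Customer A"],
        "Amount":   [1000.0, 500.0, 2000.0],
    })

    # Group by a dimension and aggregate the measure in several ways
    summary = df.groupby("Product")["Amount"].agg(["sum", "count", "min", "max", "mean"])
    print(summary)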

The foundation for building any data visualization or chart is rooted in data modeling, and most modern tech solutions have it built in, so you may already be modeling data without realizing it.

One quick solution to help to classify how the data should be used for analysis would be a data dictionary, which is defined as a centralized repository of information about data such as meaning, relationships to other data, origin, usage, and format.

You might be able to find a data dictionary in the help pages of source systems or from GitHub repositories. If you don't receive one from the creator of the file, you can create one for yourself and use it to ask questions about the data including assumed data types, data quality, and identifying data gaps.

Creating a data dictionary also helps to validate assumptions and is an aid to framing questions about the data when communicating with others. The easiest method to create a data dictionary is to transpose the first few rows of the source data so the rows turn into columns. If your data has a header row, then the first row turns into a list of all of the fields available. Let's walk through an example of how to create your own data dictionary from data. Here, we have a source Sales table representing Product and Customer sales by quarter:

Product   | Customer   | Quarter 1  | Quarter 2  | Quarter 3  | Quarter 4
----------|------------|------------|------------|------------|-----------
Product 1 | Customer A | $ 1,000.00 | $ 2,000.00 | $ 6,000.00 |
Product 1 | Customer B |            | $ 1,000.00 | $ 500.00   |
Product 2 | Customer A |            | $ 1,000.00 |            |
Product 2 | Customer C | $ 2,000.00 | $ 2,500.00 | $ 5,000.00 |
Product 3 | Customer A | $ 1,000.00 | $ 2,000.00 |            |
Product 4 | Customer B | $ 1,000.00 | $ 3,000.00 |            |
Product 5 | Customer A |            |            |            | $ 1,000.00

In the following table, I have transposed the preceding source table to create a new table for analysis, which forms an initial data dictionary. The first column on the left becomes a list of all of the fields available from the source table, and the header row now contains Record 1 to Record 3, which are sample rows of data that retain the integrity of each row from the source table. The last two columns on the right, labeled Estimated Data Type and Dimension or Measure, were added to help to define the use of this data for analysis. Understanding the data type and classifying each field as a dimension or measure will help to determine what type of analysis we can perform and how each field can be used in data visualizations:

Field Name | Record 1   | Record 2   | Record 3   | Estimated Data Type | Dimension or Measure
-----------|------------|------------|------------|---------------------|---------------------
Product    | Product 1  | Product 1  | Product 2  | varchar             | Dimension
Customer   | Customer A | Customer B | Customer A | varchar             | Dimension
Quarter 1  | $ 1,000.00 |            |            | float               | Measure
Quarter 2  | $ 2,000.00 | $ 1,000.00 | $ 1,000.00 | float               | Measure
Quarter 3  | $ 6,000.00 | $ 500.00   |            | float               | Measure
Quarter 4  |            |            |            | float               | Measure
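If the source table is loaded into pandas, the same transposition can be sketched in a few lines; the file name sales.csv, and the assumption that the file has a header row and at least three records, are mine:

    import pandas as pd

    # Load the source Sales table and transpose the first few rows
    # so the fields become the rows of the data dictionary
    df = pd.read_csv("sales.csv")
    data_dictionary = df.head(3).T
    data_dictionary.columns = ["Record 1", "Record 2", "Record 3"]

    # pandas' inferred dtypes are only a starting point for the
    # Estimated Data Type column; classifying each field as a
    # dimension or measure remains a manual judgment
    data_dictionary["Estimated Data Type"] = df.dtypes.astype(str).values
    print(data_dictionary)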

Using this technique can help you to ask the following questions about the data to ensure you understand the results:

  • What year does this dataset represent, or is it an accumulation of multiple years?
  • Does each quarter represent a calendar year or a fiscal year?
  • Was Product 5 first introduced in Quarter 4, given that there are no sales for that product by any customer in Quarter 1 through Quarter 3?

Arguing about the data

Finally, let's talk about how and why we should argue about data. Challenging and defending the numbers in charts or data tables helps to build credibility, and in many cases it is actually done behind the scenes. For example, most data engineering teams put various checks and balances in place, such as alerts during ingestion to avoid missing information. Additional checks include rules that scan log files for anomalies or errors in the processing of data.
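To give a flavor of what such a check might look like, here is a minimal sketch of a column, row-count, and null check on an ingested file; the function name, file name, and thresholds are hypothetical rather than any team's actual pipeline:

    import pandas as pd

    def validate_ingestion(df, required_columns, min_rows=1):
        # Flag ingested data that looks incomplete before it reaches consumers
        missing = [col for col in required_columns if col not in df.columns]
        if missing:
            raise ValueError(f"Missing expected columns: {missing}")
        if len(df) < min_rows:
            raise ValueError(f"Expected at least {min_rows} rows, got {len(df)}")
        null_counts = df[required_columns].isnull().sum()
        if null_counts.any():
            raise ValueError(f"Null values found:\n{null_counts[null_counts > 0]}")

    df = pd.read_csv("sales.csv")  # hypothetical ingested file
    validate_ingestion(df, ["Product", "Customer"], min_rows=5)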

From a consumer's perspective, trust and verify is a good approach. For example, when looking at a chart published in a credible news article, you can assume the data behind the story is accurate, but you should also verify the accuracy of the source data. The first thing to ask would be: does the chart include a link to a publicly available source dataset? The website fivethirtyeight.com is really good at providing access to the raw data and details of the methodologies used to create the analysis and charts found in news stories. Exposing the underlying dataset, and the process used to collect it, to the public opens up conversations about the how, what, and why behind the data and is a good method to disprove misinformation.

As a data analyst and creator of data outputs, you should welcome the need to defend your work. Having documentation such as a data dictionary and a GitHub repository, and documenting the methodology used to produce the data, will build trust with the audience and reduce the time it takes them to make data-driven decisions.

Hopefully, you now see the importance of data literacy and how it can be used to improve all aspects of communication of data between consumers and producers. With any language, practice will lead to improvement, so I invite you to explore some useful free datasets to improve your data literacy.

Here are a few to get started:

Let's begin with the Kaggle site, which was created to help companies host data science competitions to solve complex problems using data. Improve your skills in reading and working with data by exploring these datasets and walking through the concepts learned in this chapter, such as identifying the data type for each field and confirming a data dictionary exists.

Next is the supporting data from FiveThirtyEight, which is a data journalism site providing analytic content from sports to politics. What I like about their process is the transparency behind the published news stories: they expose open GitHub links to their source data along with discussions of the methodology behind it.

Another important open source of data is The World Bank, which offers a plethora of options to consume or produce data from across the world to help improve lives through data. Most of the datasets are licensed under a Creative Commons license, which governs the terms of how and when the data can be used; making them freely available opens up opportunities to blend public and private data together with significant time savings.