Initial Data Analysis

As a rule of thumb, when starting the analysis of a new dataset, it is good practice to check the dimensionality of the data, type of columns, possible missing values, and some generic statistics on the numerical columns. We can also get the first 5 to 10 entries in order to acquire a feeling for the data itself. We'll perform these steps in the following code snippets:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
# import data from the GitHub page of the book
data = pd.read_csv('https://raw.githubusercontent.com'
                   '/PacktWorkshops/The-Data-Analysis-Workshop'
                   '/master/Chapter02/data/'
                   'Absenteeism_at_work.csv', sep=";")

Note that we are providing the separator parameter when reading the data because, although the original data file is in the CSV format, the ";" symbol has been used to separate the various fields.
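To see why the separator matters, here is a minimal sketch that parses a small, made-up semicolon-separated text both ways. The sample rows and the names raw_text, wrong, and right are assumptions for illustration only and are not taken from the real file:

```python
import io
import pandas as pd

# Hypothetical miniature of a semicolon-separated file (not the real dataset)
raw_text = "ID;Reason for absence;Month of absence\n11;26;7\n36;0;7\n3;23;7\n"

# Without sep=";", pandas assumes commas and reads each line as a single field
wrong = pd.read_csv(io.StringIO(raw_text))

# With sep=";", the three fields are split into separate columns
right = pd.read_csv(io.StringIO(raw_text), sep=";")

print(wrong.shape)   # one column only
print(right.shape)   # three columns
```

A data frame with a single, very wide column right after loading is usually a sign that the wrong separator was used.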

In order to print the dimensionality of the data, column types, and the number of missing values, we can use the following code:

# print dimensionality of the data, columns, types and missing values
print(f"Data dimension: {data.shape}")
for col in data.columns:
    print(f"Column: {col:35} | type: {str(data[col].dtype):7} "
          f"| missing values: {data[col].isna().sum():3d}")

This returns the following output:

Figure 2.1: Dimensions of the Absenteeism_at_work dataset

As we can see, of the 21 columns, only one (Work Load Average/day) does not contain integer values. Since no missing values are present in the data, we can consider it quite clean. We can also derive some basic statistics by using the describe method:

# compute statistics on numerical features
data.describe().T

The output will be as follows:

Figure 2.2: Output of the describe() method
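The checks performed so far (shape, column types, missing values, and descriptive statistics) can also be gathered into small tables, which is easier to scan than printed lines when there are many columns. The following is only a sketch on a made-up frame; the toy values and the names toy, summary, and stats are assumptions, not part of the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the absenteeism data
toy = pd.DataFrame({
    "Age": [33, 50, 38, 39],
    "Work load": [239.554, np.nan, 239.554, 239.554],
    "Social drinker": [1, 1, 0, 1],
})

# One row per column: its dtype and its number of missing values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
})
print(f"Data dimension: {toy.shape}")
print(summary)

# describe() skips missing values, so "count" can differ between columns
stats = toy.describe().T
print(stats[["count", "mean", "min", "max"]])
```

Comparing the count column of describe() with the number of rows is another quick way to spot columns with missing entries.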

Note that some of the columns, such as Month of absence, Day of the week, Seasons, Education, Disciplinary failure, Social drinker, and Social smoker, encode categorical values as integers. We can back-transform these numerical values to their original categories, which gives us readable labels when plotting. We will perform the transformation by defining a Python dict object containing the mapping and then calling the apply() method on each feature, which applies the provided function to each of the values in the column. First, let's define the encoding dict objects:

# define encoding dictionaries
month_encoding = {1: "January", 2: "February", 3: "March",
                  4: "April", 5: "May", 6: "June", 7: "July",
                  8: "August", 9: "September", 10: "October",
                  11: "November", 12: "December", 0: "Unknown"}
dow_encoding = {2: "Monday", 3: "Tuesday", 4: "Wednesday",
                5: "Thursday", 6: "Friday"}
season_encoding = {1: "Spring", 2: "Summer", 3: "Fall", 4: "Winter"}
education_encoding = {1: "high_school", 2: "graduate",
                      3: "postgraduate", 4: "master_phd"}
yes_no_encoding = {0: "No", 1: "Yes"}

Afterward, we apply the encoding dictionaries to the relevant features:

# backtransform numerical variables to categorical
preprocessed_data = data.copy()
preprocessed_data["Month of absence"] = preprocessed_data\
                                        ["Month of absence"]\
                                        .apply(lambda x: \
                                               month_encoding[x])
preprocessed_data["Day of the week"] = preprocessed_data\
                                       ["Day of the week"]\
                                       .apply(lambda x: \
                                              dow_encoding[x])
preprocessed_data["Seasons"] = preprocessed_data["Seasons"]\
                              .apply(lambda x: season_encoding[x])
preprocessed_data["Education"] = preprocessed_data["Education"]\
                                 .apply(lambda x: \
                                        education_encoding[x])
preprocessed_data["Disciplinary failure"] = \
preprocessed_data["Disciplinary failure"].apply(lambda x: \
                                                yes_no_encoding[x])
preprocessed_data["Social drinker"] = \
preprocessed_data["Social drinker"].apply(lambda x: \
                                          yes_no_encoding[x])
preprocessed_data["Social smoker"] = \
preprocessed_data["Social smoker"].apply(lambda x: \
                                         yes_no_encoding[x])
# show the first rows of the transformed data
preprocessed_data.head().T

The output will be as follows:

Figure 2.3: Transformation of columns

In the previous code snippet, we created a copy of the original dataset by calling the .copy() method on the data object. This is a convenient way to create new pandas DataFrames without risking changes to the original raw data, which might serve us later. We had previously defined a set of dictionaries in which the numerical codes are the keys and the category labels are the values. Finally, we used the .apply() method on each column we wanted to encode, mapping each value in the original column to its corresponding label in the encoding dictionary. Note that the Month of absence column contains the value 0, which is encoded as Unknown because no month corresponds to 0.
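As an alternative to the lambda-based lookup, pandas also lets us pass a dictionary directly to Series.map(). The sketch below is not the approach used in this chapter; it only contrasts how the two approaches behave when a code is missing from the dictionary. The toy series days, with the unmapped value 7, is an assumption made for illustration:

```python
import pandas as pd

# Hypothetical toy column; the value 7 has no entry in the dictionary below
days = pd.Series([2, 3, 6, 7])
dow_encoding = {2: "Monday", 3: "Tuesday", 4: "Wednesday",
                5: "Thursday", 6: "Friday"}

# Series.map accepts the dictionary directly; unmapped values become NaN
mapped = days.map(dow_encoding)
print(mapped.tolist())

# The lambda-based lookup raises KeyError on the unmapped value instead
try:
    days.apply(lambda x: dow_encoding[x])
except KeyError as err:
    print(f"KeyError for value {err}")
```

Which behavior is preferable depends on the situation: a KeyError makes an incomplete dictionary obvious immediately, while NaN values let the transformation finish and can be inspected afterward.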

Based on the description of the data, the Reason for absence column contains information about the absence, which is encoded based on the International Classification of Diseases (ICD). The following table represents the various encodings:

Figure 2.4: Reason for absence encoding

Note that only values 1 to 21 represent ICD encoding; values 22 to 28 are separate reasons, which do not represent a disease, while value 0 is not defined—hence the encoded reason Unknown. As all values contained in the ICD represent some type of disease, it makes sense to create a new binary variable that indicates whether the current reason for absence is related to some sort of disease or not. We will do this in the following exercise.

Exercise 2.01: Identifying Reasons for Absence

In this exercise, you will create a new variable, called Disease, which indicates whether a specific reason for absence is present in the ICD table or not. Please complete the initial data analysis before you begin this exercise. Now, follow these steps:

First, define a function that returns Yes if a provided encoded value is contained in the ICD (values 1 to 21); otherwise, No:

# define function, which checks if the provided integer value
# is contained in the ICD or not
def in_icd(val):
    return "Yes" if 1 <= val <= 21 else "No"

Combine the .apply() method with the previously defined in_icd() function in order to create the new Disease column in the preprocessed dataset:
```
# add Disease column
preprocessed_data["Disease"] = \
preprocessed_data["Reason for absence"].apply(in_icd)
```
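The same Yes/No labels can also be produced without a Python-level function, using a vectorized range check. This is only a sketch of an alternative, not the method used in the exercise; the reasons series below is made up so that it covers the boundaries of the ICD range:

```python
import numpy as np
import pandas as pd

# Hypothetical reason codes covering the boundaries of the ICD range
reasons = pd.Series([0, 1, 13, 21, 22, 28])

# between() is inclusive on both ends by default, matching 1 <= val <= 21
disease = pd.Series(np.where(reasons.between(1, 21), "Yes", "No"))
print(disease.tolist())
# → ['No', 'Yes', 'Yes', 'Yes', 'No', 'No']
```

Testing the boundary values 0, 1, 21, and 22 explicitly is a cheap way to confirm that the range check includes exactly the intended codes.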

Use bar plots in order to compare the absences due to disease reasons:

import os
os.makedirs('figs', exist_ok=True)  # savefig fails if the folder is missing
plt.figure(figsize=(10, 8))
sns.countplot(data=preprocessed_data, x='Disease')
plt.savefig('figs/disease_plot.png', format='png', dpi=300)

The output will be as follows:

Figure 2.5: Comparing absence count to disease

Here, we are using the seaborn .countplot() function, which is quite handy when creating this type of bar plot, in which we want to know the total number of entries for each specific class. As we can see, the number of absences whose reason is not listed in the ICD table is almost twice the number of absences whose reason is listed.
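The bar heights of a count plot are simply the number of rows per class, so the same comparison can be checked numerically with value_counts(). The following sketch uses made-up labels rather than the real Disease column, so the printed numbers are purely illustrative:

```python
import pandas as pd

# Hypothetical Disease labels, not taken from the real dataset
disease = pd.Series(["No", "Yes", "No", "No", "Yes", "No"])

# Absolute counts per class: the bar heights a count plot would draw
counts = disease.value_counts()

# Relative frequencies, useful for statements such as "almost twice as many"
shares = disease.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

Applied to preprocessed_data["Disease"], the same two calls would give the exact counts behind Figure 2.5.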

Note

To access the source code for this specific section, please refer to https://packt.live/2B9AqVJ.

You can also run this example online at https://packt.live/2UPwIr1. You must execute the entire Notebook in order to get the desired result.

In this section, we performed some simple data exploration and transformations on the initial absenteeism dataset. In the next section, we will go deeper into our data exploration and analyze some of the possible reasons for absence.

The Data Analysis Workshop

By: Gururajan Govindan, Shubhangi Hora, Konstantin Palagachev, Brent Broadnax, John Wesley Doyle, Ashish Jain, Robert Thas John, Ravi Ranjan Prasad Karn, Pritesh Tiwari
