Data Science with Python

By: Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen

Overview of this book

Data Science with Python begins by introducing you to data science and teaches you to install the packages you need to create a data science coding environment. You will learn three major techniques in machine learning: unsupervised learning, supervised learning, and reinforcement learning. You will also explore basic classification and regression techniques, such as support vector machines, decision trees, and logistic regression. As you make your way through the book, you will understand the basic functions, data structures, and syntax of the Python language that are used to handle large datasets with ease. You will learn about NumPy and pandas libraries for matrix calculations and data manipulation, discover how to use Matplotlib to create highly customizable visualizations, and apply the boosting algorithm XGBoost to make predictions. In the concluding chapters, you will explore convolutional neural networks (CNNs), deep learning algorithms used to predict what is in an image. You will also understand how to feed human sentences to a neural network, make the model process contextual information, and create human language processing systems to predict the outcome. By the end of this book, you will be able to understand and implement any new data science algorithm and have the confidence to experiment with tools or libraries other than those covered in the book.
Table of Contents (10 chapters)

Chapter 1: Introduction to Data Science and Data Preprocessing

Activity 1: Pre-Processing Using the Bank Marketing Subscription Dataset

Solution

Let's perform various pre-processing tasks on the Bank Marketing Subscription dataset. We'll also be splitting the dataset into training and testing data. Follow these steps to complete this activity:

  1. Open a Jupyter notebook and add a new cell to import the pandas library and load the dataset into a pandas dataframe. To do so, you first need to import the library, and then use the pd.read_csv() function, as shown here:

    import pandas as pd

    # Use the raw file URL; the github.com/.../blob/... page serves HTML, not CSV
    Link = 'https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/Banking_Marketing.csv'

    # Read the data into the DataFrame df
    df = pd.read_csv(Link, header=0)

  2. To find the number of rows and columns in the dataset, add the following code:

    # Find the number of rows and columns
    print("Number of rows and columns : ", df.shape)

    The preceding code generates the following output:

    Figure 1.60: Number of rows and columns in the dataset
  3. To print the list of all columns, add the following code:

    # Print all the column names
    print(list(df.columns))

    The preceding code generates the following output:

    Figure 1.61: List of columns present in the dataset
  4. To get an overview of the basic statistics of each column, such as the count, mean, median, standard deviation, minimum, and maximum, add the following code:

    # Basic statistics of each column
    df.describe().transpose()

    The preceding code generates the following output:

    Figure 1.62: Basic statistics of each column
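
By default, describe() summarizes only the numeric columns, and the median appears under the 50% label. The following is a minimal sketch on a made-up frame (the column names and values are illustrative assumptions, not taken from the bank dataset) that reads the median from the summary and also summarizes a text column:

```python
import pandas as pd

# Toy data for illustration only; not the bank marketing dataset
toy = pd.DataFrame({
    "age": [25, 35, 45, 55],
    "job": ["admin", "admin", "technician", "services"],
})

# Numeric summary: one row per numeric column after transposing
numeric_summary = toy.describe().transpose()
median_age = numeric_summary.loc["age", "50%"]

# Summary of the non-numeric columns (count, unique, top, freq)
text_summary = toy.describe(exclude=["number"]).transpose()

print(numeric_summary)
print(text_summary)
```

Transposing puts one column of the dataset on each row, which tends to be easier to read when the dataset has many columns.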
  5. To print the basic information of each column, add the following code:

    # Basic information of each column
    # (info() prints its report directly and returns None, so no print() is needed)
    df.info()

    The preceding code generates the following output:

    Figure 1.63: Basic information of each column

    In the preceding figure, you can see that none of the columns contains any null values. Also, the type of each column is provided.

  6. Now let's check for missing values and the type of each feature. Add the following code to do this:

    # Find the data type of each column and check for null values
    null_ = df.isna().any()
    dtypes = df.dtypes
    sum_na_ = df.isna().sum()
    info = pd.concat([null_, sum_na_, dtypes], axis=1, keys=['isNullExist', 'NullSum', 'type'])
    info

    Have a look at the output for this in the following figure:

    Figure 1.64: Information of each column stating the number of null values and the data types
  7. Since we have loaded the dataset into the df object, we can now remove the null values from it. To do so, add the following code:

    # Remove the rows that contain null values
    df = df.dropna()

    # Total number of nulls in each column (should all be 0 now)
    print(df.isna().sum())

    Have a look at the output for this in the following figure:

    Figure 1.65: Features of dataset with no null values
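
dropna() removes every row that has at least one missing value, so it is worth checking how many rows are lost. A minimal sketch on a made-up frame (toy values, not the bank dataset) compares the row count before and after:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; two of the four rows have a missing value
toy = pd.DataFrame({
    "age": [30.0, np.nan, 41.0, 52.0],
    "job": ["admin", "services", None, "technician"],
})

rows_before = len(toy)
clean = toy.dropna()          # drops any row with at least one missing value
rows_after = len(clean)

print(f"Dropped {rows_before - rows_after} of {rows_before} rows")
print(clean.isna().sum())     # null count per column after cleaning
```

If a large share of the rows disappears, filling the gaps instead of dropping them may be the better choice.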
  8. Now we check the frequency distribution of the education column in the dataset. Use the value_counts() function to implement this:

    df.education.value_counts()

    Have a look at the output for this in the following figure:

    Figure 1.66: Frequency distribution of the education column
  9. In the preceding figure, we can see that the education column of the dataset has many categories. We need to reduce the categories for better modeling. To check the various categories in the education column, we use the unique() function. Type the following code to implement this:

    df.education.unique()

    The output is as follows:

    Figure 1.67: Various categories of the education column
  10. Now let's group the basic.4y, basic.9y, and basic.6y categories together and call them Basic. To do this, we can use the replace() method from pandas:

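    # Assign the result back; an in-place replace on df.education may not update df in newer pandas versions
    df['education'] = df['education'].replace({"basic.9y": "Basic", "basic.6y": "Basic", "basic.4y": "Basic"})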

  11. To check the list of categories after grouping, add the following code:

    df.education.unique()

    Figure 1.68: Various categories of the education column

    In the preceding figure, you can see that basic.9y, basic.6y, and basic.4y are grouped together as Basic.
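
The grouping works because replace() accepts a dictionary that maps old values to new ones, and several keys may share the same target. A minimal sketch on a standalone Series (the two non-basic labels are illustrative assumptions):

```python
import pandas as pd

# Toy Series for illustration only
education = pd.Series(
    ["basic.4y", "high.school", "basic.9y", "basic.6y", "university.degree"]
)

# Map each of the three basic.* labels to the single label "Basic"
mapping = {"basic.4y": "Basic", "basic.6y": "Basic", "basic.9y": "Basic"}
grouped = education.replace(mapping)

print(grouped.value_counts())
```

Values that do not appear as keys in the dictionary are left unchanged.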

  12. Now we identify the columns that need to be encoded. Add the following code to select all the non-numeric columns:

    import numpy as np

    # Select all the non-numeric columns using the select_dtypes() function
    data_column_category = df.select_dtypes(exclude=[np.number]).columns
    data_column_category

    The preceding code generates the following output:

    Figure 1.69: Various columns of the dataset
  13. Now we take the list of categorical feature names and loop through it, creating dummy-variable (one-hot) encoded columns for each feature and joining them to the dataframe. Add the following code to do this:

    cat_vars = data_column_category

    for var in cat_vars:
        # One-hot encode the column and append the dummy columns to df
        cat_list = pd.get_dummies(df[var], prefix=var)
        df = df.join(cat_list)

    df.columns

    The preceding code generates the following output:

    Figure 1.70: List of categorical features in the data
  14. Now we drop the original categorical columns that we have just encoded, keeping only the numerical and encoded categorical columns. Add the following code to do this:

    # Categorical features
    cat_vars = data_column_category

    # All features
    data_vars = df.columns.values.tolist()

    # Leave out the original categorical columns that have been encoded
    to_keep = [i for i in data_vars if i not in cat_vars]

    # Select only the numerical and encoded categorical columns
    data_final = df[to_keep]
    data_final.columns

    The preceding code generates the following output:

    Figure 1.71: List of numerical and encoded categorical columns
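
The loop in step 13 and the column filtering in step 14 can also be collapsed into a single call, because pd.get_dummies() accepts a columns argument, encodes those columns, and drops the originals. The sketch below is an alternative shown on a made-up frame (the column names and values are illustrative assumptions), not the book's own method:

```python
import pandas as pd

# Toy data for illustration only; not the bank marketing dataset
toy = pd.DataFrame({
    "age": [30, 41, 52],
    "marital": ["single", "married", "single"],
    "y": [0, 1, 0],
})

# Names of the non-numeric columns
cat_cols = toy.select_dtypes(exclude=["number"]).columns.tolist()

# Encode the categorical columns and drop the originals in one step
encoded = pd.get_dummies(toy, columns=cat_cols)

print(encoded.columns.tolist())
```

The untouched columns come first in the result, followed by one dummy column per category.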
  15. Finally, we split the data into train and test sets. Add the following code to implement this:

    # Separate the independent variables and the target variable
    X = data_final.drop(columns='y')
    y = data_final['y']

    from sklearn.model_selection import train_test_split

    # Hold out 20% of the rows for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    print("Full Dataset X Shape: ", X.shape)
    print("Train Dataset X Shape: ", X_train.shape)
    print("Test Dataset X Shape: ", X_test.shape)

    The output is as follows:

Figure 1.72: Shape of the full, train, and test datasets
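
When the target classes are imbalanced, a plain random split can leave the test set with too few examples of the rarer class. Passing stratify=y to train_test_split keeps the class proportions similar in both parts. Whether the bank target is imbalanced is an assumption here, so the sketch below uses made-up data and should be read as an optional variation:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 100 rows, 10% positive labels
X = pd.DataFrame({"feature": np.arange(100)})
y = pd.Series([1] * 10 + [0] * 90)

# stratify=y preserves the share of each class in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

With test_size=0.2, the test part receives 20 of the 100 rows and the training part the remaining 80.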