The Applied Data Science Workshop - Second Edition

By: Alex Galea

Overview of this book

From banking and manufacturing to education and entertainment, data science has transformed almost every sector of modern business. It has an important role to play in everything from app development to network security. Taking an interactive approach to learning the fundamentals, this book is ideal for beginners. You’ll learn best practices and techniques for applying data science to real-world scenarios and examples. After an introduction to data science and machine learning, you’ll get to grips with Jupyter functionality and features. You’ll use Python libraries such as scikit-learn, pandas, Matplotlib, and Seaborn to perform data analysis and data preprocessing on real-world datasets from within your own Jupyter environment. Progressing through the chapters, you’ll train classification models using scikit-learn and assess model performance using advanced validation techniques. Towards the end, you’ll use Jupyter Notebooks to document your research, build stakeholder reports, and even analyze web performance data. By the end of The Applied Data Science Workshop, you’ll be ready to move beyond beginner level and confidently apply data science techniques and tools to real-world projects.
Table of Contents (8 chapters)

5. Model Validation and Optimization

Activity 5.01: Hyperparameter Tuning and Model Selection

Solution:

  1. Create a new Jupyter notebook and load the following libraries:
    import pandas as pd
    import numpy as np
    import datetime
    import time
    import os
    import matplotlib.pyplot as plt
    %matplotlib inline
    import seaborn as sns
    %config InlineBackend.figure_format='retina'
    sns.set() # Apply seaborn's default theme
    plt.rcParams['figure.figsize'] = (9, 6)
    plt.rcParams['axes.labelpad'] = 10
    sns.set_style("darkgrid")
    %load_ext watermark
    %watermark -d -v -m -p \
    numpy,pandas,matplotlib,seaborn,sklearn
  2. Load the preprocessed Human Resource Analytics dataset by running the following code:
    df = pd.read_csv('../data/hr-analytics/hr_data_processed_pca.csv')
    df.columns

    This displays the following output:

    Figure 5.10: The columns of hr_data_processed_pca.csv

  3. Select the features to include in the model and perform a train-test split on the data...
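As a rough illustration of step 3, the sketch below shows one way to select feature columns and perform a train-test split with scikit-learn. It does not use the book's dataset: the DataFrame, the column names (`feature_a`, `feature_b`, `target`), the 70/30 split ratio, and the random seed are all hypothetical placeholders, not values taken from the activity.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the preprocessed dataset; these column
# names are assumptions and do not come from hr_data_processed_pca.csv.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "target": rng.integers(0, 2, size=200),
})

# Select the feature columns and the target column.
features = ["feature_a", "feature_b"]
X = demo[features].values
y = demo["target"].values

# Hold out 30% of the rows for testing (an assumed ratio).
# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

print(X_train.shape, X_test.shape)
```

With the real data, `demo` would be replaced by the `df` loaded in step 2, and `features` by whichever columns the activity selects.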