Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying Machine Learning Fundamentals
  • Table Of Contents Toc
  • Feedback & Rating feedback
Machine Learning Fundamentals

Machine Learning Fundamentals

By : Hyatt Saleh
5 (3)
close
close
Machine Learning Fundamentals

Machine Learning Fundamentals

5 (3)
By: Hyatt Saleh

Overview of this book

As machine learning algorithms become popular, new tools that optimize these algorithms are also developed. Machine Learning Fundamentals explains you how to use the syntax of scikit-learn. You'll study the difference between supervised and unsupervised models, as well as the importance of choosing the appropriate algorithm for each dataset. You'll apply unsupervised clustering algorithms over real-world datasets, to discover patterns and profiles, and explore the process to solve an unsupervised machine learning problem. The focus of the book then shifts to supervised learning algorithms. You'll learn to implement different supervised algorithms and develop neural network structures using the scikit-learn package. You'll also learn how to perform coherent result analysis to improve the performance of the algorithm by tuning hyperparameters. By the end of this book, you will have gain all the skills required to start programming machine learning algorithms.
Table of Contents (8 chapters)
close
close

Data Representation

The main objective of machine learning is to build models by interpreting data. To do so, it is highly important to feed the data in a way that is readable by the computer. To feed data into a scikit-learn model, it must be represented as a table or matrix of the required dimension, which will be discussed in the following section.

Tables of Data

Most tables fed into machine learning problems are two-dimensional, meaning that they contain rows and columns. Conventionally, each row represents an observation (an instance), whereas each column represents a characteristic (feature) of each observation.

The following table is a fragment of a sample dataset of scikit-learn. The purpose of the dataset is to differentiate from among three types of iris plants based on their characteristics. Hence, in the table, each row embodies a plant and each column denotes the value of that feature for every plant:

Figure 1.1: An image showing the first 10 instances of the iris dataset
Figure 1.1: A table showing the first 10 instances of the iris dataset

From the preceding explanation, the following snapshot shows data that corresponds to a plant with sepal length of 5.1, sepal width of 3.5, petal length of 1.4, and petal width of 0.2. The plant belongs to the setosa species:

Figure 1.2: The first instance of the iris dataset
Figure 1.2: The first instance of the iris dataset

Note

When feeding images to a model, the tables become three-dimensional, where the rows and columns represent the dimensions of the image in pixels, while the depth represents its color scheme. If you are interested, feel free to explore more on the subject of convolutional neural networks.

Features and Target Matrices

For many data problems, one of the features of your dataset will be used as a label. This means that out of all the other features, this one is the target to which the model should generalize the data. For example, in the preceding table, we might choose the species as the target feature, and so we would like the model to find patterns based on the other features to determine whether a plant belongs to the setosa species. Therefore, it is important to learn how to separate the target matrix from the features matrix.

Features Matrix: The features matrix comprises data from each instance for all features, except the target. It can be either created using a NumPy array or a Pandas DataFrame, and its dimensions are [n_i, n_f], where n_i denotes the number of instances (such as a person) and n_f denotes the number of features (such as age). Generally, the features matrix is stored in a variable named X.

Target Matrix: Different than the features matrix, the target matrix is usually one-dimensional since it only carries one feature for all instances, meaning that its length is of value n_i (number of instances). Nevertheless, there are some occasions where multiple targets are required, and so the dimensions of the matrix become [n_i, n_t], where n_t is the number of targets to consider.

Similar to the features matrix, the target matrix is usually created as a NumPy array or a Pandas series. The values of the target array may be discrete or continuous. Generally, the target matrix is stored in a variable named Y.

Exercise 1: Loading a Sample Dataset and Creating the Features and Target Matrices

Note

All of the exercises and activities in these chapters will be primarily developed in Jupyter Notebook. It is recommended to keep a separate notebook for different assignments, unless advised otherwise. Also, to load a sample dataset, the seaborn library will be used, as it displays the data as a table. Other ways to load data will be explained in further sections.

In this exercise, we will be loading the iris dataset, and creating features and target matrices using this dataset.

Note

For the exercises and activities within this chapter, you will need to have Python 3.6, seaborn, Jupyter, Matplotlib, and Pandas installed on your system.

  1. Open a Jupyter Notebook to implement this exercise. In the cmd or terminal, navigate to the desired path and use the following command: jupyter notebook.
  2. Load the iris dataset using the seaborn library. To do so, you first need to import the seaborn library, and then use the load_dataset() function, as shown in the following code:
    import seaborn as sns
    iris = sns.load_dataset('iris')

    As we can see from the preceding code, after importing the library, a nickname is given to facilitate its use along with the script.

    The load_dataset() function loads datasets from an online repository. The data from the dataset is stored in a variable named iris.

  3. Create a variable, X, to store the features. Use the drop() function to include all of the features but the target, which in this case is named species. Then, print out the top 10 instances of the variable:
    X = iris.drop('species', axis=1)
    X.head(10)

    Note

    The axis parameter in the preceding snippet denotes whether you want to drop the label from rows (axis = 0) or columns (axis = 1).

    The printed output should look as follows:

    Figure 1.3: A screenshot showing the first 10 instances of the features matrix
    Figure 1.3: A table showing the first 10 instances of the features matrix
  4. Print the shape of your new variable using the X.shape command:
    X.shape
    (150, 4)

    The first value indicates the number of instances in the dataset (150), and the second value represents the number of features (4).

  5. Create a variable, Y, that will store the target values. There is no need to use a function for this. Use indexing to grab only the desired column. Indexing allows you to access a section of a larger element. In this case, we want to grab the column named species. Then, print out the top 10 values of the variable:
    Y = iris['species']
    Y.head(10)

    The printed output should look as follows:

    Figure 1.4: A screenshot showing the first 10 instances of the target matrix
    Figure 1.4: A screenshot showing the first 10 instances of the target matrix
  6. Print the shape of your new variable by using the Y.shape command:
    Y.shape
    (150,) 

    The shape should be one-dimensional with length equal to the number of instances (150).

Congratulations! You have successfully created the features and target matrices of a dataset.

Generally, the preferred way to represent data is by using two-dimensional tables, where the rows represent the number of observations, also known as instances, and the columns represent the characteristics of those instances, commonly known as features.

For data problems that require target labels, the data table needs to be partitioned into a features matrix and a target matrix. The features matrix will contain the values of all features but the target, for each instance, making it a two-dimensional matrix. On the other hand, the target matrix will only contain the value of the target feature for all entries, making it a one-dimensional matrix.

Activity 1: Selecting a Target Feature and Creating a Target Matrix

In this activity, we will attempt to load a dataset and create the features and target matrices by choosing the appropriate target feature for the objective of the study. Let's look at the following scenario: you work in the safety department of a cruise company. The company wants to include more lower-deck cabins, but it wants to be sure that the measure will not increase the number of fatalities in the case of an accident. The company has provided your team with a dataset of the Titanic passenger list to determine whether lower-deck passengers are less likely to survive. Your job is to select the target feature that most likely helps to achieve this objective.

Note

To choose the target feature, remember that the target should be the outcome to which we want to interpret the data for. For instance, if we want to know what features play a role in determining a plant's species, the species should be the target value.

Follow the steps below to complete this activity:

  1. Load the titanic dataset using the seaborn library. The first couple of rows should look like this:
    Figure 1.5: An image showing the first 10 instances of the titanic dataset
    Figure 1.5: An table showing the first 10 instances of the Titanic dataset
  2. Select your preferred target feature for the goal of this activity.
  3. Create both the features matrix and the target matrix. Make sure that you store the data from the features matrix in a variable, X, and the data from the target matrix in another variable, Y.
  4. Print out the shape of each of the matrices, which should match the following values:

    Features matrix: (891,14)

    Target matrix: (891)

    Note

    The solution for this activity can be found on page 178.

Visually different images
CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
Machine Learning Fundamentals
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon