Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying Data Labeling in Machine Learning with Python
  • Table Of Contents Toc
Data Labeling in Machine Learning with Python

Data Labeling in Machine Learning with Python

By : Vijaya Kumar Suda
5 (3)
close
close
Data Labeling in Machine Learning with Python

Data Labeling in Machine Learning with Python

5 (3)
By: Vijaya Kumar Suda

Overview of this book

Data labeling is the invisible hand that guides the power of artificial intelligence and machine learning. In today’s data-driven world, mastering data labeling is not just an advantage, it’s a necessity. Data Labeling in Machine Learning with Python empowers you to unearth value from raw data, create intelligent systems, and influence the course of technological evolution. With this book, you'll discover the art of employing summary statistics, weak supervision, programmatic rules, and heuristics to assign labels to unlabeled training data programmatically. As you progress, you'll be able to enhance your datasets by mastering the intricacies of semi-supervised learning and data augmentation. Venturing further into the data landscape, you'll immerse yourself in the annotation of image, video, and audio data, harnessing the power of Python libraries such as seaborn, matplotlib, cv2, librosa, openai, and langchain. With hands-on guidance and practical examples, you'll gain proficiency in annotating diverse data types effectively. By the end of this book, you’ll have the practical expertise to programmatically label diverse data types and enhance datasets, unlocking the full potential of your data.
Table of Contents (18 chapters)
close
close
1
Part 1: Labeling Tabular Data
5
Part 2: Labeling Image Data
9
Part 3: Labeling Text, Audio, and Video Data

Introducing Pandas DataFrames

Pandas is an open source library used for data analysis and manipulation. It provides various functions for data wrangling, cleaning, and merging operations. Let us see how to explore data using the pandas library. For this, we will use the Income dataset located on GitHub and explore it to find the following insights:

  • How many unique values are there for age, education, and profession in the Income dataset? What are the observations for each unique age?
  • Summary statistics such as mean value and quantile values for each feature. What is the average age of the adult for the income range > $50K?
  • How is income dependent on independent variables such as age, education, and profession using bivariate analysis?

Let us first read the data into a DataFrame using the pandas library.

A DataFrame is a structure that represents two-dimensional data with columns and rows, and it is similar to a SQL table. To get started, ensure that you create the requirements.txt file and add the required Python libraries as follows:

Figure 1.2 – Contents of the requirements.txt file

Figure 1.2 – Contents of the requirements.txt file

Next, run the following command from your Python notebook cell to install the libraries added in the requirements.txt file:

%pip install -r requirements.txt

Now, let’s import the required Python libraries using the following import statements:

# import libraries for loading dataset
import pandas as pd
import numpy as np
# import libraries for plotting
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import rcParams
%matplotlib inline
plt.style.use('dark_background')
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

Next, in the following code snippet, we are reading the adult_income.csv file and writing to the DataFrame (df):

# loading the dataset
df = pd.read_csv("<your file path>/adult_income.csv", encoding='latin-1)'

Now the data is loaded to df.

Let us see the size of the DataFrame using the following code snippet:

df.shape

We will see the shape of the DataFrame as a result:

Figure 1.3 – Shape of the DataFrame

Figure 1.3 – Shape of the DataFrame

So, we can see that there are 32,561 observations (rows) and 15 features (columns) in the dataset.

Let us print the 15 column names in the dataset:

df.columns

We get the following result:

Figure 1.4 – The names of the columns in our dataset

Figure 1.4 – The names of the columns in our dataset

Now, let’s see the first five rows of the data in the dataset with the following code:

df.head()

We can see the output in Figure 1.5:

Figure 1.5 – The first five rows of data

Figure 1.5 – The first five rows of data

Let’s see the last five rows of the dataset using tail, as shown in the following figure:

df.tail()

We will get the following output.

Figure 1.6 – The last five rows of data

Figure 1.6 – The last five rows of data

As we can see, education and education.num are redundant columns, as education.num is just the ordinal representation of the education column. So, we will remove the redundant education.num column from the dataset as one column is enough for model training. We will also drop the race column from the dataset using the following code snippet as we will not use it here:

# As we observe education and education.num both are the same , so we can drop one of the columns
df.drop(['education.num'], axis = 1, inplace = True)
df.drop(['race'], axis = 1, inplace = True)

Here, axis = 1 refers to the columns axis, which means that you are specifying that you want to drop a column. In this case, you are dropping the columns labeled education.num and race from the DataFrame.

Now, let’s print the columns using info() to make sure the race and education.num columns are dropped from the DataFrame:

df.info()

We will see the following output:

Figure 1.7 – Columns in the DataFrame

Figure 1.7 – Columns in the DataFrame

We can see in the preceding data there are now only 13 columns as we deleted 2 of them from the previous total of 15 columns.

In this section, we have seen what a Pandas DataFrame is and loaded a CSV dataset into one. We also saw the various columns in the DataFrame and their data types. In the following section, we will generate the summary statistics for the important features using Pandas.

CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
Data Labeling in Machine Learning with Python
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist download Download options font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon