Labeling Data
Artificial intelligence (AI) models are only as good as the data they are trained on, so high-quality data is essential.
AI algorithms generally start in a basic, simplified form. In supervised learning, accurately labeling (also known as annotating) data is a critical step in training an algorithm, improving its predictions, and ensuring that what it learns is correct. Numerous studies, reports, and surveys indicate that data scientists spend between 50% and 80% of their time on data preparation and preprocessing (see Figure 3.1), and data labeling is usually a large part of this.
Figure 3.1 – Distribution of time allocated to machine learning tasks
In this chapter, you will learn why it is important to ensure that data is labeled correctly; how this can be achieved; how to assess whether it has indeed been achieved; and in particular, how to identify annotators who have not carried out the task to...