Getting to know your data
For many years, researchers argued about which is more important: data or algorithms. Today, the importance of data over algorithms is generally accepted among ML specialists. In most cases, we can assume that whoever has better data beats those with more advanced algorithms. Garbage in, garbage out: this rule holds true in ML more than anywhere else. To succeed in this domain, you need not only to have data, but also to know your data and know what to do with it.
ML datasets are usually composed of individual observations, called samples, cases, or data points. In the simplest case, each sample has several features.
Features
When we talk about features in the context of ML, we mean some characteristic property of the object or phenomenon we are investigating.
Note
Other names for the same concept you'll see in some publications are explanatory variable, independent variable, and predictor.
Features are used to distinguish objects from each other and to measure the similarity between them.
For instance:
- If the objects of interest are books, features could be the title, page count, author's name, year of publication, genre, and so on
- If the objects of interest are images, features could be intensities of each pixel
- If the objects are blog posts, features could be language, length, or presence of some terms
Note
It's useful to imagine your data as a spreadsheet table. In this case, each sample (data point) would be a row, and each feature would be a column. For example, Table 1.1 shows a tiny dataset of books consisting of four samples where each has eight features.
Table 1.1: an example of an ML dataset (dummy books):
Title | Author's name | Pages | Year | Genre | Average readers review score | Publisher | In stock |
Learn ML in 21 Days | Machine Learner | 354 | 2018 | Sci-Fi | 3.9 | Untitled United | False |
101 Tips to Survive an Asteroid Impact | Enrique Drills | 124 | 2021 | Self-help | 4.7 | Vacuum Books | True |
Sleeping on the Keyboard | Jessica's Cat | 458 | 2014 | Non-fiction | 3.5 | JhGJgh Inc. | True |
Quantum Screwdriver: Heritage | Yessenia Purnima | 1550 | 2018 | Sci-Fi | 4.2 | Vacuum Books | True |
Types of features
In the books example, you can see several types of features:
- Categorical or unordered: Title, author, genre, publisher. They are similar to an enumeration without raw values in Swift, but with one difference: they have levels instead of cases. Important: you can't order them or say that one is bigger than another.
- Binary: The presence or absence of something, just true or false. In our case, the In stock feature.
- Real numbers: Page count, year, average reader's review score. In Swift, these can be represented as Float or Double.
There are others, but these are by far the most common.
The most common ML algorithms require the dataset to consist of a number of samples, where each sample is represented by a vector of real numbers (feature vector), and all samples have the same number of features. The simplest (but not the best) way of translating categorical features into real numbers is by replacing them with numerical codes (Table 1.2).
Table 1.2: dummy books dataset after simple preprocessing:
Title | Author's name | Pages | Year | Genre | Average readers review score | Publisher | In stock |
0.0 | 0.0 | 354.0 | 2018.0 | 0.0 | 3.9 | 0.0 | 0.0 |
1.0 | 1.0 | 124.0 | 2021.0 | 1.0 | 4.7 | 1.0 | 1.0 |
2.0 | 2.0 | 458.0 | 2014.0 | 2.0 | 3.5 | 2.0 | 1.0 |
3.0 | 3.0 | 1550.0 | 2018.0 | 0.0 | 4.2 | 1.0 | 1.0 |
This is an example of how your dataset may look before you feed it into your ML algorithm. Later, we will discuss the nuts and bolts of data preprocessing for specific applications.
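The translation from Table 1.1 to Table 1.2 can be sketched in a few lines of Swift. This is only an illustration under stated assumptions: the helper name `encodeCategories` is hypothetical, and assigning codes in order of first appearance is one possible convention that happens to reproduce the Genre column of Table 1.2.

```swift
// Hypothetical helper: maps each distinct category to a numeric code,
// assigned in order of first appearance (0.0, 1.0, 2.0, ...).
func encodeCategories(_ values: [String]) -> [Double] {
    var codes: [String: Double] = [:]
    return values.map { (value: String) -> Double in
        if let code = codes[value] {
            return code
        }
        let code = Double(codes.count)
        codes[value] = code
        return code
    }
}

// The Genre and In stock columns from Table 1.1.
let genres = ["Sci-Fi", "Self-help", "Non-fiction", "Sci-Fi"]
let inStock = [false, true, true, true]

let genreCodes = encodeCategories(genres)
let inStockCodes = inStock.map { $0 ? 1.0 : 0.0 }

print(genreCodes)   // [0.0, 1.0, 2.0, 0.0]
print(inStockCodes) // [0.0, 1.0, 1.0, 1.0]
```

Note that the codes impose an artificial order on the categories (Non-fiction is not "greater than" Sci-Fi), which is one reason this encoding is the simplest but not the best.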
Choosing a good set of features
For ML purposes, it's necessary to choose a reasonable set of features, not too many and not too few:
- If you have too few features, this information may not be sufficient for your model to achieve the required quality. In this case, you want to construct new ones from existing features, or extract more features from the raw data.
- If you have too many features, you want to select only the most informative and discriminative ones, because the more features you have, the more complex your computations become.
How do you tell which features are most important? Sometimes common sense helps. For example, if you are building a model that recommends books, the genre and average rating of a book are probably more important features than its number of pages and year of publication. But what if your features are just the pixels of a picture and you're building a face recognition system? For a black and white image of size 1024 x 768, we'd get 786,432 features. Which pixels are most important? In this case, you have to apply some algorithms to extract meaningful features. For example, in computer vision, edges, corners, and blobs are more informative features than raw pixels, so there are plenty of algorithms to extract them (Figure 1.1). By passing your image through some filters, you can get rid of unimportant information and reduce the number of features significantly: from hundreds of thousands to hundreds, or even tens. Techniques that help to select the most important subset of existing features are known as feature selection, while feature extraction techniques result in the creation of new features:
Figure 1.1: Edge detection is a common feature extraction technique in computer vision. You can still recognize the object in the right image, despite it containing significantly less information than the left one.
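To make the idea of passing an image through a filter more concrete, here is a minimal sketch of edge detection with the Sobel operator, one well-known edge filter. The image is represented as a plain rectangular array of intensities, and the function name `edgeMagnitude` is a hypothetical choice for this illustration; the text does not say which filter was used to produce the figure.

```swift
// Returns the gradient magnitude for every interior pixel of a
// rectangular grayscale image. Border pixels are skipped, so the
// result is smaller than the input by 2 in each dimension.
func edgeMagnitude(_ image: [[Double]]) -> [[Double]] {
    // Sobel kernels for horizontal and vertical intensity changes.
    let sobelX: [[Double]] = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    let sobelY: [[Double]] = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

    let height = image.count
    let width = image.first?.count ?? 0
    guard height >= 3, width >= 3 else { return [] }

    var result: [[Double]] = []
    for y in 1..<(height - 1) {
        var row: [Double] = []
        for x in 1..<(width - 1) {
            var gx = 0.0
            var gy = 0.0
            for ky in 0..<3 {
                for kx in 0..<3 {
                    let pixel = image[y + ky - 1][x + kx - 1]
                    gx += sobelX[ky][kx] * pixel
                    gy += sobelY[ky][kx] * pixel
                }
            }
            row.append((gx * gx + gy * gy).squareRoot())
        }
        result.append(row)
    }
    return result
}

// A tiny image: dark left half (0), bright right half (1).
let image: [[Double]] = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

let edges = edgeMagnitude(image)
print(edges) // [[0.0, 4.0, 4.0, 0.0], [0.0, 4.0, 4.0, 0.0]]
```

The response is non-zero only where the intensity changes, so the uniform regions of the image contribute nothing to the extracted features.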
Feature extraction, selection, and combination are a kind of art known as feature engineering. It requires not only hacking and statistical skills but also domain knowledge. We will see some feature engineering techniques while working on practical applications in the following chapters. We will also step into the exciting world of deep learning: a technique that gives a computer the ability to extract high-level abstract features from low-level features.
The number of features you have for each sample (or the length of the feature vector) is usually referred to as the dimensionality of the problem. Many problems are high-dimensional, with hundreds or even thousands of features. Even worse, some of those problems are sparse; that is, for each data point, most of the features are zero or missing. This is a common situation in recommender systems. For instance, imagine building a dataset of movie ratings: the rows are movies, the columns are users, and each cell holds the rating given to the movie by the user. The majority of the cells in the table will remain empty, as most users will never have watched most of the movies. The opposite situation is called dense, which is when most values are in place. Many problems in natural language processing and bioinformatics are high-dimensional, sparse, or both.
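The movie ratings example suggests a natural way to store sparse data: keep only the cells that actually contain a value. The following is a minimal sketch of that idea; the type names `Cell` and `SparseRatings` and the table sizes are hypothetical, chosen only for illustration.

```swift
// Hypothetical key type: identifies one cell of the movie-by-user table.
struct Cell: Hashable {
    let movie: Int
    let user: Int
}

// Sparse storage: only the ratings that actually exist are kept.
struct SparseRatings {
    let movieCount: Int
    let userCount: Int
    private(set) var ratings: [Cell: Double] = [:]

    init(movieCount: Int, userCount: Int) {
        self.movieCount = movieCount
        self.userCount = userCount
    }

    mutating func rate(movie: Int, user: Int, rating: Double) {
        ratings[Cell(movie: movie, user: user)] = rating
    }

    // nil means "this user has not rated this movie".
    func rating(movie: Int, user: Int) -> Double? {
        return ratings[Cell(movie: movie, user: user)]
    }

    // Fraction of cells that hold a value (close to 1.0 for dense data).
    var density: Double {
        return Double(ratings.count) / Double(movieCount * userCount)
    }
}

var table = SparseRatings(movieCount: 1000, userCount: 5000)
table.rate(movie: 3, user: 42, rating: 4.5)
table.rate(movie: 17, user: 42, rating: 2.0)
table.rate(movie: 3, user: 7, rating: 5.0)

print(table.ratings.count) // 3 stored values instead of 5,000,000 cells
print(table.density)
```

Returning an optional makes the difference between a missing rating and a rating of zero explicit, which a dense array of numbers cannot express without a sentinel value.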
Feature selection and extraction help to decrease the number of features without significant loss of information, so we also call them dimensionality reduction algorithms.
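As a small illustration of feature selection, the sketch below drops features that barely vary across samples, since a column that is the same for every sample cannot help to distinguish them. This variance threshold is just one simple heuristic picked for the example, not the method the following chapters rely on; the function names and the toy data are hypothetical.

```swift
// Each row is a sample, each column a feature (hypothetical toy data:
// a constant column, page count, and review score).
let samples: [[Double]] = [
    [1.0, 354.0, 3.9],
    [1.0, 124.0, 4.7],
    [1.0, 458.0, 3.5],
    [1.0, 1550.0, 4.2],
]

// Population variance of one feature column (expects a non-empty array).
func variance(_ values: [Double]) -> Double {
    let mean = values.reduce(0, +) / Double(values.count)
    let squaredDeviations = values.map { ($0 - mean) * ($0 - mean) }
    return squaredDeviations.reduce(0, +) / Double(values.count)
}

// Returns the indices of the features whose variance exceeds the threshold.
// Assumes all samples have the same number of features.
func selectFeatures(_ samples: [[Double]], minVariance: Double) -> [Int] {
    guard let featureCount = samples.first?.count else { return [] }
    return (0..<featureCount).filter { column in
        variance(samples.map { $0[column] }) > minVariance
    }
}

let kept = selectFeatures(samples, minVariance: 1e-9)
let reduced = samples.map { row in kept.map { row[$0] } }

print(kept)       // [1, 2]
print(reduced[0]) // [354.0, 3.9]
```

The dimensionality drops from three to two without losing any information that separates the samples.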
Getting the dataset
Datasets can be obtained from different sources. The ones important for us are:
- Classical datasets such as Iris (botanical measurements of flowers published by R. Fisher in 1936), MNIST (70,000 handwritten digits, 60,000 for training and 10,000 for testing, published in 1998), Titanic (personal information of Titanic passengers from Encyclopedia Titanica and other sources), and others. Many classical datasets are available as part of Python and R ML packages. They represent some classical types of ML tasks and are useful for demonstrations of algorithms. Meanwhile, there is no similar library for Swift. Implementing such a library would be straightforward and is low-hanging fruit for anyone who wants to get some stars on GitHub.
- Open and commercial dataset repositories. Many institutions release their data for everyone's needs under different licenses. You can use such data for training production models or while collecting your own dataset.
Some public dataset repositories include:
The UCI ML repository: https://archive.ics.uci.edu/ml/datasets.html
Kaggle datasets: https://www.kaggle.com/datasets
data.world, a social network for dataset sharing: https://data.world
Note
To find more, visit the list of repositories at KDnuggets: http://www.kdnuggets.com/datasets/index.html. Alternatively, you'll find a list of datasets at Wikipedia: https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research.
- Data collection (acquisition) is required if no existing data can help you to solve your problem. This approach can be costly in both resources and time if you have to collect the data ad hoc; however, in many cases, you have data as a byproduct of some other process, and you can compose your dataset by extracting useful information from it. For example, text corpora can be composed by crawling Wikipedia or news sites. iOS automatically collects some useful data. HealthKit is a unified database of users' health measurements. Core Motion allows getting historical data on the user's motion activities. The ResearchKit framework provides standardized routines to assess the user's health conditions. The CareKit framework standardizes the polls. Also, in some cases, useful information can be obtained from app log mining.
- In many cases, collecting data is not enough, as raw data doesn't suit many ML tasks well. So, the next step after data collection is data labeling. For example, if you have collected a dataset of images, you now have to attach a label to each of them: to which category does this image belong? This can be done manually (often at considerable expense), automatically (sometimes impossible), or semi-automatically. Manual labeling can be scaled by means of crowdsourcing platforms, like Amazon Mechanical Turk.
- Random data generation can be useful for a quick check of your ideas or in combination with the TDD approach. Also, sometimes adding some controlled randomness to your real data can improve the results of learning. This approach is known as data augmentation. For instance, this approach was taken to build an optical character recognition feature in the Google Translate mobile app. To train their model, they needed a lot of real-world photos with letters in different languages, which they didn't have. The engineering team bypassed this problem by creating a large dataset of letters with artificial reflections, smudges, and all kinds of corruptions on them. This improved the recognition quality significantly.
- Real-time data sources, such as inertial sensors, GPS, camera, microphone, elevation sensor, proximity sensor, touch screen, force touch, and Apple Watch sensors can be used to collect a standalone dataset or to train a model on the fly.
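The data augmentation idea from the list above can be sketched as follows: each real sample is turned into several slightly perturbed copies. This is a deliberately simplified stand-in (uniform noise added to each feature) rather than the reflections and smudges described for Google Translate. The `SeededGenerator` type is a hypothetical helper based on the SplitMix64 algorithm, included only so that the example is reproducible.

```swift
// Small deterministic generator so the sketch is reproducible
// (hypothetical helper; Swift's default generator cannot be seeded).
struct SeededGenerator: RandomNumberGenerator {
    private var state: UInt64

    init(seed: UInt64) {
        state = seed
    }

    // SplitMix64 step.
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

// Returns `copies` noisy variants of a sample; each feature is shifted
// by a random amount in -noise...noise (noise must not be negative).
func augment<G: RandomNumberGenerator>(
    _ sample: [Double], copies: Int, noise: Double, using generator: inout G
) -> [[Double]] {
    var result: [[Double]] = []
    for _ in 0..<copies {
        var copy: [Double] = []
        for value in sample {
            let shift = Double.random(in: -noise...noise, using: &generator)
            copy.append(value + shift)
        }
        result.append(copy)
    }
    return result
}

var generator = SeededGenerator(seed: 42)
let pixels = [0.0, 0.5, 1.0, 0.5]
let augmented = augment(pixels, copies: 3, noise: 0.05, using: &generator)

for copy in augmented {
    print(copy)
}
```

The noise level controls how far the artificial samples may drift from the real one; too much noise would destroy the signal the model is supposed to learn.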
Note
Real-time data sources are especially important for a special class of ML models called online ML, which allows models to incorporate new data as it arrives. A good example of such a situation is spam filtering, where the model should dynamically adapt to new data. It's the opposite of batch learning, where the whole training dataset should be available from the very beginning.
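To illustrate the difference from batch learning, here is a minimal sketch of an online learner: a perceptron that adjusts its weights after every single labeled sample. The perceptron is chosen only because it is short; the type name, the two features, and the message stream are all hypothetical.

```swift
// A minimal online learner: a perceptron that updates its weights one
// sample at a time instead of requiring the whole dataset up front.
struct OnlinePerceptron {
    private(set) var weights: [Double]
    private(set) var bias = 0.0
    let learningRate: Double

    init(featureCount: Int, learningRate: Double = 1.0) {
        weights = Array(repeating: 0.0, count: featureCount)
        self.learningRate = learningRate
    }

    func predict(_ features: [Double]) -> Bool {
        var sum = bias
        for (weight, feature) in zip(weights, features) {
            sum += weight * feature
        }
        return sum > 0
    }

    // Incorporates one new labeled sample; weights change only on a mistake.
    // `features` must have as many elements as `weights`.
    mutating func update(_ features: [Double], isSpam: Bool) {
        guard predict(features) != isSpam else { return }
        let direction = isSpam ? 1.0 : -1.0
        for index in weights.indices {
            weights[index] += learningRate * direction * features[index]
        }
        bias += learningRate * direction
    }
}

// Hypothetical features: [contains "free", contains "meeting"].
var spamFilter = OnlinePerceptron(featureCount: 2)
let stream: [([Double], Bool)] = [
    ([1, 0], true), ([0, 1], false), ([1, 0], true), ([0, 1], false),
]

// Messages arrive one by one; the model adapts after each of them.
for (features, isSpam) in stream {
    spamFilter.update(features, isSpam: isSpam)
}

print(spamFilter.predict([1, 0])) // true
print(spamFilter.predict([0, 1])) // false
```

A batch learner would need the whole `stream` array before training could start, whereas this model is usable, if imperfect, after the very first message.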
Data preprocessing
The useful information in the data is usually referred to as a signal. On the other hand, the pieces of data that represent errors of different kinds and irrelevant data are known as noise. Errors can enter the data during measurement or information transmission, or through human error. The goal of data cleansing procedures is to increase the signal-to-noise ratio. During this stage, you will usually transform all data to one format, delete entries with missing values, and check suspicious outliers (they can be both noise and signal). It is widely believed among ML engineers that the data preprocessing stage usually consumes 90% of the time allocated for the ML project. Then, algorithm tweaking consumes another 90% of the time. This statement is a joke only partially (about 10% of it). In Chapter 13, Best Practices, we are going to discuss common problems with the data and how to fix them.
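Two of the cleansing steps mentioned above, deleting entries with missing values and checking suspicious outliers, can be sketched like this. The outlier rule used here (distance from the median measured in median absolute deviations, with a cutoff of 5) is an assumption made for this example, as are the function names and the data.

```swift
// Hypothetical raw page counts; nil marks a missing value.
let raw: [Double?] = [354, 124, nil, 458, 15500, 300, nil, 410]

// Step 1: delete entries with missing values.
let complete = raw.compactMap { $0 }

// Median of a non-empty array.
func median(_ values: [Double]) -> Double {
    precondition(!values.isEmpty, "median of an empty array is undefined")
    let sorted = values.sorted()
    let middle = sorted.count / 2
    return sorted.count % 2 == 0
        ? (sorted[middle - 1] + sorted[middle]) / 2
        : sorted[middle]
}

// Step 2: flag values that lie far from the median, measured in units
// of the median absolute deviation. Flagged values are only reported,
// not removed, because an outlier can be either noise or signal.
func suspiciousValues(in values: [Double], cutoff: Double = 5) -> [Double] {
    let center = median(values)
    let spread = median(values.map { abs($0 - center) })
    return values.filter { abs($0 - center) > cutoff * spread }
}

let suspicious = suspiciousValues(in: complete)

print(complete.count) // 6
print(suspicious)     // [15500.0]
```

Whether 15,500 pages is a typo or an unusually long book is a question only a human or additional data can answer, which is why the sketch reports the value instead of deleting it.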