Exploring Data Types
Depending on the source, raw data can take different forms. Common forms include tabular data, images, video, audio, and text. For example, the output from a temperature logger (used to record the temperature at a given location over time) is tabular. Tabular data is structured in rows and columns. In the temperature logger example, each column represents a characteristic of the records, such as the time, location, or temperature, while each row holds the values for a single record. The following table shows an example of numerical tabular data:
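As a minimal sketch of this row-and-column structure, the NumPy snippet below stores a small temperature log as a two-dimensional array. The column names and every reading are made up for illustration; they are not taken from a real logger.

```python
import numpy as np

# Hypothetical temperature-logger readings; all values are invented.
# Columns: hour of day, location ID, temperature in degrees Celsius.
column_names = ["hour", "location_id", "temperature_c"]
readings = np.array(
    [
        [0.0, 1.0, 12.5],
        [6.0, 1.0, 11.0],
        [12.0, 1.0, 18.5],
        [18.0, 1.0, 15.0],
    ],
    dtype=np.float32,
)

# Rows are records and columns are characteristics of each record.
num_records, num_features = readings.shape
print(readings.ndim)   # 2
print(readings.shape)  # (4, 3)

# Select one column (a characteristic) and one row (a record).
temperatures = readings[:, column_names.index("temperature_c")]
first_record = readings[0]
```

Indexing by column returns one characteristic across all records, while indexing by row returns all characteristics of one record.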
Image data is another common form of raw data that is popular for building machine learning models. Its popularity is due to the large volume of data that's available: smartphones and security cameras record all of life's moments, generating an enormous amount of data that can be used to train models.
The dimensions of image data for training are different from those of tabular data. Each image has a height and a width dimension, the color channels add a third dimension, and the number of images adds a fourth. As such, the input tensors for image data models are four-dimensional, whereas the input tensors for tabular data are two-dimensional. The following figure shows labeled training examples of boats and airplanes taken from the Open Images dataset (https://storage.googleapis.com/openimages/web/index.html); the images have been preprocessed so that they all have the same height and width. This data could be used, for example, to train a binary classification model to classify images as boats or airplanes:
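The four dimensions of an image batch can be sketched with NumPy as follows. The batch size, image size, channels-last axis order, and label encoding are all assumptions chosen for illustration, and random values stand in for real pixels; nothing here is loaded from the Open Images dataset.

```python
import numpy as np

# Assumed sizes for illustration only; the text does not specify them.
num_images, height, width, channels = 8, 64, 64, 3

# Stand-in for a preprocessed batch: random pixel values in [0, 1).
# Axis order (images, height, width, channels) is an assumed convention.
rng = np.random.default_rng(seed=0)
images = rng.random((num_images, height, width, channels), dtype=np.float32)

# One binary label per image: 0 = boat, 1 = airplane (arbitrary encoding).
labels = np.array([0, 1, 0, 1, 1, 0, 0, 1], dtype=np.int64)

print(images.ndim)   # 4
print(images.shape)  # (8, 64, 64, 3)

# A single image drops the leading axis and is three-dimensional.
single_image = images[0]
print(single_image.shape)  # (64, 64, 3)
```

Because every image shares the same height and width, the whole batch fits in one rectangular array, which is why the preprocessing step mentioned above matters.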
Other types of raw data that can be used to build machine learning models include text and audio. Like images, their popularity in the machine learning community stems from the large amount of data that's available. Both audio and text pose the challenge of having indeterminate sizes. You will explore how this challenge can be overcome later in this chapter. The following figure shows an audio sample with a sample rate of 44.1 kHz, which means the audio data is sampled 44,100 times per second. This is an example of the type of raw data that is input into virtual assistants, from which they decipher the request and act accordingly:
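To make the sample-rate arithmetic concrete, the sketch below generates two synthetic clips at 44.1 kHz and shows that their lengths depend on their durations. The helper names `num_samples` and `sine_clip`, the 440 Hz tone, and the clip durations are hypothetical choices for illustration, not part of any audio library.

```python
import numpy as np

SAMPLE_RATE_HZ = 44_100  # 44.1 kHz: 44,100 samples per second


def num_samples(duration_seconds, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in a clip of the given duration."""
    return int(round(duration_seconds * sample_rate_hz))


def sine_clip(duration_seconds, frequency_hz=440.0):
    """Synthetic mono clip: a pure tone standing in for recorded audio."""
    n = num_samples(duration_seconds)
    t = np.arange(n) / SAMPLE_RATE_HZ  # time of each sample in seconds
    return np.sin(2.0 * np.pi * frequency_hz * t).astype(np.float32)


short_clip = sine_clip(0.5)
long_clip = sine_clip(2.0)

# Clips of different durations produce arrays of different lengths,
# which is the indeterminate-size challenge described above.
print(short_clip.shape)  # (22050,)
print(long_clip.shape)   # (88200,)
```

Unlike the image batch, these two clips cannot be stacked into one rectangular array without first making their lengths match.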
Now that you know about some of the types of data you may encounter when building machine learning models, in the next section, you will uncover ways to preprocess different types of data.