Without further ado, let's get stuck in with building your first ML model in TensorFlow.
The problem we will tackle in this chapter is that of correctly identifying the species of Iris flower from four given feature values. This is a classic ML problem that is extremely easy to solve, but will provide us with a nice way to introduce the basics of constructing graphs, feeding data, and training an ML model in TensorFlow.
The Iris dataset is made up of 150 data points, and each data point has four corresponding features: petal length, petal width, sepal length, and sepal width, along with the target label. Our task is to build a model that can infer the target label of any iris given only these four features.
Let's start by loading in our data and processing it. TensorFlow has a built-in function to import this particular dataset for us, so let's go ahead and use that. As our dataset is very small, it is practical to load the whole dataset into memory; however, this is not recommended for larger datasets, and you will learn better ways of dealing with this issue in the coming chapters. The following code block loads our data for us; an explanation of it follows.
    import tensorflow as tf
    import numpy as np

    # Set random seed for reproducibility.
    np.random.seed(0)

    # Load the whole Iris dataset into memory.
    data, labels = tf.contrib.learn.datasets.load_dataset("iris")
    num_elements = len(labels)

    # Use shuffled indexing to shuffle dataset.
    shuffled_indices = np.arange(num_elements)
    np.random.shuffle(shuffled_indices)
    shuffled_data = data[shuffled_indices]
    shuffled_labels = labels[shuffled_indices]

    # Transform labels into one hot vectors.
    one_hot_labels = np.zeros([num_elements, 3], dtype=int)
    one_hot_labels[np.arange(num_elements), shuffled_labels] = 1

    # Split data into training and testing sets.
    # The labels are taken from the one hot encoded array.
    train_data = shuffled_data[0:105]
    train_labels = one_hot_labels[0:105]
    test_data = shuffled_data[105:]
    test_labels = one_hot_labels[105:]
Let's walk through this code and see what we have done so far. After importing TensorFlow and NumPy, we load the whole dataset into memory. Our data consists of four numerical features that are represented as a vector. We have 150 total data points, so our data will be a matrix of shape 150 x 4, where each row represents a different datapoint and each column is a different feature. Each data point also has a target label associated with it, which is stored in a separate label vector.
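To make the row and column layout concrete, here is a minimal sketch that uses placeholder arrays of the same shapes as the loaded data. The names `stand_in_data` and `stand_in_labels` and the zero values are assumptions for illustration only; they are not the real Iris measurements.

```python
import numpy as np

# Placeholder arrays with the same shapes as the loaded Iris data.
# The zeros are stand-ins, not real measurements.
stand_in_data = np.zeros((150, 4))
stand_in_labels = np.zeros(150, dtype=int)

print(stand_in_data.shape)    # (150, 4)
print(stand_in_labels.shape)  # (150,)

# A row is one datapoint: its four feature values.
first_point = stand_in_data[0]

# A column is one feature measured across all datapoints.
first_feature = stand_in_data[:, 0]
```

Indexing with a single row number gives a vector of four features, while slicing a column gives 150 values of one feature.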
Next, we shuffle the dataset. This is important because we split the data by position: without shuffling, the training and test sets could end up with very different mixes of classes, or even with all of one type of data in one set.
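The listing shuffles an array of indices rather than the data itself, so the same permutation can be applied to both the features and the labels, and each row keeps its own label. The sketch below shows this on a toy array. The toy values and variable names are illustrative assumptions, not part of the Iris dataset.

```python
import numpy as np

# Toy stand-in data (illustrative only): row i holds the
# features [i, i], and its label is i.
toy_data = np.array([[0, 0], [1, 1], [2, 2], [3, 3], [4, 4]])
toy_labels = np.array([0, 1, 2, 3, 4])

# A seeded generator makes the shuffle repeatable between runs.
rng = np.random.RandomState(0)

# Shuffle the indices once, then index both arrays with them.
indices = np.arange(len(toy_labels))
rng.shuffle(indices)
toy_data_shuffled = toy_data[indices]
toy_labels_shuffled = toy_labels[indices]

# Each row is still paired with its original label.
still_aligned = np.array_equal(toy_data_shuffled[:, 0], toy_labels_shuffled)
print(still_aligned)  # True
```

Shuffling the two arrays independently would break this pairing, which is why a single index array is used for both.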
After shuffling, we do some preprocessing on the data labels. The labels loaded with the dataset are just a 150-length vector of integers representing which target class each datapoint belongs to, either 0, 1, or 2 in this case. When creating machine learning models, we like to transform our labels into a new form that is easier to work with by doing something called one-hot encoding.
Rather than a single number being the label for each datapoint, we use vectors instead. Each vector will be as long as the number of different target classes you have. So for example, if you have 5 target classes then each vector will have 5 elements; if you have 1,000 target classes then each vector will have 1,000 elements. Each column in the vectors represents one of our target classes, and we use binary values to identify which class the vector is the label for. This is done by setting all values to 0 and putting a 1 in the column for the class we want the vector label to represent.
This is easily understood with an example. For labels in this particular problem, the transformed vectors will look like this:
0 = [1,0,0]
1 = [0,1,0]
2 = [0,0,1]
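As a small sketch of the same idea in isolation, the helper below encodes a short list of labels and then recovers them again. The function name `one_hot` and the example labels are hypothetical, and the sketch assumes the labels are integers counting up from 0, which is what the column indexing in the listing relies on.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return an array of shape (len(labels), num_classes) of one-hot rows.

    Hypothetical helper; assumes integer labels in 0..num_classes-1.
    """
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes), dtype=int)
    # For row i, set the column given by labels[i] to 1.
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

example_labels = [0, 2, 1, 2]
encoded = one_hot(example_labels, num_classes=3)
print(encoded)

# Taking the argmax of each row recovers the original integer labels.
decoded = encoded.argmax(axis=1)
print(decoded)
```

Because each row contains exactly one 1, the position of that 1 is enough to get the integer label back when needed.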
Finally, we take part of our dataset and put it to one side. This is known as our test set and we will not touch it until after we have trained our model. This set is used to evaluate how well our trained model performs on new data that it hasn't seen before. There are many approaches to how you should split your data up into training and test sets, and we will go into detail about them all later in the book.
For now though, we'll do a simple 70:30 split, so we use only 70% of our total data (the first 105 of the 150 shuffled datapoints) to train our model and then test on the remaining 30% (the last 45 datapoints).
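The listing hard-codes the split point at 105. One way to derive it from the split fraction instead is sketched below. The helper `split_train_test` and the placeholder arrays are assumptions for illustration, and the sketch assumes the arrays have already been shuffled.

```python
import numpy as np

def split_train_test(data, labels, train_fraction=0.7):
    """Split already-shuffled arrays into training and test portions.

    Hypothetical helper, not part of TensorFlow or NumPy.
    """
    num_train = int(round(len(data) * train_fraction))
    train = (data[:num_train], labels[:num_train])
    test = (data[num_train:], labels[num_train:])
    return train, test

# Placeholder arrays with the same shapes as the Iris features and labels.
placeholder_data = np.zeros((150, 4))
placeholder_labels = np.zeros(150, dtype=int)

(train_x, train_y), (test_x, test_y) = split_train_test(
    placeholder_data, placeholder_labels)
print(len(train_x), len(test_x))  # 105 45
```

Computing the split point from the fraction means the same code keeps working if the dataset size changes.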