## Logistic regression model building

Okay, let's get started with building a real machine learning model. First, we'll see the proposed machine learning problem: font classification. Then, we'll review a simple algorithm for classification, called **logistic regression**. Finally, we'll implement logistic regression in TensorFlow.

### Introducing the font classification dataset

Before we jump in, let's load all the necessary modules:

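```python
# Note: this section uses the TensorFlow 1.x API (tf.placeholder, tf.InteractiveSession)
import tensorflow as tf
import numpy as np
```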

If you're copying and pasting into IPython, make sure your `autoindent` property is set to `OFF`:

```ipython
%autoindent
```

The `tqdm` module is optional; it just shows nice progress bars:

```python
try:
    from tqdm import tqdm
except ImportError:
    # Fall back to a pass-through wrapper if tqdm is not installed
    def tqdm(x, *args, **kwargs):
        return x
```

Next, we'll set a seed of `0`, just to get consistent data splitting from run to run:

```python
# Set random seed
np.random.seed(0)
```

In this book, we've provided a dataset of character images in five fonts. For convenience, these are stored in a compressed NumPy file (`data_with_labels.npz`), which can be found in the download package of this book. You can easily load these into Python with `numpy.load`:

```python
# Load data
data = np.load('data_with_labels.npz')
train = data['arr_0']/255.  # scale pixel values from 0-255 down to 0-1
labels = data['arr_1']
```

The `train` variable here holds the actual pixel values scaled from 0 to 1, and `labels` holds the font of each image; each label is therefore 0, 1, 2, 3, or 4, as there are five fonts in total. You can print out these values to look at them using the following code:

```python
# Look at some data
print(train[0])
print(labels[0])
```

However, that's not very instructive, as most of the values are zeroes and only the central part of the screen contains the image data:

If you have Matplotlib installed, now is a good place to import it. We'll use `plt.ion()` to automatically bring up figures when needed:

```python
# If you have matplotlib installed
import matplotlib.pyplot as plt
plt.ion()
```

Here are some example images of characters from each font:

Yeah, they're pretty flashy. In the dataset, each image is represented as a 36 x 36 two-dimensional matrix of pixel darkness values. The 0 value represents a white pixel, while 255 represents a black pixel. Everything in between is a shade of gray. Here's the code to display these fonts on your own machine:

```python
# Let's look at a subplot of one of A in each font
f, plts = plt.subplots(5, sharex=True)
c = 91
for i in range(5):
    plts[i].pcolor(train[c + i * 558], cmap=plt.cm.gray_r)
```

If your plot appears really wide, you can easily resize the window just using your mouse. It's often much more work to resize it ahead of time in Python if you're simply plotting interactively.

Our goal is to decide which font an image belongs to, given that we have many other labeled images of the fonts. To expand the dataset and help avoid overfitting, we have also *jittered* each character around in the 36 x 36 area, giving us nine times as many data points.
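As a rough illustration of the jittering idea, the sketch below shifts a single 36 x 36 image by one pixel in each direction to produce nine variants. The one-pixel offsets, the use of `np.roll`, and the helper name `jitter` are assumptions made for this example; the exact procedure used to build the dataset isn't shown here.

```python
import numpy as np

def jitter(image, shifts=(-1, 0, 1)):
    """Return shifted copies of a 2-D image, one per (row, column) offset.

    Hypothetical helper: np.roll wraps pixels around the edges, which is
    harmless when the border of the image is blank.
    """
    return np.array([np.roll(image, (dy, dx), axis=(0, 1))
                     for dy in shifts for dx in shifts])

# A blank 36 x 36 image with a small dark square in the middle
image = np.zeros((36, 36))
image[16:20, 16:20] = 1.0

variants = jitter(image)
print(variants.shape)  # (9, 36, 36)
```

Three row offsets times three column offsets gives the factor of nine mentioned above, and the unshifted image is one of the nine.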

It may be helpful to come back to this after working with later models. It's important to keep the original data in mind, no matter how advanced the final model is.

### Logistic regression

If you're familiar with linear regression, you're halfway toward understanding logistic regression. Basically, we're going to assign a weight to each pixel in the image, then take the weighted sum of those pixels (beta for weights and *X* for pixels). This will give us a score for that image being a particular font. Every font will have its own set of weights, as they will value pixels differently. To convert these scores into proper probabilities (represented by *Y*), we will use what's called the `softmax` function, which forces each value to lie between 0 and 1 and all of them to sum to 1, as illustrated next. Whichever probability is the greatest for a particular image, we will classify the image into the associated class.

You can read more about the theory of logistic regression in most statistical modeling textbooks. Here is its formula:
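In the notation above, and writing $\beta_k$ for the weight vector of font class $k$ (the subscript is notation assumed here, with classes numbered 0 to 4), the softmax model is commonly written as:

$$
Y_k = \frac{\exp(X \beta_k)}{\sum_{j=0}^{4} \exp(X \beta_j)}, \qquad k = 0, \dots, 4
$$

Each $Y_k$ lies between 0 and 1, and the five values sum to 1.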

One good reference that focuses on applications is William H. Greene's *Econometric Analysis* (Pearson, 2012).
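To make the score-to-probability step concrete, here is a minimal NumPy sketch of softmax applied to made-up scores for one image. The score values and the helper name `softmax` are illustrative assumptions; subtracting the maximum score is a common numerical-stability trick that doesn't change the result.

```python
import numpy as np

def softmax(scores):
    """Turn a 1-D array of scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # avoids overflow in np.exp
    exps = np.exp(shifted)
    return exps / exps.sum()

# Made-up scores for the five font classes
scores = np.array([1.0, 2.0, 0.5, 3.0, -1.0])
probs = softmax(scores)

print(probs.sum())       # sums to 1 (up to floating-point rounding)
print(np.argmax(probs))  # 3, the class with the highest score
```

The predicted class is simply the index of the largest probability, which is also the index of the largest score.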

### Getting data ready

Implementing logistic regression is pretty easy in TensorFlow and will serve as scaffolding for more complex machine learning algorithms. First, we need to convert our integer labels into a *one-hot* format. This means that, instead of labeling an image with font class 2, we transform the label into `[0, 0, 1, 0, 0]`. That is, we put a `1` in position two (note that counting from 0 is common in computer science) and a `0` for every other class. Here's the code for our `to_onehot` function:

```python
def to_onehot(labels, nclasses=5):
    '''
    Convert labels to "one-hot" format.

    >>> a = [0,1,2,3]
    >>> to_onehot(a,5)
    array([[ 1.,  0.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  0.,  1.,  0.]])
    '''
    outlabels = np.zeros((len(labels), nclasses))
    for i, l in enumerate(labels):
        outlabels[i, l] = 1
    return outlabels
```

With this done, we can go ahead and call the function:

```python
onehot = to_onehot(labels)
```
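For larger label arrays, the same conversion can be done without a Python loop. The sketch below is one possible alternative, not the version used in the rest of this section; the name `to_onehot_vectorized` is hypothetical, and it assumes the labels are integers in the range 0 to `nclasses - 1`.

```python
import numpy as np

def to_onehot_vectorized(labels, nclasses=5):
    """One-hot encode integer labels by indexing rows of an identity matrix."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(nclasses)[labels]

example = to_onehot_vectorized([0, 1, 2, 3])
print(example.shape)  # (4, 5)
print(example[2])     # [0. 0. 1. 0. 0.]
```

Row `k` of the identity matrix is exactly the one-hot vector for class `k`, so indexing with the label array picks out one row per label.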

For the pixels, we don't really want a matrix in this case, so we'll flatten the 36 x 36 numbers into a one-dimensional vector of length 1,296, but this will come a little bit later. Also, recall that we've rescaled the pixel values of 0-255 so that they fall between 0 and 1.
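As a preview of that flattening step, the sketch below reshapes a stack of 36 x 36 images into rows of 1,296 values. The array `images` is random stand-in data, not the font dataset.

```python
import numpy as np

# Stand-in for a batch of ten 36 x 36 images with values in [0, 1)
images = np.random.rand(10, 36, 36)

# -1 lets NumPy infer the number of images; each row is one flattened image
flat = images.reshape([-1, 36 * 36])
print(flat.shape)  # (10, 1296)
```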

Okay, our final piece of preparation is to split our dataset into training and validation sets. This will help us catch overfitting later on. The training set will help us determine the weights in our logistic regression model, and the validation set will just be used to confirm that those weights are reasonably correct on new data:

```python
# Split data into training and validation
indices = np.random.permutation(train.shape[0])
valid_cnt = int(train.shape[0] * 0.1)  # hold out 10% for validation
test_idx, training_idx = indices[:valid_cnt], indices[valid_cnt:]
test, train = train[test_idx, :], train[training_idx, :]
onehot_test, onehot_train = onehot[test_idx, :], onehot[training_idx, :]
```
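One way to sanity-check a split like this is to confirm that the two index sets are disjoint and together cover every row. The sketch below does that on a small stand-in array; `n_rows` and `fake_data` are made up for the example.

```python
import numpy as np

np.random.seed(0)
n_rows = 50
fake_data = np.arange(n_rows * 4).reshape(n_rows, 4)  # stand-in for the images

indices = np.random.permutation(n_rows)
valid_cnt = int(n_rows * 0.1)  # hold out 10% for validation
valid_idx, train_idx = indices[:valid_cnt], indices[valid_cnt:]

# No row appears in both sets, and no row is lost
overlap = np.intersect1d(valid_idx, train_idx)
print(len(valid_idx), len(train_idx), len(overlap))  # 5 45 0
```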

### Building a TensorFlow model

Okay, let's kick off the TensorFlow code by creating an interactive session:

```python
sess = tf.InteractiveSession()
```

With this, we've started our first model in TensorFlow.

We're going to use a placeholder variable for `x`, which represents our input images. This is just to tell TensorFlow that we will supply the value for this node via `feed_dict` later on:

```python
# These will be inputs
## Input pixels, flattened
x = tf.placeholder("float", [None, 1296])
```

Also, note that we can specify the shape of this tensor, and here we have used `None` as one of the sizes. The `None` size allows us to send an arbitrary number of data points into the algorithm at once for batch processing. We'll use the variable `y_` likewise to hold our known labels to be used for training later on:

```python
## Known labels
y_ = tf.placeholder("float", [None, 5])
```

To perform logistic regression, we need a set of weights (`W`). In fact, we need 1,296 weights for each of the five font classes, which will give us our shape. Note that we also want to include an extra weight for each class as a bias (`b`). This is the same as adding an extra input variable that always takes the value `1`:

```python
# Variables
W = tf.Variable(tf.zeros([1296, 5]))
b = tf.Variable(tf.zeros([5]))
```
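The equivalence between a bias term and an extra input fixed at `1` can be checked numerically. The NumPy sketch below uses small random stand-in arrays (three inputs instead of 1,296) purely for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(4, 3)   # 4 data points, 3 inputs each (stand-in for 1,296 pixels)
W = rng.rand(3, 5)   # one column of weights per class
b = rng.rand(5)      # one bias per class

# Version 1: explicit bias
scores_bias = X @ W + b

# Version 2: append a constant-1 input and fold b into the weight matrix
X_aug = np.hstack([X, np.ones((4, 1))])
W_aug = np.vstack([W, b])
scores_aug = X_aug @ W_aug

print(np.allclose(scores_bias, scores_aug))  # True
```

Keeping `b` as a separate variable, as the model here does, avoids having to append a column of ones to every batch of inputs.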

With all these TensorFlow variables floating around, we need to make sure they get initialized. Let's initialize them now:

```python
# Just initialize
sess.run(tf.global_variables_initializer())
```

Good job! You've got everything prepared. Now you can implement the `softmax` formula to compute probabilities. Because we set up our weights and input very carefully, TensorFlow makes this task very easy with just a call to `tf.matmul` and `tf.nn.softmax`:

```python
# Define model
y = tf.nn.softmax(tf.matmul(x, W) + b)
```
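To see what this line computes, here is a NumPy sketch of the same forward pass on a stand-in batch. With all weights and biases still at zero, every score is zero, so each of the five classes gets a probability of 0.2. The batch size of 3 and the helper `softmax_rows` are assumptions for the example.

```python
import numpy as np

def softmax_rows(scores):
    """Apply softmax to each row of a 2-D array of scores."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

x_batch = np.random.rand(3, 1296)  # 3 flattened stand-in images
W0 = np.zeros((1296, 5))           # same shapes as W and b in the model
b0 = np.zeros(5)

y_batch = softmax_rows(x_batch @ W0 + b0)
print(y_batch.shape)  # (3, 5)
print(y_batch[0])     # [0.2 0.2 0.2 0.2 0.2]
```

The same code works for any number of rows in `x_batch`, which is the role the `None` dimension plays in the placeholders.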

That's it! You've implemented an entire machine learning classifier in TensorFlow. Nice work. But where do we get the values for the weights? Let's take a look at using TensorFlow to train the model.