Book Image

TensorFlow Machine Learning Cookbook. - Second Edition

By : Sujit Pal, Nick McClure
Book Image

TensorFlow Machine Learning Cookbook. - Second Edition

By: Sujit Pal, Nick McClure

Overview of this book

TensorFlow is an open source software library for Machine Intelligence. The independent recipes in this book will teach you how to use TensorFlow for complex data computations and allow you to dig deeper and gain more insights into your data than ever before. With the help of this book, you will work with recipes for training models, model evaluation, sentiment analysis, regression analysis, clustering analysis, artificial neural networks, and more. You will explore RNNs, CNNs, GANs, reinforcement learning, and capsule networks, each using Google's machine learning library, TensorFlow. Through real-world examples, you will get hands-on experience with linear regression techniques with TensorFlow. Once you are familiar and comfortable with the TensorFlow ecosystem, you will be shown how to take it to production. By the end of the book, you will be proficient in the field of machine intelligence using TensorFlow. You will also have good insight into deep learning and be capable of implementing machine learning algorithms in real-world scenarios.
Table of Contents (13 chapters)

Working with data sources

For most of this book, we will rely on the use of datasets to fit machine learning algorithms. This section has instructions on how to access each of these datasets through TensorFlow and Python.

Some of the data sources rely on the maintenance of outside websites so that you can access the data. If these websites change or remove this data, then some of the following code in this section may need to be updated. You may find the updated code at the author's GitHub page: https://github.com/nfmcclure/tensorflow_cookbook.

Getting ready

In TensorFlow, some of the datasets that we will use are built into Python libraries, some will require a Python script to download, and some will be manually downloaded through the internet. Almost all of these datasets will require an active internet connection so that you can retrieve them.

How to do it...

  1. Iris data: This dataset is arguably the most classic dataset used in machine learning and maybe all of statistics. It is a dataset that measures sepal length, sepal width, petal length, and petal width of three different types of iris flowers: Iris setosa, Iris virginica, and Iris versicolor. There are 150 measurements overall, which means that there are 50 measurements of each species. To load the dataset in Python, we will use scikit-learn's dataset function, as follows:
from sklearn import datasets 
iris = datasets.load_iris() 
print(len(iris.data)) 
150 
print(len(iris.target)) 
150 
print(iris.data[0]) # Sepal length, Sepal width, Petal length, Petal width 
[ 5.1 3.5 1.4 0.2] 
print(set(iris.target)) # I. setosa, I. virginica, I. versicolor 
{0, 1, 2} 
  1. Birth weight data: This data was originally from Baystate Medical Center, Springfield, Mass 1986 (1). This dataset contains the measure of child birth weight and other demographic and medical measurements of the mother and family history. There are 189 observations of eleven variables. The following code shows you how you can access this data in Python:
import requests
birthdata_url = 'https://github.com/nfmcclure/tensorflow_cookbook/raw/master/01_Introduction/07_Working_with_Data_Sources/birthweight_data/birthweight.dat'
birth_file = requests.get(birthdata_url)
birth_data = birth_file.text.split('\r\n')
birth_header = birth_data[0].split('\t')
birth_data = [[float(x) for x in y.split('\t') if len(x)>=1] for y in birth_data[1:] if len(y)>=1]
print(len(birth_data)) 189 print(len(birth_data[0])) 9

  1. Boston Housing data: Carnegie Mellon University maintains a library of datasets in their StatLib Library. This data is easily accessible via The University of California at Irvine's machine learning Repository (https://archive.ics.uci.edu/ml/index.php). There are 506 observations of house worth, along with various demographic data and housing attributes (14 variables). The following code shows you how to access this data in Python, via the Keras library:
from keras.datasets import boston_housing
(x_train, y_train), (x_test, y_test) = boston_housing.load_data()
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV'] print(x_train.shape[0]) 404 print(x_train.shape[1]) 13
  1. MNIST handwriting data: The MNIST (Mixed National Institute of Standards and Technology) dataset is a subset of the larger NIST handwriting database. The MNIST handwriting dataset is hosted on Yann LeCun's website (https://yann.lecun.com/exdb/mnist/). It is a database of 70,000 images of single-digit numbers (0-9) with about 60,000 annotated for a training set and 10,000 for a test set. This dataset is used so often in image recognition that TensorFlow provides built-in functions to access this data. In machine learning, it is also important to provide validation data to prevent overfitting (target leakage). Because of this, TensorFlow sets aside 5,000 images of the train set into a validation set. The following code shows you how to access this data in Python:
from tensorflow.examples.tutorials.mnist import input_data 
mnist = input_data.read_data_sets("MNIST_data/"," one_hot=True) 
print(len(mnist.train.images)) 
55000 
print(len(mnist.test.images)) 
10000 
print(len(mnist.validation.images)) 
5000 
print(mnist.train.labels[1,:]) # The first label is a 3 
[ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.] 
  1. Spam-ham text data. UCI's machine learning dataset library also holds a spam-ham text message dataset. We can access this .zip file and get the spam-ham text data as follows:
import requests 
import io 
from zipfile import ZipFile 
zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip' 
r = requests.get(zip_url) 
z = ZipFile(io.BytesIO(r.content)) 
file = z.read('SMSSpamCollection') 
text_data = file.decode() 
text_data = text_data.encode('ascii',errors='ignore') 
text_data = text_data.decode().split('\n') 
text_data = [x.split('\t') for x in text_data if len(x)>=1] 
[text_data_target, text_data_train] = [list(x) for x in zip(*text_data)] 
print(len(text_data_train)) 
5574 
print(set(text_data_target)) 
{'ham', 'spam'} 
print(text_data_train[1]) 
Ok lar... Joking wif u oni... 
  1. Movie review data: Bo Pang from Cornell has released a movie review dataset that classifies reviews as good or bad (3). You can find the data on the following website: http://www.cs.cornell.edu/people/pabo/movie-review-data/. To download, extract, and transform this data, we can run the following code:
import requests 
import io 
import tarfile 
movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz' 
r = requests.get(movie_data_url) 
# Stream data into temp object 
stream_data = io.BytesIO(r.content) 
tmp = io.BytesIO() 
while True: 
    s = stream_data.read(16384) 
    if not s: 
        break 
    tmp.write(s) 
    stream_data.close() 
tmp.seek(0) 
# Extract tar file 
tar_file = tarfile.open(fileobj=tmp, mode="r:gz") 
pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos') 
neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg') 
# Save pos/neg reviews (Also deal with encoding) 
pos_data = [] 
for line in pos: 
    pos_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode()) 
neg_data = [] 
for line in neg: 
    neg_data.append(line.decode('ISO-8859-1').encode('ascii',errors='ignore').decode()) 
tar_file.close() 
print(len(pos_data)) 
5331 
print(len(neg_data)) 
5331 
# Print out first negative review 
print(neg_data[0]) 
simplistic , silly and tedious . 
  1. CIFAR-10 image data: The Canadian Institute For Advanced Research has released an image set that contains 80 million labeled colored images (each image is scaled to 32 x 32 pixels). There are 10 different target classes (airplane, automobile, bird, and so on). CIFAR-10 is a subset that includes 60,000 images. There are 50,000 images in the training set, and 10,000 in the test set. Since we will be using this dataset in multiple ways, and because it is one of our larger datasets, we will not run a script each time we need it. To get this dataset, please navigate to http://www.cs.toronto.edu/~kriz/cifar.html and download the CIFAR-10 dataset. We will address how to use this dataset in the appropriate chapters.
  2. The works of Shakespeare text data: Project Gutenberg (5) is a project that releases electronic versions of free books. They have compiled all of the works of Shakespeare together. The following code shows you how to access this text file through Python:
import requests 
shakespeare_url = 'http://www.gutenberg.org/cache/epub/100/pg100.txt' 
# Get Shakespeare text 
response = requests.get(shakespeare_url) 
shakespeare_file = response.content 
# Decode binary into string 
shakespeare_text = shakespeare_file.decode('utf-8') 
# Drop first few descriptive paragraphs. 
shakespeare_text = shakespeare_text[7675:] 
print(len(shakespeare_text)) # Number of characters 
5582212

  1. English-German sentence translation data: The Tatoeba project (http://tatoeba.org) collects sentence translations in many languages. Their data has been released under the Creative Commons License. From this data, ManyThings.org (http://www.manythings.org) has compiled sentence-to-sentence translations in text files that are available for download. Here, we will use the English-German translation file, but you can change the URL to whichever languages you would like to use:
import requests 
import io 
from zipfile import ZipFile 
sentence_url = 'http://www.manythings.org/anki/deu-eng.zip' 
r = requests.get(sentence_url) 
z = ZipFile(io.BytesIO(r.content)) 
file = z.read('deu.txt') 
# Format Data 
eng_ger_data = file.decode() 
eng_ger_data = eng_ger_data.encode('ascii',errors='ignore') 
eng_ger_data = eng_ger_data.decode().split('\n') 
eng_ger_data = [x.split('\t') for x in eng_ger_data if len(x)>=1] 
[english_sentence, german_sentence] = [list(x) for x in zip(*eng_ger_data)] 
print(len(english_sentence)) 
137673 
print(len(german_sentence)) 
137673 
print(eng_ger_data[10]) 
['I' won!, 'Ich habe gewonnen!'] 

How it works...

When it comes time to using one of these datasets in a recipe, we will refer you to this section and assume that the data is loaded in such a way as described in the preceding section. If further data transformation or pre-processing is needed, then such code will be provided in the recipe itself.

See also

Here are additional references for the data resources we use in this book:

  • Hosmer, D.W., Lemeshow, S., and Sturdivant, R. X. (2013) Applied Logistic Regression: 3rd Edition
  • Lichman, M. (2013). UCI machine learning repository http://archive.ics.uci.edu/ml Irvine, CA: University of California, School of Information and Computer Science
  • Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan, Thumbs up? Sentiment Classification using machine learning techniques, Proceedings of EMNLP 2002 http://www.cs.cornell.edu/people/pabo/movie-review-data/
  • Krizhevsky. (2009). Learning Multiple Layers of Features from Tiny Images http://www.cs.toronto.edu/~kriz/cifar.html
  • Project Gutenberg. Accessed April 2016 http://www.gutenberg.org/