Mastering Machine Learning with scikit-learn - Second Edition

By: Gavin Hackeling

Overview of this book

Machine learning brings computer science and statistics together to build models that learn from data. Its algorithms and techniques let you automate the construction of analytical models. This book examines a variety of machine learning models, including popular algorithms such as k-nearest neighbors, logistic regression, naive Bayes, k-means, decision trees, and artificial neural networks. It discusses data preprocessing, hyperparameter optimization, and ensemble methods. You will build systems that classify documents, recognize images, detect ads, and more. You will learn to use scikit-learn’s API to extract features from categorical variables, text, and images; to evaluate model performance; and to develop an intuition for how to improve it. By the end of this book, you will have a practical command of the scikit-learn concepts needed to build efficient models for advanced tasks.

Naive Bayes with scikit-learn


Let's fit a Naive Bayes classifier with scikit-learn. We will compare the performance of Naive Bayes and logistic regression classifiers on increasingly large samples of two datasets. The Breast Cancer Wisconsin dataset consists of features extracted from fine needle aspirate images of breast masses. The task is to classify each mass as malignant or benign using 30 real-valued features that describe the cell nuclei in its image. The dataset has 212 malignant instances and 357 benign instances. The task for the Pima Indians Diabetes Database is to predict whether an individual has diabetes using eight features: the number of times the individual has been pregnant, measures from an oral glucose tolerance test, diastolic blood pressure, triceps skin fold thickness, body mass index, age, and other diagnostics. The dataset has 268 diabetic instances and 500 non-diabetic instances:

# In[1]:
%matplotlib inline

# In[2]:
import...
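
The listing above is cut off in this copy, so what follows is not the book's code. It is a minimal sketch of the comparison the text describes, under several assumptions: it covers only the Breast Cancer Wisconsin data (through scikit-learn's bundled load_breast_cancer, since the Pima data is not shipped with the library), it uses GaussianNB as the Naive Bayes variant because the features are real-valued, and it standardizes the features before logistic regression. The helper name accuracy_by_sample_size, the 80/20 split, and the random seed are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 569 instances, 30 real-valued features; target 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a fixed test set; the split shuffles, so any prefix of the
# training set is a random sample of it. Seed and test size are arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=11)


def accuracy_by_sample_size(model, fractions):
    """Fit `model` on growing prefixes of the training set and return
    its accuracy on the held-out test set for each prefix (hypothetical helper)."""
    scores = []
    for fraction in fractions:
        n = int(fraction * len(X_train))
        model.fit(X_train[:n], y_train[:n])
        scores.append(model.score(X_test, y_test))
    return scores


fractions = np.linspace(0.1, 1.0, 10)
nb_scores = accuracy_by_sample_size(GaussianNB(), fractions)
lr_scores = accuracy_by_sample_size(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    fractions)

for fraction, nb, lr in zip(fractions, nb_scores, lr_scores):
    print(f"{fraction:4.0%} of training data: "
          f"Naive Bayes {nb:.3f}, logistic regression {lr:.3f}")
```

Plotting nb_scores and lr_scores against fractions shows how each classifier's test accuracy changes as more training data becomes available. A single split is noisy, so averaging over several random splits would give smoother curves.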