Python Machine Learning By Example

By: Yuxi (Hayden) Liu

Overview of this book

Data science and machine learning are some of the top buzzwords in the technical world today. The resurging interest in machine learning is due to the same factors that have made data mining and Bayesian analysis more popular than ever. This book is your entry point to machine learning. It starts with an introduction to machine learning and the Python language and shows you how to complete the setup. Moving ahead, you will learn all the important concepts, such as exploratory data analysis, data preprocessing, feature extraction, data visualization and clustering, classification, regression, and model performance evaluation. With the help of the various projects included, you will find it intriguing to learn the mechanics of several important machine learning algorithms; they are not as obscure as you might have thought. You will also be guided step by step to build your own models from scratch. Toward the end, you will gain a broad picture of the machine learning ecosystem and of best practices for applying machine learning techniques. Through this book, you will learn to tackle data-driven problems and implement your solutions with the powerful yet simple language, Python. Interesting and easy-to-follow examples, such as news topic classification, spam email detection, online ad click-through prediction, and stock price forecasting, will keep you engaged until you reach your goal.
Table of Contents (9 chapters)

A very high level overview of machine learning

Machine learning, which mimics human intelligence, is a subfield of artificial intelligence, a field of computer science concerned with creating intelligent systems. Software engineering is another field in computer science, and we can generally label Python programming as a type of software engineering. Machine learning is also closely related to linear algebra, probability theory, statistics, and mathematical optimization. We usually build machine learning models based on statistics, probability theory, and linear algebra, then optimize the models using mathematical optimization. Most of us reading this book should have at least sufficient knowledge of Python programming. Those who do not feel confident about their mathematical knowledge might be wondering how much time should be spent learning or brushing up on the aforementioned subjects. Don't panic. In this book, we will get machine learning to work for us without going deep into mathematical details. It requires only some basic, 101-level knowledge of probability theory and linear algebra, which helps us understand the mechanics of machine learning techniques and algorithms. And it gets easier, as we will be building models both from scratch and with popular packages in Python, a language we like and are familiar with.

Those who want to study machine learning systematically can enroll in computer science, artificial intelligence, and, more recently, data science master's programs. There are also various data science bootcamps. However, the selection for bootcamps is usually stricter because they are more job oriented, and the programs are often short, ranging from 4 to 10 weeks. Another option is free massive open online courses (MOOCs); a popular example is Andrew Ng's Machine Learning. Last but not least, industry blogs and websites are great resources for keeping up with the latest developments.

Machine learning is not only a skill, but also a bit of a sport. We can compete in several machine learning competitions: sometimes for decent cash prizes, sometimes for joy, and most of the time to play to our strengths. However, to win these competitions, we may need to use certain techniques that are useful only in the context of competitions and not when trying to solve a business problem. That's right, the "no free lunch" theorem applies here.

A machine learning system is fed with input data, which can be numerical, textual, visual, or audiovisual. The system usually has outputs. An output can be a floating-point number, for instance, the acceleration of a self-driving car, or an integer representing a category (also called a class), for example, a cat or a tiger in image recognition.

The main task of machine learning is to explore and construct algorithms that can learn from historical data and make predictions on new input data. For a data-driven solution, we need to define (or have defined for us by an algorithm) an evaluation function, called a loss or cost function, which measures how well the models are learning. In this setup, we create an optimization problem with the goal of learning in the most efficient and effective way.
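
To make the idea of a loss function and the resulting optimization problem concrete, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the toy data, the one-parameter model y = w * x, the choice of mean squared error as the loss, and the gradient descent settings are all hypothetical and are not taken from the book.

```python
# Hypothetical toy data: inputs xs and targets ys, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]


def predict(w, x):
    """A one-parameter model: the prediction is simply w * x."""
    return w * x


def mse_loss(w, xs, ys):
    """Mean squared error: the average squared gap between prediction and target."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, xs, ys):
    """Derivative of the mean squared error with respect to w."""
    return sum(2 * (predict(w, x) - y) * x for x, y in zip(xs, ys)) / len(xs)


def fit(xs, ys, learning_rate=0.01, n_steps=500):
    """Minimize the loss with plain gradient descent, starting from w = 0."""
    w = 0.0
    for _ in range(n_steps):
        w -= learning_rate * mse_gradient(w, xs, ys)
    return w


w = fit(xs, ys)
print("learned w:", w)
print("loss before learning:", mse_loss(0.0, xs, ys))
print("loss after learning:", mse_loss(w, xs, ys))
```

The loss is lower after fitting than before, which is what "learning" means in this setup: the optimizer moves the parameter in whichever direction reduces the loss.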

Depending on the nature of the learning data, machine learning tasks can be broadly classified into three categories as follows:

  • Unsupervised learning: when learning data contains only indicative signals without any description attached, it is up to us to find the underlying structure of the data, to discover hidden information, or to determine how to describe the data. This kind of learning data is called unlabeled data. Unsupervised learning can be used to detect anomalies, such as fraud or defective equipment, or to group customers with similar online behaviors for a marketing campaign.
  • Supervised learning: when learning data comes with descriptions, targets, or desired outputs in addition to indicative signals, the learning goal becomes to find a general rule that maps inputs to outputs. This kind of learning data is called labeled data. The learned rule is then used to label new data with unknown outputs. The labels are usually provided by event logging systems and human experts. Where feasible, they may also be produced by members of the public, for instance through crowdsourcing. Supervised learning is commonly used in daily applications, such as face and speech recognition, product or movie recommendations, and sales forecasting.
  • We can further subdivide supervised learning into regression and classification. Regression trains on and predicts a continuous-valued response, for example, predicting house prices, while classification attempts to find the appropriate class label, such as analyzing positive/negative sentiment and predicting loan defaults.
  • If not all learning samples are labeled, but some are, we have semi-supervised learning. It makes use of unlabeled data (typically a large amount) for training, alongside a small amount of labeled data. Semi-supervised learning is applied in cases where it is expensive to acquire a fully labeled dataset but more practical to label a small subset. For example, it often requires skilled experts to label hyperspectral remote sensing images, and many field experiments to confirm whether oil is present at a particular location, while acquiring unlabeled data is relatively easy.
  • Reinforcement learning: learning data provides feedback so that the system adapts to dynamic conditions in order to achieve a certain goal. The system evaluates its performance based on the feedback responses and reacts accordingly. The best-known instances include self-driving cars and AlphaGo, the program that mastered the board game Go.
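
The difference between labeled and unlabeled data can be sketched in a few lines of plain Python. This is an illustrative sketch only: the one-dimensional toy data and the helper names (train_nearest_centroid, classify, two_means) are hypothetical, and the two techniques shown (a nearest-centroid classifier and k-means with two clusters) are simple stand-ins chosen for brevity, not the algorithms used later in the book.

```python
# Hypothetical 1-D data. Labeled samples are (feature, class label) pairs;
# unlabeled samples are feature values only.
labeled = [(1.0, 0), (1.2, 0), (0.8, 0), (5.0, 1), (5.3, 1), (4.9, 1)]
unlabeled = [1.1, 0.9, 5.1, 4.8, 1.3, 5.2]


def mean(values):
    return sum(values) / len(values)


def train_nearest_centroid(samples):
    """Supervised: learn one centroid per class from labeled samples."""
    by_label = {}
    for x, label in samples:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def classify(centroids, x):
    """Label a new sample with the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def two_means(values, n_iterations=10):
    """Unsupervised: split unlabeled values into two groups (k-means, k=2)."""
    centers = [min(values), max(values)]  # simple deterministic initialization
    groups = [[], []]
    for _ in range(n_iterations):
        groups = [[], []]
        for x in values:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            groups[nearest].append(x)
        # Keep the old center if a group happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Supervised: the labels tell the model what each group means.
centroids = train_nearest_centroid(labeled)
print("predicted class for 4.7:", classify(centroids, 4.7))

# Unsupervised: structure is discovered, but the groups carry no names.
centers, groups = two_means(unlabeled)
print("discovered groups:", groups)
```

In the supervised case the output is a class label that was present in the training data. In the unsupervised case the algorithm returns groups only, and it is up to us to interpret what they represent.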

Feeling a little bit confused by the abstract concepts? Don't worry. We will encounter many concrete examples of these types of machine learning tasks later in the book. From Chapter 3, Spam Email Detection with Naive Bayes, to Chapter 6, Click-Through Prediction with Logistic Regression, we will see some supervised learning tasks and several classification algorithms. In Chapter 7, Stock Price Prediction with Regression Algorithms, we will continue with another supervised learning task, regression, and assorted regression algorithms. In Chapter 2, Exploring the 20 Newsgroups Dataset with Text Analysis Algorithms, we will be given an unsupervised task and explore various unsupervised techniques and algorithms.