Learning Data Mining with Python - Second Edition

By: Robert Layton

Overview of this book

This book teaches you to design and develop data mining applications using a variety of datasets, starting with basic classification and affinity analysis. It covers a large number of libraries and tools available in Python, including the Jupyter Notebook, pandas, scikit-learn, and NLTK. You will gain hands-on experience with complex data types including text, images, and graphs. You will also discover object detection using Deep Neural Networks, which is one of the big, difficult areas of machine learning right now. With restructured examples and code samples updated for the latest version of Python, each chapter of this book introduces you to new algorithms and techniques. By the end of the book, you will have gained insight into using Python for data mining, along with an understanding of the algorithms and their implementations.

Title Page

Credits

About the Author

About the Reviewer

www.PacktPub.com

Customer Feedback

Preface

Getting Started with Data Mining

Introducing data mining

Using Python and the Jupyter Notebook

A simple affinity analysis example

Product recommendations

A simple classification example

What is classification?

Summary

Classifying with scikit-learn Estimators

scikit-learn estimators

Preprocessing

Pipelines

Summary

Predicting Sports Winners with Decision Trees

Loading the dataset

Decision trees

Sports outcome prediction

Random forests

Summary

Recommending Movies Using Affinity Analysis

Affinity analysis

Dealing with the movie recommendation problem

Understanding the Apriori algorithm and its implementation

Summary

Features and scikit-learn Transformers

Feature extraction

Feature selection

Feature creation

Principal Component Analysis

Creating your own transformer

Unit testing

Putting it all together

Summary

Social Media Insight using Naive Bayes

Disambiguation

Downloading data from a social network

Text transformers

Naive Bayes

Applying Naive Bayes

Getting useful features from models

Summary

Follow Recommendations Using Graph Mining

Loading the dataset

Getting follower information from Twitter

Creating a graph

Finding subgraphs

Summary

Beating CAPTCHAs with Neural Networks

Artificial neural networks

Creating the dataset

Training and classifying

Predicting words

Summary

Authorship Attribution

Attributing documents to authors

Getting the data

Using function words

Support Vector Machines

Character n-grams

The Enron dataset

Putting it all together

Evaluation

Summary

Clustering News Articles

Preface

The second revision of Learning Data Mining with Python was written with the programmer in mind. It aims to introduce data mining to a wide range of programmers, as I feel that this is critically important to all those in the computer science field. Data mining is quickly becoming the building block of the next generation of Artificial Intelligence systems. Even if you don't find yourself building these systems, you will be using them, interfacing with them, and being guided by them. Understanding the process behind them is important and helps you get the best out of them.

The second revision builds upon the first. Many of the chapters and exercises are similar, although new concepts are introduced and exercises are expanded in scope. Those who have read the first revision should be able to move quickly through the book, pick up new knowledge along the way, and engage with the extra activities proposed. Those new to the book are encouraged to take their time, do the exercises, and experiment. Feel free to break the code to understand it, and reach out if you have any questions.

As this is a book aimed at programmers, we assume that you have some knowledge of programming and of Python itself. For this reason, there is little explanation of what the Python code itself is doing, except in cases where it is ambiguous.

What this book covers

Chapter 1, Getting Started with Data Mining, introduces the technologies we will be using, along with implementing two basic algorithms to get started.

Chapter 2, Classifying with scikit-learn Estimators, covers classification, a key form of data mining. You’ll also learn about some structures for making your data mining experimentation easier to perform.

Chapter 3, Predicting Sports Winners with Decision Trees, introduces two new algorithms, Decision Trees and Random Forests, and uses them to predict sports winners by creating useful features.

Chapter 4, Recommending Movies Using Affinity Analysis, looks at the problem of recommending products based on past experience, and introduces the Apriori algorithm.

Chapter 5, Features and scikit-learn Transformers, introduces more types of features you can create, and how to work with different datasets.

Chapter 6, Social Media Insight using Naive Bayes, uses the Naïve Bayes algorithm to automatically parse text-based information from the social media website Twitter.

Chapter 7, Follow Recommendations Using Graph Mining, applies cluster analysis and network analysis to find good people to follow on social media.

Chapter 8, Beating CAPTCHAs with Neural Networks, looks at extracting information from images, and then training neural networks to find words and letters in those images.

Chapter 9, Authorship Attribution, looks at determining who wrote a given document, by extracting text-based features and using Support Vector Machines.

Chapter 10, Clustering News Articles, uses the k-means clustering algorithm to group together news articles based on their content.

Chapter 11, Object Detection in Images using Deep Neural Networks, determines what type of object is being shown in an image, by applying deep neural networks.

Chapter 12, Working with Big Data, looks at workflows for applying algorithms to big data and how to get insight from it.

Appendix, Next step, goes through each chapter, giving hints on where to go next for a deeper understanding of the concepts introduced.

What you need for this book

It should come as no surprise that you’ll need a computer, or access to one, to complete the book. The computer should be reasonably modern, but it doesn’t need to be overpowered. Any modern processor (from about 2010 onwards) and 4 gigabytes of RAM will suffice, and you can probably run almost all of the code on a slower system too.

The exception here is the final two chapters. In these chapters, I step through using Amazon Web Services (AWS) for running the code. This will probably cost you some money, but the advantage is less system setup than running the code locally. If you don’t want to pay for those services, the tools used can all be set up on a local computer, but you will definitely need a modern system to run them: a processor built in 2012 or later and more than 4 GB of RAM are necessary.

I recommend the Ubuntu operating system, but the code should work well on Windows, Macs, or any other Linux variant. You may need to consult the documentation for your system to get some things installed though.

In this book, I install code with pip, a command-line tool for installing Python libraries. Another option is to use Anaconda, which can be found online here: http://continuum.io/downloads

I have also tested all code using Python 3. Most of the code examples work on Python 2 with no changes. If you run into any problems and can’t get around them, send an email and we can offer a solution.
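Before starting, it can help to confirm that your setup matches the requirements above. The following is a minimal sketch, not part of the book's code bundle: the function names and the list of libraries are assumptions made for illustration, based on the libraries named in the overview. It uses only the standard library to check the Python version and to report which libraries cannot be found.

```python
import sys
from importlib.util import find_spec

# Import names for libraries mentioned in this book (an illustrative list,
# not an official requirements file). "sklearn" is scikit-learn's import name.
LIBRARIES = ["numpy", "pandas", "sklearn", "nltk"]


def is_python3(version_info=sys.version_info):
    """Return True when the interpreter's major version is 3 or later."""
    return version_info[0] >= 3


def missing_libraries(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("Running Python 3:", is_python3())
    missing = missing_libraries(LIBRARIES)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed libraries can be imported.")
```

Any library reported as not installed can then be added with pip or Anaconda, as described above.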

Who this book is for

This book is for programmers who want to get started in data mining in an application-focused manner.

If you haven’t programmed before, I strongly recommend that you learn at least the basics before you get started. This book doesn’t introduce programming, nor does it give too much time to explaining the actual implementation (in-code) of how to type out the instructions. That said, once you go through the basics, you should be able to come back to this book fairly quickly – there is no need to be an expert programmer first!

I highly recommend that you have some Python programming experience. If you don’t, feel free to jump in, but you might want to take a look at some Python code first, possibly focused on tutorials using the IPython notebook. Writing programs in the IPython notebook works a little differently than other methods, such as writing a Java program in a fully-fledged IDE.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The next lines of code read the link and assign it to the dataset_filename variable."

A block of code is set as follows:

import numpy as np 
dataset_filename = "affinity_dataset.txt" 
X = np.loadtxt(dataset_filename)

Any command-line input or output is written as follows:

$ conda install scikit-learn

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."

Note

Warnings or important notes appear in a box like this.

Note

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book from.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Learning-Data-Mining-with-Python-Second-Edition. The benefit of the GitHub repository is that any issues with the code, including problems relating to software version changes, will be tracked, and the code there will include changes from readers around the world. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out! To avoid indentation issues, please run the code from the code bundle in your IDE instead of copying it directly from the PDF.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at [email protected] with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.
