IBM SPSS Modeler Cookbook

Overview of this book

IBM SPSS Modeler is a data mining workbench that enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions on hard data, not hunches or guesswork. IBM SPSS Modeler Cookbook takes you beyond the basics and shares the tips, the timesavers, and the workarounds that experts use to increase productivity and extract maximum value from data. The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this book, you are learning from practitioners who have helped define the state of the art. Follow the industry standard data mining process, gaining new skills at each stage, from loading data to integrating results into everyday business practices. Get a handle on the most efficient ways of extracting data from your own sources and preparing it for exploration and modeling. Master the best methods for building models that will perform well in the workplace. Go beyond the basics and get the full power of your data mining workbench with this practical guide.

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Data Understanding

Introduction

Using an empty aggregate to evaluate sample size

Evaluating the need to sample from the initial data

Using CHAID stumps when interviewing an SME

Using a single cluster K-means as an alternative to anomaly detection

Using an @NULL multiple Derive to explore missing data

Creating an Outlier report to give to SMEs

Detecting potential model instability early using the Partition node and Feature Selection node

Data Preparation – Select

Introduction

Using the Feature Selection node creatively to remove or decapitate perfect predictors

Running a Statistics node on anti-join to evaluate the potential missing data

Evaluating the use of sampling for speed

Removing redundant variables using correlation matrices

Selecting variables using the CHAID Modeling node

Selecting variables using the Means node

Selecting variables using single-antecedent Association Rules

Data Preparation – Clean

Introduction

Binning scale variables to address missing data

Using a full data model/partial data model approach to address missing data

Imputing in-stream mean or median

Imputing missing values randomly from uniform or normal distributions

Using random imputation to match a variable's distribution

Searching for similar records using a Neural Network for inexact matching

Using neuro-fuzzy searching to find similar names

Producing longer Soundex codes

Data Preparation – Construct

Introduction

Building transformations with multiple Derive nodes

Calculating and comparing conversion rates

Grouping categorical values

Transforming high skew and kurtosis variables with a multiple Derive node

Creating flag variables for aggregation

Using Association Rules for interaction detection/feature creation

Creating time-aligned cohorts

Data Preparation – Integrate and Format

Introduction

Speeding up merge with caching and optimization settings

Merging a lookup table

Shuffle-down (nonstandard aggregation)

Cartesian product merge using key-less merge by key

Multiplying out using Cartesian product merge, user source, and derive dummy

Changing large numbers of variable names without scripting

Parsing nonstandard dates

Parsing and performing a conversion on a complex stream

Sequence processing

Selecting and Building a Model

Introduction

Evaluating balancing with Auto Classifier

Building models with and without outliers

Using Neural Network for Feature Selection

Creating a bootstrap sample

Creating bagged logistic regression models

Using KNN to match similar cases

Using Auto Classifier to tune models

Next-Best-Offer for large datasets

Modeling – Assessment, Evaluation, Deployment, and Monitoring

Introduction

How (and why) to validate as well as test

Using classification trees to explore the predictions of a Neural Network

Correcting a confusion matrix for an imbalanced target variable by incorporating priors

Using aggregate to write cluster centers to Excel for conditional formatting

Creating a classification tree financial summary using aggregate and an Excel Export node

Reformatting data for reporting with a Transpose node

Changing formatting of fields in a Table node

Combining generated filters

CLEM Scripting

Introduction

Building iterative Neural Network forecasts

Quantifying variable importance with Monte Carlo simulation

Implementing champion/challenger model management

Detecting outliers with the jackknife method

Optimizing K-means cluster solutions

Automating time series forecasts

Automating HTML reports and graphs

Rolling your own modeling algorithm – Weibull analysis

Business Understanding

Introduction

Define business objectives by Tom Khabaza

Assessing the situation by Meta Brown

Translating your business objective into a data mining objective by Dean Abbott

Produce a project plan – ensuring a realistic timeline by Keith McCormick

Index

Introduction

This opening chapter is about data understanding, but data understanding is not the first phase of CRISP-DM. Business understanding comes first, and it is a critical phase. Some, including the authors of this book, would argue that business understanding is the phase most in need of more attention from new data miners. It is certainly a candidate for the phase that is most often rushed, at the peril of the data mining project. However, since this book is focused on specific software tasks and recipes, and since business understanding is conducted in the meeting room, not alone at one's laptop, our discussion of this phase is placed in a special section of the book. If you are new to data mining, please do read the business understanding section first (refer to Appendix, Business Understanding), and consider reading the CRISP-DM document in its entirety, as it will place our recipes in a broader context.

According to the CRISP-DM document, the data understanding phase starts with an initial data collection and proceeds with activities to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

CRISP-DM lists the following tasks as a part of the data understanding phase:

Collect the data
Describe the data
Explore the data
Verify data quality

In this chapter we will introduce some of the IBM SPSS Modeler nodes associated with these tasks, as well as nodes that one might associate with other phases but that can prove useful during data understanding. Since the recipes are orientated around software tasks, there is a particular focus on exploring the data and verifying its quality. Many of these recipes could be done immediately after accessing your data for the first time. Some of the hard work that follows will be inspired by what you uncover using these recipes.

The very first task you will need to do when data mining is to determine the size and nature of the data subset that you will be working with. This might involve sampling or balancing (a special kind of sampling) or both, but should always be thoughtful. Why sample? When you have plentiful data, a powerful computer and equally powerful software, why not use every bit of that?
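To make the distinction between sampling and balancing concrete, here is a minimal sketch in plain Python. It is not how the Modeler Sample or Balance nodes are implemented; the function names, the churn field, and the 1,000-record dataset are all hypothetical and chosen only for illustration. Balancing is shown here as downsampling every class to the size of the rarest one, which is one common approach among several.

```python
import random
from collections import Counter, defaultdict


def simple_random_sample(records, fraction, seed=0):
    """Keep each record independently with probability `fraction`."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def balance_by_downsampling(records, target_key, seed=0):
    """Balance classes by reducing each one to the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for r in records:
        by_class[r[target_key]].append(r)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Sample without replacement so no record is duplicated.
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Hypothetical data: 1,000 customers, 5% of whom churned.
records = [{"id": i, "churn": "yes" if i % 20 == 0 else "no"} for i in range(1000)]

sample = simple_random_sample(records, fraction=0.10)
balanced = balance_by_downsampling(records, target_key="churn")

# The random sample keeps roughly the original class proportions;
# the balanced set has equal numbers of each class.
print(len(sample), Counter(r["churn"] for r in sample))
print(len(balanced), Counter(r["churn"] for r in balanced))
```

Note that the random sample preserves the rarity of the churners, while the balanced set discards most of the non-churners. Either choice changes what a model sees, which is why the decision should be a thoughtful one and why its logic is worth recording.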

There was a time when one of the most popular concepts in data mining was to put an end to sampling. And this was not without reason. If the objective of data mining was to give business people the power to make discoveries from data independently, then it made sense to reduce the number of steps in any way possible. As computers and computer memory became less expensive, it seemed that sampling was a waste of time. And then, there was the idea of finding a valuable and elusive bit of information in a mass of data. This image was so powerful that it inspired the name for a whole field of study—data mining. To eliminate any data from the working dataset was to risk losing treasured insights.

Times change, and so have the attitudes of the data mining community. For one thing, many of today's data miners began in more traditional data analyst roles, and were familiar with classical statistics before they entered data mining. These data miners don't want to be without the full set of methods that they have used earlier in their careers. They expect their data mining tools to include statistical analysis capability, and sampling is central to classical statistical analysis. Business users may not have driven the shift toward sampling in data mining, but they have not stood in the way. Perhaps this is because many business people had some exposure to statistical analysis in school, or because the idea of sampling simply appeals to their common sense. Today, in stark contrast to some discussions of Big Data, sampling is a routine part of data mining. We will address related issues in our first two recipes.

Data understanding often involves close collaboration with others. This point might be forgotten when skimming this list of recipes, since most of them could be done by a solitary analyst. The recipe Using CHAID stumps when interviewing an SME underscores the importance of collaboration. Note that CHAID is used here to serve data exploration, not modeling. A primary goal of this phase is to uncover facts that need to be discussed with others, whether they be analyst colleagues, Subject Matter Experts (SMEs), IT support, or management.

There is always the possibility (some veterans might suggest that it is a near certainty) that you will have to circle back to business understanding to address new discoveries that you make when you actively start looking at data. Many of the other recipes in this chapter might also yield discoveries of this kind. Some time ago, Dean Abbott wrote a blog post on this subject entitled Doing Data Mining Out of Order:

Data mining often requires more creativity and "art" to re-work the data than we would like, ... but unfortunately data doesn't always cooperate in this way, and we therefore need to adapt to the specific data problems so that the data is better prepared.
In this project, we jumped from Business Understanding and the beginnings of Data Understanding straight to Modeling. I think in this case, I would call it "modeling" (small 'm') because we weren't building models to predict risk, but rather to understand the target variable better. We were not sure exactly how clean the data was to begin with, especially the definition of the target variable, because no one had ever looked at the data in aggregate before, only on a single customer-by-customer basis. By building models, and seeing some fields that predict the target variable 'too well', we have been able to identify historic data inconsistencies and miscoding.

One could argue that this modeling with a small "m" should always be part of data understanding. The recipe Using CHAID stumps when interviewing an SME explores how to do this kind of modeling efficiently. CHAID is a good method for exploring data. It builds wide trees that are easy for most people to read, and it treats missing data as a separate category, which invites a lot of discussion about the missing values. A stump is simply a tree that has been grown only to the first branch. As we shall see, it is a good idea to grow a decision stump for each of the top 10 inputs as well as for any variables of interest to the SME. It is a structured, powerful, and even enjoyable way to work through data understanding.
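The idea of a stump can be sketched outside Modeler in a few lines of Python. This is an illustration only, not the CHAID algorithm as implemented in the CHAID node: real CHAID also merges similar categories and uses adjusted p-values, both of which are skipped here. The field names and the twelve records are hypothetical. The sketch computes a Pearson chi-square statistic for one input against the target, keeps missing values as their own category, and reports the target distribution in each branch.

```python
from collections import Counter, defaultdict


def chi_square(rows, field, target):
    """Pearson chi-square statistic for a field-by-target contingency table.

    Missing values (None) are kept as their own category, "<missing>".
    """
    table = defaultdict(Counter)
    for row in rows:
        value = row[field] if row[field] is not None else "<missing>"
        table[value][row[target]] += 1
    n = len(rows)
    col_totals = Counter(row[target] for row in rows)
    stat = 0.0
    for counts in table.values():
        row_total = sum(counts.values())
        for outcome, col_total in col_totals.items():
            expected = row_total * col_total / n
            stat += (counts[outcome] - expected) ** 2 / expected
    return stat, table


def stump(rows, field, target):
    """One-level 'tree': the target distribution within each category of `field`."""
    stat, table = chi_square(rows, field, target)
    branches = {
        value: {outcome: count / sum(counts.values())
                for outcome, count in counts.items()}
        for value, counts in table.items()
    }
    return stat, branches


# Hypothetical data: region is missing for four customers, all of whom churned.
rows = (
    [{"region": "north", "churn": "no"}] * 3
    + [{"region": "north", "churn": "yes"}] * 1
    + [{"region": "south", "churn": "no"}] * 3
    + [{"region": "south", "churn": "yes"}] * 1
    + [{"region": None, "churn": "yes"}] * 4
)

stat, branches = stump(rows, field="region", target="churn")
print(f"chi-square = {stat:.2f}")
for value, distribution in branches.items():
    print(value, distribution)
```

In this made-up example the missing-value branch contains only churners. A branch like that is the sort of finding worth bringing to an SME, since it may point to a data collection quirk rather than a genuine pattern.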

Dean also wrote:

Now that we have the target variable better defined, I'm going back to the data understanding and data prep stages to complete those stages properly, and this is changing how the data will be prepped in addition to modifying the definition of the target variable. It's also much more enjoyable to build models than do data prep.

It is always wise to consider writing an interim report when you near completion of a phase. A data understanding report can be a great way to protect yourself against accusations that you failed to include variables of interest in a model. It is in this phase that you will start to determine what you actually have at your disposal, and what information you might not be able to get. The Outliers (quirk) report, and the exact logic you used to choose your subset, are precisely the kind of information that you would want to include in such a report.
