IBM SPSS Modeler Cookbook

Overview of this book

IBM SPSS Modeler is a data mining workbench that enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions on hard data rather than hunches or guesswork. IBM SPSS Modeler Cookbook takes you beyond the basics and shares the tips, timesavers, and workarounds that experts use to increase productivity and extract maximum value from data. The authors are among the most accomplished of these experts, practitioners whose imaginative use of the tool has pushed back the boundaries of applied analytics and helped define the state of the art. Follow the industry-standard data mining process, gaining new skills at each stage, from loading data to integrating results into everyday business practices. Learn the most efficient ways of extracting data from your own sources and preparing it for exploration and modeling. Master the best methods for building models that will perform well in the workplace, and get the full power of your data mining workbench with this practical guide.

Credits

Foreword

About the Authors

About the Reviewers

www.PacktPub.com

Preface

Data Understanding

Introduction

Using an empty aggregate to evaluate sample size

Evaluating the need to sample from the initial data

Using CHAID stumps when interviewing an SME

Using a single cluster K-means as an alternative to anomaly detection

Using an @NULL multiple Derive to explore missing data

Creating an Outlier report to give to SMEs

Detecting potential model instability early using the Partition node and Feature Selection node

Data Preparation – Select

Introduction

Using the Feature Selection node creatively to remove or decapitate perfect predictors

Running a Statistics node on anti-join to evaluate the potential missing data

Evaluating the use of sampling for speed

Removing redundant variables using correlation matrices

Selecting variables using the CHAID Modeling node

Selecting variables using the Means node

Selecting variables using single-antecedent Association Rules

Data Preparation – Clean

Introduction

Binning scale variables to address missing data

Using a full data model/partial data model approach to address missing data

Imputing in-stream mean or median

Imputing missing values randomly from uniform or normal distributions

Using random imputation to match a variable's distribution

Searching for similar records using a Neural Network for inexact matching

Using neuro-fuzzy searching to find similar names

Producing longer Soundex codes

Data Preparation – Construct

Introduction

Building transformations with multiple Derive nodes

Calculating and comparing conversion rates

Grouping categorical values

Transforming high skew and kurtosis variables with a multiple Derive node

Creating flag variables for aggregation

Using Association Rules for interaction detection/feature creation

Creating time-aligned cohorts

Data Preparation – Integrate and Format

Introduction

Speeding up merge with caching and optimization settings

Merging a lookup table

Shuffle-down (nonstandard aggregation)

Cartesian product merge using key-less merge by key

Multiplying out using Cartesian product merge, user source, and derive dummy

Changing large numbers of variable names without scripting

Parsing nonstandard dates

Parsing and performing a conversion on a complex stream

Sequence processing

Selecting and Building a Model

Introduction

Evaluating balancing with Auto Classifier

Building models with and without outliers

Using Neural Network for Feature Selection

Creating a bootstrap sample

Creating bagged logistic regression models

Using KNN to match similar cases

Using Auto Classifier to tune models

Next-Best-Offer for large datasets

Modeling – Assessment, Evaluation, Deployment, and Monitoring

Introduction

How (and why) to validate as well as test

Using classification trees to explore the predictions of a Neural Network

Correcting a confusion matrix for an imbalanced target variable by incorporating priors

Using aggregate to write cluster centers to Excel for conditional formatting

Creating a classification tree financial summary using aggregate and an Excel Export node

Reformatting data for reporting with a Transpose node

Changing formatting of fields in a Table node

Combining generated filters

CLEM Scripting

Introduction

Building iterative Neural Network forecasts

Quantifying variable importance with Monte Carlo simulation

Implementing champion/challenger model management

Detecting outliers with the jackknife method

Optimizing K-means cluster solutions

Automating time series forecasts

Automating HTML reports and graphs

Rolling your own modeling algorithm – Weibull analysis

Business Understanding

Introduction

Define business objectives by Tom Khabaza

Assessing the situation by Meta Brown

Translating your business objective into a data mining objective by Dean Abbott

Produce a project plan – ensuring a realistic timeline by Keith McCormick

Index

Creating an Outlier report to give to SMEs

Data miners commonly have to rely on others to provide data, interpret it, or both. Even when the data miner is working with data from their own organization, there will be input variables that they don't have direct access to, or that are outside their day-to-day experience.

Are zero values normal? What about negative values? Null values? Are 1500 balance inquiries in a month even possible? How could a wallet cost $19,500? The concept of outliers is something that all analysts are familiar with. Even novice users of Modeler could easily find a dozen ways of identifying some. This recipe is about identifying outliers systematically and quickly so that you can produce a report designed to inspire curiosity.

There is no presumption that the values are in error, or that they should be removed. It is simply an attempt to put the information in the hands of Subject Matter Experts (SMEs), so quirky values can be discussed in the earliest phases of the project. It is important to provide whichever primary keys are necessary for the SMEs to look up the records. On one of the author's recent projects, the team started calling these reports quirk reports.

Getting ready

We will start with the Outlier Report.str stream that uses the TELE_CHURN_preprep data set.

How to do it...

To create an Outlier report:

1. Open the stream Outlier Report.str.
2. Add a Data Audit node and examine the results.
3. Adjust the stream options to allow 25 rows to be shown in a data preview. We will be using the preview feature later in the recipe.
4. Add a Statistics node. Choose Mean, Min, Max, and Median for the variables DATA_gb, PEAK_mins, and TEXT_count. These three have either unusually high maximums or surprising negative values, as shown in the Data Audit node.
5. Consider taking a screenshot of the Statistics node for later use.
6. Add a Sort node. Starting with the first variable, DATA_gb, sort in ascending order.
7. Add a Filter node downstream of the Sort node, dropping CHURN, DROPPED_CALLS, and LATE_PAYMENTS. It is important to work with your SME to know which variables put quirky values into context.
8. Preview the Filter node to see the records with the lowest DATA_gb values.
9. Reverse the sort, now choosing descending order, and preview the Filter node. Consider taking a screenshot of the preview for later use.
10. Sort in descending order on the next variable, PEAK_mins. Preview the Filter node.
11. Finally, sort on the variable TEXT_count in descending order and preview the Filter node.
12. Examine Outliers.docx to see an example of what this might look like in Word.
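
For readers who want to see the logic of the steps outside Modeler, here is a rough sketch in plain Python. It is an analogue, not the Modeler stream itself. The records are made up for illustration, CUSTOMER_ID is a hypothetical primary key, and the function names are our own; only the field names DATA_gb, PEAK_mins, TEXT_count, CHURN, DROPPED_CALLS, and LATE_PAYMENTS come from the recipe.

```python
from statistics import mean, median

# Made-up records standing in for TELE_CHURN_preprep.
# CUSTOMER_ID is a hypothetical key; the real data set may name it differently.
records = [
    {"CUSTOMER_ID": 101, "DATA_gb": 2.5, "PEAK_mins": 340, "TEXT_count": 120, "CHURN": 0},
    {"CUSTOMER_ID": 102, "DATA_gb": -1.0, "PEAK_mins": 15, "TEXT_count": 0, "CHURN": 1},
    {"CUSTOMER_ID": 103, "DATA_gb": 0.0, "PEAK_mins": 9800, "TEXT_count": 45, "CHURN": 0},
    {"CUSTOMER_ID": 104, "DATA_gb": 310.0, "PEAK_mins": 410, "TEXT_count": 15000, "CHURN": 0},
    {"CUSTOMER_ID": 105, "DATA_gb": 4.0, "PEAK_mins": 275, "TEXT_count": 88, "CHURN": 1},
]

AUDIT_FIELDS = ["DATA_gb", "PEAK_mins", "TEXT_count"]
DROP_FIELDS = {"CHURN", "DROPPED_CALLS", "LATE_PAYMENTS"}
PREVIEW_ROWS = 25  # mirrors the 25-row preview setting


def summarize(rows, field):
    """Rough equivalent of the Statistics node: mean, min, max, median."""
    values = [r[field] for r in rows]
    return {
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
        "median": median(values),
    }


def preview(rows, field, descending=False, limit=PREVIEW_ROWS):
    """Rough equivalent of Sort -> Filter -> preview.

    Assumes the sort field has no missing values; None would need
    separate handling before sorting.
    """
    ordered = sorted(rows, key=lambda r: r[field], reverse=descending)
    return [
        {k: v for k, v in r.items() if k not in DROP_FIELDS}
        for r in ordered[:limit]
    ]


summary = {f: summarize(records, f) for f in AUDIT_FIELDS}
lowest_data = preview(records, "DATA_gb")
highest_data = preview(records, "DATA_gb", descending=True)
highest_peak = preview(records, "PEAK_mins", descending=True)
highest_text = preview(records, "TEXT_count", descending=True)
```

Each preview keeps the key column, so an SME can look up the exact records behind a suspicious minimum or maximum.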

How it works...

There is no deep theoretical foundation to this recipe; it is as straightforward as it seems. It is simply a way of quickly getting information to an SME, who will not be a frequent Modeler user. Summary statistics also give them only part of the story: the min, max, mean, and median alone will not allow an SME to give you the information that you need. If there is an unusual minimum, such as a negative value, you need to know how many negatives there are, and you need at least a handful of actual examples with IDs.

An SME might look up the values in their own resources, and the net result could be the addition of more variables to the analysis. Alternatively, negative values might be turned into nulls or zeros, or they might be deemed out of scope and removed from the analysis. There is no way to know until you assess why they are negative. Sometimes values that are exactly zero are of interest. High values, NULL values, and rare categories are all of potential interest. The most important thing is to be curious (and pleasantly persistent) and to inspire collaborators to be curious as well.
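
The point about counts and concrete examples can be illustrated with a small plain-Python sketch. It runs outside Modeler. The rows are invented, CUSTOMER_ID is a hypothetical key name, and quirk_examples is our own helper rather than anything from the book's stream.

```python
# Invented rows for illustration; CUSTOMER_ID is a hypothetical primary key.
rows = [
    {"CUSTOMER_ID": 201, "DATA_gb": -1.0},
    {"CUSTOMER_ID": 202, "DATA_gb": 3.2},
    {"CUSTOMER_ID": 203, "DATA_gb": None},
    {"CUSTOMER_ID": 204, "DATA_gb": -0.4},
    {"CUSTOMER_ID": 205, "DATA_gb": 0.0},
]


def is_negative(value):
    return value is not None and value < 0


def is_zero(value):
    return value is not None and value == 0


def is_null(value):
    return value is None


def quirk_examples(data, field, is_quirky, key="CUSTOMER_ID", limit=5):
    """Count the quirky values in a field and return a few (key, value) examples."""
    hits = [r for r in data if is_quirky(r.get(field))]
    return {
        "field": field,
        "count": len(hits),
        "examples": [(r[key], r[field]) for r in hits[:limit]],
    }


negatives = quirk_examples(rows, "DATA_gb", is_negative)
zeros = quirk_examples(rows, "DATA_gb", is_zero)
nulls = quirk_examples(rows, "DATA_gb", is_null)

for report in (negatives, zeros, nulls):
    print(report["field"], report["count"], report["examples"])
```

Reporting the count alongside a handful of keyed examples gives the SME both the scale of the quirk and specific records to investigate.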
