IBM SPSS Modeler Cookbook

Overview of this book

IBM SPSS Modeler is a data mining workbench that enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions on hard data, not hunches or guesswork. IBM SPSS Modeler Cookbook takes you beyond the basics and shares the tips, the timesavers, and the workarounds that experts use to increase productivity and extract maximum value from data. The authors of this book are among the very best of these exponents, gurus who, in their brilliant and imaginative use of the tool, have pushed back the boundaries of applied analytics. By reading this book, you are learning from practitioners who have helped define the state of the art. Follow the industry standard data mining process, gaining new skills at each stage, from loading data to integrating results into everyday business practices. Get a handle on the most efficient ways of extracting data from your own sources and preparing it for exploration and modeling. Master the best methods for building models that will perform well in the workplace. Go beyond the basics and get the full power of your data mining workbench with this practical guide.

Using a single cluster K-means as an alternative to anomaly detection

Cleaning data includes detecting and eliminating outliers. When outliers are viewed as a property of individual variables, it is easy to examine a data set, one variable at a time, and identify which records fall outside the usual range for a given variable. However, from a multivariate point of view, the concept of an outlier is less obvious; individual values may fall within accepted bounds but a combination of values may still be unusual.

The concept of multivariate outliers is used a great deal in anomaly detection, which can be applied both to data cleaning and, more directly, to applications such as fraud detection. Clustering techniques are often used for this purpose; in effect, a clustering model defines different kinds of normal (the different clusters), and items falling outside these definitions may be considered anomalous. Anomaly detection techniques based on clustering range from the sophisticated, perhaps using multiple clustering models and comparing the results, through single-model approaches such as the use of TwoStep in Modeler's Anomaly algorithm, to the very simple.

The simplest kind of anomaly detection with clustering is to create a cluster model with only one cluster. The distance of a record from the cluster center can then be treated as a measure of anomaly, unusualness or outlierhood. This recipe shows how to use a single-cluster K-means model in this way, and how to analyze the reasons why certain records are outliers.
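The idea can be sketched outside Modeler in a few lines of Python. This is a minimal illustration, not a description of Modeler's internals: with only one cluster, K-means converges to the mean of the records, so the sketch computes the centroid directly. The min-max scaling, the function names, and the sample records are all assumptions made for the example.

```python
import math

def min_max_normalize(rows):
    """Scale each column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def single_cluster_distances(rows):
    """Euclidean distance of each row from the centroid of all rows."""
    scaled = min_max_normalize(rows)
    n = len(scaled)
    center = [sum(col) / n for col in zip(*scaled)]
    return [math.dist(row, center) for row in scaled]

# Hypothetical records (age, income in thousands); the last one is unusual.
records = [(34, 40), (36, 42), (35, 41), (33, 39), (37, 43), (35, 120)]
distances = single_cluster_distances(records)

# Index of the record furthest from the single cluster center.
most_unusual = max(range(len(records)), key=distances.__getitem__)
```

Here `distances` plays the role of the distance field produced by the cluster model, and `most_unusual` picks out the record with the largest value.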

Getting ready

This recipe uses the following files:

Data file: cup98LRN.txt
Stream file: Single_Cluster_Kmeans.str
Clementine output file: Histogram.cou

How to do it...

To use a single cluster K-means as an alternative to anomaly detection:

Open the stream Single_Cluster_Kmeans.str by clicking on File | Open Stream.
Edit the Type node near the top-left of the stream; note that the customer ID and zip code have been excluded from the model, and the other 5 fields have been included as inputs.
Run the Histogram node $KMD-K-Means to show the distribution of distances from the cluster center. Note that a few records are grouped towards the upper end of the range.
Open the output file Histogram.cou: select the Outputs tab at the top-right of the user interface, right-click in this pane to see the pop-up menu, select Open Output from this menu, then browse to and select the file Histogram.cou. You will see the graph in the following figure, including a boundary (the red line) that was placed manually to identify the area of the graph that, visually, appears to contain outliers. The band to the right of this line was used to generate the Select node and Derive node included in the stream, both labeled band2.
Run the Table node outliers; this displays the 8 records we have identified as outliers from the histogram, including their distance from the cluster center, as shown in the following screenshot. Note that they are all from the same cluster because there is only one cluster.
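The histogram-and-band steps above can be mimicked roughly as follows. This is only a sketch under stated assumptions: the distances, the number of bins, and the boundary value of 0.6 are hypothetical, and the boundary is still chosen by eye, as in the recipe.

```python
def histogram_counts(values, bins=4):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

def flag_band(values, boundary):
    """True for records whose distance lies to the right of the boundary."""
    return [v > boundary for v in values]

# Hypothetical distances from the cluster center.
distances = [0.20, 0.25, 0.22, 0.30, 0.28, 0.24, 0.95, 1.00]

counts = histogram_counts(distances)      # empty middle bins reveal the gap
band2 = flag_band(distances, boundary=0.6)  # boundary placed manually in the gap
outliers = [d for d, flagged in zip(distances, band2) if flagged]
```

The list `band2` corresponds to the Boolean field created by the Derive node, and `outliers` to the records passed by the Select node.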

So far we have used the single-cluster K-means model to identify outliers, but why are they outliers? We can profile them by building a rule-set model with the C5.0 algorithm that distinguishes records that are in band2 from those that are not. This is a common technique used in Modeler to find explanations for the behavior of clustering models that are difficult to interrogate directly. The following steps show how:

Edit the Type node near the lower-right of the stream, as shown in the following screenshot. This is used to create the C5.0 rule-set model; note that the inputs are the same as for the initial cluster model, both outputs of the cluster model have been excluded, and the target is the derived field band2, a Boolean that identifies the outliers.
Browse the C5.0 model band2, then use the Model pane to see all the rules and their statistics, as shown in the following screenshot. All the rules are highly accurate; even though they are not perfect, this is a successful profiling model in that it can distinguish reliably between outliers and others. This model shows how the cluster model has defined outliers: those records that have the rare values U and J for the GENDER field. The even rarer value C has not been identified, because its single occurrence was insufficient to have an impact on the model.
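As a rough stand-in for the profiling step, the sketch below tabulates how often each value of a categorical field coincides with the outlier flag. It is not C5.0 and does not build a rule set; it only shows the kind of single-field rule the recipe reports. The function name, the sample GENDER values, the flags, and the 0.9 cut-off are all hypothetical.

```python
from collections import Counter

def outlier_rate_by_value(values, is_outlier):
    """For each category value, return (count, share flagged as outlier)."""
    totals = Counter(values)
    flagged = Counter(v for v, flag in zip(values, is_outlier) if flag)
    return {v: (totals[v], flagged[v] / totals[v]) for v in totals}

# Hypothetical GENDER values and band2 flags for ten records.
gender = ["F", "M", "F", "U", "M", "F", "J", "M", "F", "U"]
band2 = [False, False, False, True, False, False, True, False, False, True]

profile = outlier_rate_by_value(gender, band2)

# Values that almost always coincide with the outlier flag read like
# simple rules of the form "GENDER = value implies band2 is true".
rules = sorted(v for v, (_, rate) in profile.items() if rate >= 0.9)
```

In this made-up sample, `rules` contains the two rare values, mirroring the pattern the recipe describes for U and J.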

How it works...

Imagine a five-dimensional scatter plot of the 5 normalized variables used for the cluster model. The records from the data set appear as a clump, and somewhere within that clump is its center of gravity. Some items fall at the edges of this clump; some may be visually outside it. The clump is the cluster discovered by K-means, and the items falling visually outside the clump are outliers.

Assuming the clump to be roughly spherical, the items outside the clump will be those at the greatest distance from its center, with a gap between them and the edge of the clump. This corresponds to the gap in the histogram, which we used to place the boundary manually and so identify the band of outliers. The C5.0 rule-set is a convenient way to see a description of these outliers, more specifically how they differ from items inside the clump.
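A short derivation suggests why rare category values tend to sit far from the center. It assumes, purely for illustration and not as a statement about Modeler's internals, that a categorical field with values k = 1, ..., K is encoded as K indicator columns, so that the center's coordinate on column k is p_k, the share of records with value k. The symbols p_k, v, and d_v are introduced here and do not appear in the recipe. For a record whose value is v, the field's contribution to the squared distance from the center is:

```latex
d_v^2 = \sum_{k=1}^{K} \bigl(\mathbb{1}[k = v] - p_k\bigr)^2
      = (1 - p_v)^2 + \sum_{k \neq v} p_k^2
      = 1 - 2 p_v + \sum_{k=1}^{K} p_k^2
```

The sum over all p_k squared is the same for every record, so the contribution shrinks as p_v grows: records with a common value lie closer to the center on this field, and records with a rare value lie further away.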

There's more...

The final step mentions that the unique value C in the GENDER field has not been discovered in this instance because it is too rare to have an impact on the model. In fact, it is only too rare to have an impact on the relatively simplistic single-cluster model. It is possible for a K-means model to discover this outlier, and it will do so if used with its default setting of 5 clusters. This illustrates that the technique of using the distance from the cluster center to find outliers is more general than the single-cluster technique and can be used with any K-means model, or any clustering model that can output this distance.
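The more general form of the technique, scoring each record by its distance from the nearest cluster center, can be sketched with a small hand-written K-means. This is a plain Lloyd's iteration written for illustration, not Modeler's implementation; the starting centers, iteration count, and sample points are assumptions.

```python
import math

def k_means(points, centers, iterations=20):
    """Plain Lloyd's algorithm starting from the given centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members;
        # keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else c
            for members, c in zip(clusters, centers)
        ]
    return centers

def distance_to_nearest_center(points, centers):
    """Anomaly score: distance from each point to its closest center."""
    return [min(math.dist(p, c) for c in centers) for p in points]

# Hypothetical 2-D data: two tight groups plus one stray point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),
          (0.5, 0.4)]
centers = k_means(points, centers=[points[0], points[3]])
scores = distance_to_nearest_center(points, centers)
stray = max(range(len(points)), key=scores.__getitem__)
```

With two clusters, the stray point between the groups receives the largest score, even though it joins one of the clusters.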
