IBM SPSS Modeler Cookbook

Translating your business objective into a data mining objective by Dean Abbott


The business objectives use business language to describe the purpose of the data mining project. However, business objectives are not sufficiently specific to build predictive models; business objectives must be translated into data mining goals. These data mining objectives should be expressed in the language of data mining or data mining software so that the objectives are clear and reproducible.

For example, let's assume the federal government is trying to crack down on government-contracting invoice fraud. A broad business objective may be to identify fraudulent invoices more effectively from the millions of invoices submitted annually. A more specific business objective may be to develop predictive models to identify 100 invoices per month for investigators to examine that are highly likely to be fraudulent.

For the former, the business objective can be translated into a data mining objective such as building classification models to predict the likelihood of an invoice being fraudulent. Note that this definition not only defines what type of model will be built (classification) but also the level of analysis to be used (each record is an invoice, rather than an invoice payee).

One could be more specific in the data mining objective to clarify the nature of the prediction. Rather than labeling each invoice with a single binary outcome of fraudulent versus nonfraudulent, if we hypothesize that more accurate models can be built when the specific type of fraud is predicted, the data mining objective can be rephrased as building classification models to predict the likelihood of an invoice belonging to each of the four types of invoice fraud.

One other aspect to consider in creating the data mining objective is that data mining projects require data miners with particular skill sets. The tuning of the business objective should keep in mind who will be performing the analysis, so that the required skill set matches that of the analyst.

The key to the translation – specifying target variables

As we have seen, the data mining objective(s) is a translation of the business objective, worded in the language of data mining. Latent in that objective is perhaps its most critical component: the specification of one or more target variables. The very process of creating one or more columns in the data to serve as target variables requires a clarity and specificity that can be finessed when the data mining objective is described only in words.

In the invoice fraud example, the very definition of fraud is key to the data mining modeling process. Two definitions are often considered in fraud detection. The first is the strict one, labeling an invoice as fraudulent if and only if the case has been prosecuted and the payee of the invoice has been convicted of fraud. The second is a looser one, labeling an invoice as fraudulent if it has been identified by one or more managers or agents as worth investigating further. Under the second definition, the invoice has failed the "smell test," but there is no proof yet that it is fraudulent. Many more potential target variable definitions exist, but these two are each reasonable definitions for our consideration here.
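
To make the choice concrete, here is a minimal sketch of how the two candidate target variables might be derived as columns in the data; the toy table, the column names (convicted_fraud, flagged_by_agent), and the use of pandas are assumptions for illustration, not part of the original example.

import pandas as pd

# Hypothetical invoice data; these columns are assumptions for illustration only.
invoices = pd.DataFrame({
    "invoice_id":       [101, 102, 103, 104],
    "convicted_fraud":  [True, False, False, False],   # case prosecuted and payee convicted
    "flagged_by_agent": [True, True, False, True],     # flagged as worth investigating further
})

# Definition 1 (strict): fraudulent if and only if the case led to a conviction.
invoices["target_strict"] = invoices["convicted_fraud"].astype(int)

# Definition 2 (loose): fraudulent if any manager or agent flagged the invoice.
invoices["target_loose"] = invoices["flagged_by_agent"].astype(int)

print(invoices[["invoice_id", "target_strict", "target_loose"]])

Note how, on the same records, the loose definition labels three of the four invoices as fraudulent while the strict definition labels only one; a different target describes a different modeling problem.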

Note that there are advantages and disadvantages of each option. The primary advantage of the first definition is clarity; all of those labeled 1 are clearly fraudulent. However, there are also several disadvantages. First, some invoices may have been fraudulent, but they did not meet the standard for a successful prosecution. Some may have been dismissed based on technicalities. Others may have had potential but were too complex to prosecute efficiently. Still others may have shown potential, but the agency did not have sufficient resources to pursue the case. The result is that many 0 values really have potential but are labeled the same as those cases that have no potential at all. In fact, there may be more of these ambiguous nonfraudulent invoices than there are fraudulent invoices in the data.

On the other hand, if we use the second definition, many cases labeled 1 may not be fraudulent after all, even if they appear suspicious at first glance. In other words, some "fraudulent" labels are applied prematurely; if we had waited long enough for the case to proceed, it would have become clear that the invoice was not fraudulent after all. Relaxed definitions of fraud can increase the number of invoices labeled as fraudulent in the data by a factor of 10 or more.

There is no perfect definition of a target variable. The definition should be formulated to match the business objective as completely as possible and to meet the core business objectives of the organization. It may be that no target variable definition matches the business objective exactly, and a compromise target variable must instead be selected. It is often the case that these compromises, while not desirable, still enable the organization to build models that improve upon already established practices, and thus remain valuable.

Data mining success criteria – measuring how good the models actually are

The determination of what counts as a good model is project-dependent and rests on the business success criterion or criteria. If the purpose of the model is to provide highly accurate predictions or decisions for the business to act on, measures of accuracy will be used. If what matters most is the insight the model gives into the business, accuracy measures will not be used; instead, subjective measures of what provides maximum insight may be most desirable. Some projects may use a combination of both, so that the most accurate model is not selected when a more transparent model with nearly the same accuracy is available.

Success criteria for classification

For classification problems, the most frequent metrics for model selection in data mining include Percent Correct Classification (PCC); confusion matrix metrics such as precision and recall, sensitivity and specificity, Type I and Type II errors, and false alarms and false dismissals; and rank-ordered metrics such as Lift, Gain, ROC, and Area Under the Curve (AUC). AUC can be computed from any of the rank-ordered metrics.
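
As a rough sketch of how these classification criteria might be computed, the snippet below uses scikit-learn on made-up labels and scores; the data, the 0.5 threshold, and the library choice are assumptions for illustration only.

import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)

# Toy ground truth (1 = fraudulent) and model scores; values are illustrative only.
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.10, 0.40, 0.80, 0.30, 0.20, 0.90, 0.60, 0.05])  # predicted probabilities
y_pred  = (y_score >= 0.5).astype(int)                                 # classify at a 0.5 cutoff

pcc = accuracy_score(y_true, y_pred)                       # Percent Correct Classification
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # confusion matrix quadrants
precision   = precision_score(y_true, y_pred)              # tp / (tp + fp)
recall      = recall_score(y_true, y_pred)                 # sensitivity, tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, y_score)                       # area under the ROC curve

print(pcc, precision, recall, specificity, auc)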

PCC and the confusion matrix metrics are good when an entire population must be scored and acted on; medical diagnoses are an example of this. If only a subset of the population will be treated, models can instead be assessed on how well they rank-order the population so that only those in the "select" group are acted on, using metrics such as Lift, Gain, ROC, and AUC.

Also, any number of customized cost functions can be created from the quadrants of a confusion matrix. Most commonly, practitioners will weight the quadrants to emphasize some errors over others as being particularly unwelcome. If one would like to reduce false alarms, for example, one could weight these twice as much as false dismissals and create a single score based on the custom formula.
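
A hedged sketch of such a custom score is shown below, weighting false alarms twice as heavily as false dismissals; the weights and the confusion-matrix counts are made-up values for illustration.

def weighted_error_cost(tn, fp, fn, tp, false_alarm_weight=2.0, false_dismissal_weight=1.0):
    # Custom cost built from the confusion-matrix quadrants: lower is better.
    # False alarms (fp) are weighted more heavily than false dismissals (fn).
    return false_alarm_weight * fp + false_dismissal_weight * fn

# Compare two hypothetical models evaluated on the same validation data.
cost_model_a = weighted_error_cost(tn=900, fp=40, fn=60, tp=100)   # 2*40 + 60 = 140
cost_model_b = weighted_error_cost(tn=880, fp=60, fn=30, tp=130)   # 2*60 + 30 = 150
print(cost_model_a, cost_model_b)   # model A wins under this particular weighting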

Success criteria for estimation

For continuous-valued estimation problems, metrics often used for assessing models are R^2, average error, Mean Squared Error (MSE), median error, average absolute error, and median absolute error. In each of these metrics, one first computes the error of an estimate, which is the actual value minus the predicted estimate. The metric then aggregates these errors over all the records in the data.
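
The sketch below computes these estimation metrics directly with NumPy on made-up actual and predicted values; the numbers are purely illustrative.

import numpy as np

# Illustrative actual and predicted values for a continuous target.
actual    = np.array([10.0, 12.5, 8.0, 15.0, 11.0])
predicted = np.array([ 9.0, 13.0, 7.5, 14.0, 12.5])

errors = actual - predicted                         # error = actual value minus predicted estimate

average_error          = errors.mean()              # signed; reveals systematic over/underestimation
median_error           = np.median(errors)
mse                    = (errors ** 2).mean()       # Mean Squared Error
average_absolute_error = np.abs(errors).mean()
median_absolute_error  = np.median(np.abs(errors))

# R^2: proportion of the variance in the actual values explained by the predictions.
r_squared = 1 - (errors ** 2).sum() / ((actual - actual.mean()) ** 2).sum()

print(average_error, median_error, mse, average_absolute_error, median_absolute_error, r_squared)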

Average errors can be useful in determining whether the models are biased toward positive or negative errors. Average absolute errors are useful in estimating the magnitude of the errors (whether positive or negative). Analysts most often examine not only the overall value of the success criterion, but also the entire range of predicted values, by considering scatter plots of actual versus predicted values or actual values versus residuals (errors).

In principle, one can also include rank-ordered metrics such as AUC and Gain as candidate success criteria for estimation problems, though data mining software often does not provide them for estimation. In these instances, one needs to create a customized success criterion.
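
One way to build such a customized, rank-ordered criterion for an estimation problem is sketched below: it measures the share of the total actual value captured by the records with the highest predicted values, a simple Gains-style statistic. The 10 percent cutoff and the simulated data are illustrative assumptions.

import numpy as np

def top_fraction_gain(actual, predicted, fraction=0.10):
    # Share of the total actual value captured by the top-ranked predictions.
    n_top = max(1, int(len(predicted) * fraction))
    top_idx = np.argsort(predicted)[::-1][:n_top]    # rank records by prediction, descending
    return actual[top_idx].sum() / actual.sum()

# Illustrative example: 100 records with noisy predictions of a continuous target.
rng = np.random.default_rng(0)
actual = rng.gamma(shape=2.0, scale=50.0, size=100)
predicted = actual + rng.normal(0.0, 25.0, size=100)

print(top_fraction_gain(actual, predicted))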

Other customized success criteria

Sometimes none of the typical success criteria are sufficient to evaluate predictive models because they do not match the business objective. Consider the invoice fraud example described earlier. Let's assume that the purpose of the model is to identify 100 invoices per month to investigate from the hundreds of thousands of invoices submitted. If we build classification models and select the one that maximizes PCC, we can be fooled into thinking that the best model as assessed by PCC is good, even though none of the top 100 invoices are good candidates for investigation. How is this possible? If there are 100,000 invoices submitted in a month, we are selecting only 0.1 percent of them for investigation. The model could be perfect for 99.9 percent of the population and still miss what we care about most: the top 100.

In situations such as this one, when the organization has very specific needs, it is best to consider customized cost functions. In this instance, we want to identify a population of 100 invoices that maximizes the chance of these 100 invoices being true alerts (not false alarms). What metric does this? No standard metric addresses this directly, though ROC curves come close to the idea. Instead, the best way to rank the models is the direct method, that is, pick the model that maximizes the true fraud alert rate in the top 100 invoices of the scored population, ignoring the rest of the population. Data miners should adjust their algorithm settings appropriately to focus the classifiers on accuracy at the top of the predicted probabilities, such as by weighting the cost of false alarms more heavily than that of missed true alerts.
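
A direct implementation of this criterion is sketched below: score the population, take the 100 highest-scoring invoices, and measure what fraction of them are known frauds. The simulated population and scores are assumptions for illustration only.

import numpy as np

def true_alert_rate_at_k(y_true, y_score, k=100):
    # Fraction of true frauds among the k invoices with the highest predicted scores,
    # ignoring the rest of the population.
    top_k = np.argsort(y_score)[::-1][:k]
    return y_true[top_k].mean()

# Illustrative scored population: 100,000 invoices, roughly 0.5 percent truly fraudulent.
rng = np.random.default_rng(42)
y_true  = (rng.random(100_000) < 0.005).astype(int)
y_score = 0.2 * y_true + rng.random(100_000)     # toy scores loosely related to the truth

print(true_alert_rate_at_k(y_true, y_score, k=100))

One model is then preferred over another simply because it produces a higher value of this rate on validation data, regardless of how the two compare on PCC or AUC.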

Other candidates for customized scoring functions include Return On Investment (ROI) or profit, where there is a fixed or variable cost associated with the treatment of a customer or transaction (a record in the data), and a fixed or variable return or benefit if the customer responds favorably. For example, if one is building a customer acquisition model, the cost is typically the fixed cost of mailing or calling the individual, and the return is the estimated value of acquiring a new customer. For fraud detection, there is a cost associated with investigating the invoice or claim, and a gain associated with the successful recovery of the fraudulent dollar amount.
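
A hedged sketch of such an ROI-style criterion for the fraud example follows, where each investigation carries a fixed cost and a successful one recovers the fraudulent amount; the cost figure and invoice amounts are made-up assumptions.

import numpy as np

def net_return(selected_is_fraud, invoice_amounts, investigation_cost=500.0):
    # Net return from investigating a selected set of invoices:
    # recovered fraudulent dollars minus a fixed cost per investigation.
    recovered = (selected_is_fraud * invoice_amounts).sum()
    cost = investigation_cost * len(selected_is_fraud)
    return recovered - cost

# Illustrative top-100 selection from a scored population.
rng = np.random.default_rng(7)
is_fraud = (rng.random(100) < 0.6).astype(int)      # 60 percent of selected invoices are true frauds
amounts  = rng.uniform(1_000, 20_000, size=100)     # invoice dollar amounts

print(net_return(is_fraud, amounts))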

Note that for many customized success criteria, the actual predicted values are not nearly as important as the rank order of the predicted values. If one computes the cumulative net revenue as a customized cost function associated with a model, the predicted probability may never enter into the final report, except as a means to threshold the population into the "select" group (that is to be treated) and the "nonselect" group.