The business objectives use business language to describe the purpose of the data mining project. However, business objectives are not sufficiently specific to build predictive models; business objectives must be translated into data mining goals. These data mining objectives should be expressed in the language of data mining or data mining software so that the objectives are clear and reproducible.
For example, let's assume the federal government is trying to crack down on government-contracting invoice fraud. A broad business objective may be to identify fraudulent invoices more effectively from the millions of invoices submitted annually. A more specific business objective may be to develop predictive models to identify 100 invoices per month for investigators to examine that are highly likely to be fraudulent.
For the former, the business objective can be translated into a data mining objective such as building classification models to predict the likelihood of an invoice being fraudulent. Note that this definition specifies not only what type of model will be built (classification) but also the level of analysis to be used (each record is an invoice, rather than an invoice payee).
One could be more specific in the data mining objective to clarify the nature of the prediction. Rather than using a single binary outcome of fraudulent versus nonfraudulent as the label for each invoice, if we hypothesize that more accurate models can be built by predicting the specific type of fraud, the data mining objective can be rephrased as building classification models to predict the likelihood of an invoice belonging to each of the four types of invoice fraud.
One other aspect to consider in creating the data mining objective is that data mining projects require data miners with particular skill sets. The wording of the data mining objective should keep in mind who will be performing the analysis, so that the skills the objective demands match those of the analyst.
As we have seen, the data mining objective(s) is a translation of the business objective, worded in the language of data mining. Latent in the data mining objective is perhaps the most critical part of the data mining objective definition: specification of one or more target variables. The very process of creating one or more columns in the data that are the target variables requires clarity and specificity in ways that can be finessed when only words are used in describing the data mining objective.
In the invoice fraud example, the very definition of fraud is key to the data mining modeling process. Two definitions are often considered in fraud detection. The first definition is the strict one, labeling an invoice as fraudulent if and only if the case has been prosecuted and the payee of the invoice has been convicted of fraud. The second definition is a looser one, labeling an invoice as fraudulent if the invoice has been identified by one or more managers or agents as worth investigating further. In the second definition, the invoice has failed the "smell test," but there is no proof yet that the invoice is fraudulent. Many more potential target variable definitions exist, but these two are each reasonable definitions for our consideration here.
Note that there are advantages and disadvantages of each option. The primary advantage of the first definition is clarity; all of those labeled 1 are clearly fraudulent. However, there are also several disadvantages. First, some invoices may have been fraudulent, but they did not meet the standard for a successful prosecution. Some may have been dismissed based on technicalities. Others may have had potential but were too complex to prosecute efficiently. Still others may have shown potential, but the agency did not have sufficient resources to pursue the case. The result is that many 0 values really have potential but are labeled the same as those cases that have no potential at all. In fact, there may be more of these ambiguous nonfraudulent invoices than there are fraudulent invoices in the data.
On the other hand, if we use the second definition, many cases labeled 1 may not be fraudulent after all, even if they appear suspicious at first glance. In other words, some "fraudulent" labels are assigned prematurely; if we had waited long enough for the case to proceed, it would have become clear that the invoice was not fraudulent after all. Relaxed definitions of fraud can increase the number of invoices labeled as fraudulent in the data by a factor of 10 or more.
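The two competing definitions can be made concrete as label-construction rules. The sketch below is illustrative only; the field names (case_status, flagged_by_agent) are hypothetical stand-ins for whatever case-tracking data the agency actually holds.

```python
# Deriving two candidate target variables from hypothetical case records.
# Strict definition: 1 only if prosecution ended in a conviction.
# Loose definition: 1 if any manager or agent flagged the invoice.

def strict_label(invoice):
    return 1 if invoice.get("case_status") == "convicted" else 0

def loose_label(invoice):
    return 1 if invoice.get("flagged_by_agent") else 0

invoices = [
    {"id": 1, "case_status": "convicted", "flagged_by_agent": True},
    {"id": 2, "case_status": "dismissed", "flagged_by_agent": True},   # ambiguous case
    {"id": 3, "case_status": None,        "flagged_by_agent": False},
]

strict = [strict_label(inv) for inv in invoices]  # [1, 0, 0]
loose = [loose_label(inv) for inv in invoices]    # [1, 1, 0]
```

Note how invoice 2, dismissed in court but flagged as suspicious, is a 0 under the strict definition and a 1 under the loose one; this single record illustrates both the ambiguous-zero problem and the premature-one problem described above.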
There is no perfect definition of a target variable. The definition should be formulated to match the business objective as completely as possible and to meet the core business objectives of the organization. It may be the case that the best match of a target variable definition with the business doesn't exist, and a compromise target variable must instead be selected. It is often the case that these compromises, while not desirable, enable the organization to build models that improve upon already established practices, and thus are still valuable.
The determination of what constitutes a good model is project-dependent and depends on the business success criterion or criteria. If the purpose of the model is to provide highly accurate predictions or decisions to be used by the business, measures of accuracy will be used. If insight into the business is what is of most interest, accuracy measures will not be used; instead, subjective measures of what provides maximum insight may be most desirable. Some projects may use a combination of both, so that the most accurate model is not selected if a slightly less accurate but more transparent model is available.
For classification problems, the most frequent metrics for model selection in data mining include Percent Correct Classification (PCC); confusion matrix metrics such as precision and recall, sensitivity and specificity, Type I and Type II errors, and false alarms and false dismissals; and rank-ordered metrics such as Lift, Gain, ROC, and Area Under the Curve (AUC). AUC can be computed from any of the rank-ordered metrics.
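The confusion-matrix metrics listed above can all be computed from the four quadrant counts. A minimal sketch, using illustrative counts rather than any real model's output:

```python
# Compute the standard confusion-matrix metrics from the four quadrants:
# tp = true positives, fp = false positives (false alarms),
# fn = false negatives (false dismissals), tn = true negatives.
def confusion_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "pcc": (tp + tn) / total,               # Percent Correct Classification
        "precision": tp / (tp + fp),            # fraction of predicted positives that are correct
        "recall": tp / (tp + fn),               # a.k.a. sensitivity
        "specificity": tn / (tn + fp),
        "false_alarm_rate": fp / (fp + tn),     # Type I error rate
        "false_dismissal_rate": fn / (fn + tp), # Type II error rate
    }

# Illustrative counts for a rare-positive problem such as fraud
m = confusion_metrics(tp=40, fp=10, fn=20, tn=930)
# m["pcc"] is 0.97, m["precision"] is 0.8, m["recall"] is about 0.667
```

Note that PCC is 0.97 here largely because negatives dominate; precision and recall tell a less flattering story, which is why the choice of metric matters for rare-event problems like fraud.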
PCC and the confusion matrix metrics are good when an entire population must be scored and acted on. Medical diagnoses are an example of this. If one will treat only a subset of the population, rank-ordering the population and acting on only a portion of those in that "select" group can be accomplished through metrics such as Lift, Gain, ROC, and AUC.
Also, any number of customized cost functions can be created from the quadrants of a confusion matrix. Most commonly, practitioners will weight the quadrants to emphasize some errors over others as being particularly unwelcome. If one would like to reduce false alarms, for example, one could weight these twice as much as false dismissals and create a single score based on the custom formula.
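Such a weighted cost is a one-line formula over the quadrant counts. The sketch below uses the example weighting just described (false alarms counted twice as heavily as false dismissals); the weights and counts are illustrative.

```python
# A custom cost function over confusion-matrix quadrants.
# fp = false alarms, fn = false dismissals; lower cost is better.
def custom_cost(fp, fn, w_false_alarm=2.0, w_false_dismissal=1.0):
    return w_false_alarm * fp + w_false_dismissal * fn

# Comparing two hypothetical models under this weighting:
cost_a = custom_cost(fp=10, fn=30)  # 2*10 + 30 = 50
cost_b = custom_cost(fp=25, fn=15)  # 2*25 + 15 = 65
```

Under unweighted error counts the two models would tie (40 errors each); the weighting breaks the tie in favor of model A, which commits fewer of the costlier false alarms.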
For continuous-valued estimation problems, metrics often used for assessing models are R^2, average error, Mean Squared Error (MSE), median error, average absolute error, and median absolute error. In each of these metrics, one first computes the error of an estimate, which is the actual value minus the predicted estimate. The metrics then aggregate these errors over all the records in the data.
Average errors can be useful in determining whether the models are biased toward positive or negative errors. Average absolute errors are useful in estimating the magnitude of the errors (whether positive or negative). Analysts most often examine not only the overall value of the success criterion, but also the entire range of predicted values, by considering scatter plots of actual versus predicted values or actual versus residuals (errors).
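The estimation metrics above can be computed from a single pass over the errors. A minimal sketch with illustrative actual and predicted values:

```python
import statistics

# Error metrics for continuous estimation.
# By the convention above, error = actual - predicted.
def estimation_metrics(actual, predicted):
    errors = [a - p for a, p in zip(actual, predicted)]
    abs_errors = [abs(e) for e in errors]
    return {
        "average_error": statistics.mean(errors),  # sign reveals bias
        "mse": statistics.mean(e * e for e in errors),
        "median_error": statistics.median(errors),
        "average_absolute_error": statistics.mean(abs_errors),
        "median_absolute_error": statistics.median(abs_errors),
    }

m = estimation_metrics(actual=[10, 20, 30, 40], predicted=[12, 18, 33, 39])
# Errors are [-2, 2, -3, 1]: average_error is -0.5 (slight overprediction
# bias), while average_absolute_error is 2.0 (typical error magnitude).
```

The contrast between the signed average (-0.5) and the absolute average (2.0) illustrates the point above: the first measures bias, the second measures magnitude.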
In principle, one can also include rank-ordered metrics such as AUC and Gain as candidates for the success criteria, though they often are not included in data mining software for estimation problems. In these instances, one needs to create a customized success criterion.
Sometimes none of the typical success criteria are sufficient to evaluate predictive models because they do not match the business objective. Consider the invoice fraud example described earlier. Let's assume that the purpose of the model is to identify 100 invoices per month to investigate from the hundreds of thousands of invoices submitted. If one builds a classification model and selects a model that maximizes PCC, we can be fooled into thinking that the best model as assessed by PCC is good, even though none of the top 100 invoices are good candidates for investigation. How is this possible? If there are 100,000 invoices submitted in a month, we are selecting only 0.1 percent of them for investigation. The model could be perfect for 99.9 percent of the population and miss what we care about the most, the top 100.
In situations such as this one, when there are very specific needs for the organization, it is best to consider customized cost functions. In this instance, we want to identify a population of 100 invoices such that it maximizes the chances of these 100 invoices being true alerts (not false alarms). What metric does this? No metric addresses this directly, though ROC curves are close to the idea. Instead, the best way to rank the models is the direct method, that is, pick the model that maximizes the true fraud alert rate in the top 100 invoices of the scored population, ignoring the rest of the population. Data miners should adjust their algorithm settings appropriately to focus the attention of the classifiers on accuracy at the top of the predicted probabilities, such as weighting the cost of false alarms higher than the errors estimating true alerts.
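The direct method described here amounts to scoring the population, keeping only the top N, and measuring the true-alert rate within that group. A minimal sketch, with a tiny illustrative population and N reduced to 3:

```python
# Direct "top-N true alert rate": rank invoices by model score, take the
# N highest, and measure the fraction that are true frauds (label 1).
# The rest of the population is ignored, matching the business objective.
def top_n_alert_rate(scores, labels, n):
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n]
    return sum(label for _, label in top) / len(top)

# Illustrative scores and labels; in the invoice example n would be 100.
rate = top_n_alert_rate(
    scores=[0.9, 0.8, 0.7, 0.6, 0.2],
    labels=[1,   0,   1,   1,   0],
    n=3,
)
# The top 3 scored invoices have labels [1, 0, 1], so rate is 2/3.
```

A model is then selected by computing this rate for each candidate model on held-out data and picking the maximum, regardless of how the models compare on PCC over the full population.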
Other candidates for customized scoring functions include Return On Investment (ROI) and profit, where there is a fixed or variable cost associated with the treatment of a customer or transaction (a record in the data), and a fixed or variable return or benefit if the customer responds favorably. For example, if one is building a customer acquisition model, the cost is typically a fixed cost associated with mailing or calling the individual; the return is the estimated value of acquiring a new customer. For fraud detection, there is a cost associated with investigating the invoice or claim, and a gain associated with the successful recovery of the fraudulent dollar amount.
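For the fraud case, such a profit criterion can be sketched as recoveries minus investigation costs over the "select" group. The cost and dollar figures below are purely illustrative assumptions, not actual agency numbers:

```python
# Profit-style customized score: fixed cost per investigated invoice,
# variable return when the investigation recovers fraudulent dollars.
# `selected` holds (is_fraud, recoverable_amount) pairs for the invoices
# the model places in the "select" (to-be-investigated) group.
def expected_profit(selected, cost_per_case=500.0):
    recovered = sum(amount for is_fraud, amount in selected if is_fraud)
    return recovered - cost_per_case * len(selected)

profit = expected_profit([
    (True, 12_000.0),  # confirmed fraud: dollars recovered
    (False, 0.0),      # false alarm: investigation cost only
    (True, 3_500.0),
])
# 15,500 recovered minus 3 * 500 in costs = 14,000
```

As with the top-N criterion, candidate models would be compared by the profit each one's select group generates, rather than by a generic accuracy measure.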
Note that for many customized success criteria, the actual predicted values are not nearly as important as the rank order of the predicted values. If one computes the cumulative net revenue as a customized cost function associated with a model, the predicted probability may never enter into the final report, except as a means to threshold the population into the "select" group (that is to be treated) and the "nonselect" group.