
Produce a project plan – ensuring a realistic timeline by Keith McCormick


Because data mining is easily confused with related approaches, new practitioners often surprise their colleagues when they suggest that a data mining project will take many weeks. To anyone who equates it with running ad hoc reports on an already identified problem area, the timeline discussed in this section may seem surprising; on a very first project, even the new data miner may not feel fully prepared to defend it. This section explores each of the CRISP-DM phases and examines some considerations for estimating how long each will take. In general, something on the order of 8 to 20 weeks is a good rule of thumb.

Business understanding

There is general agreement among veteran data miners that novices do not spend enough time on business understanding. One of the challenges of this phase is that it involves many voices throughout the enterprise. The junior analyst may not often attend meetings with C-level executives, but during this phase it is always a good idea to work up the organizational chart until you reach the decision maker who will approve the actual deployed results of the data mining project. If the internal beneficiary of the project does not attend at least one planning meeting, the project will probably be deployed as a slide presentation. One hopes it is a slide presentation leavened with considerable insight, but it will still be just a slide presentation. If the project aims higher, it is critical to involve the key decision makers in the actual activity of business understanding, or at the very least to have them sign off on its conclusions before the project begins in earnest.

Because it involves so many players, business understanding cannot be rushed. This is not a phase where the data miner can burn the midnight oil and push forward alone; it is safer to assume that eight hours of progress each day will not be possible. Real progress can only be made in a group setting, and meetings can be hard to schedule. Although the phase may require only two to five days of solid work, it is a good idea to allow two weeks on the calendar to accomplish it. Finally, always remember to allow time to write up the conclusions of each phase and to communicate those results to others. This is not to say that CRISP-DM phases come to an abrupt halt before the next phase can begin; CRISP-DM tends to be highly iterative. Nonetheless, writing a business understanding report before moving on is wise.

Phase | About the phase | Duration
Business understanding | 2 to 5 days of solid work, but difficult to schedule. Give yourself plenty of "calendar time". | 1 to 2 weeks
Data understanding | Duration hinges on whether analysts have direct access to the data, whether they have ready access to IT support, and whether there are delays in getting data. | 2 to 8 weeks
Data preparation | Famously estimated at 70 to 90 percent of the data mining lead's actual work hours. "Clean" data helps but does not eliminate this need. Tread carefully: data preparation can explode your timeline. | 3 to 10 weeks*
Modeling | No model is ever optimal, but diminishing returns kick in early. An experienced modeler can make tremendous progress in a week, as long as there is no residual data preparation. | 1 to 2 weeks
Evaluation | Usually not very time-consuming, unless management needs a walk-through or you want to conduct a "dress rehearsal" on the current data. | 1 week*
Deployment | Answers what form deployment will take, whether it will involve changes to the data warehouse, and whether it will be a complex, real-time deployment. | 1 to 6 weeks

Data understanding

Reasonable veterans of the process might quibble over this distinction, but the general tendency is for data understanding to be largely a group activity, whereas data preparation is often a largely solitary one. Why? When the data miner is an outsider (to the department or to the entire organization), he or she has to seek out allies who know aspects of the organization's data that the data miner does not. Even when a data miner and a subject matter expert sit side by side throughout the process, the most compelling data mining projects always involve some integration of data that has not been attempted before, leaving even the subject matter experts seeking help from others. When one searches only in standard tables that predate the project, there is a considerable risk that whatever is discovered is already baked into the cake; that is, the discoveries may be limited to problems that have been at least partially addressed.

As a result, data understanding will tend to expand to fill the amount of time that you can allocate to it. One tries to gain access to as much relevant data as possible while still staying on schedule. One approach is to identify a true expert within each major department, get an audience with them (with one's lead SME in attendance), and ask what data they have to offer. How long will this take? How many departments are there, and how many SMEs do you have to interview? A key step in the process is to mark the calendar with the arrival date of each piece of data, the contribution of each department. If this is not a factor in your project, or if you have direct access to all the data and do not rely on third parties to get it, the phase will go more quickly.

Once the data is available in its raw form, you will have to explore it. Naturally, 50 variables tend to be quicker to explore than 1,200. In statistics, one usually examines all the univariate and bivariate distributions, but this is often impossible in data mining as the number of variables scales up. Still, some effort must be made to look at the individual pieces before data cleaning, data augmentation, and data integration can begin (during data preparation).
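As a rough illustration only (this is not one of the cookbook's recipes), the kind of first-pass univariate sweep described above might look like the following in Python with pandas; the file and column handling are hypothetical, and the closest Modeler analogue would be a Data Audit node.

```python
# A minimal sketch (not from the book): a first-pass univariate sweep over a
# wide raw extract using pandas. File name and columns are hypothetical; in
# Modeler the Data Audit node plays a similar role.
import pandas as pd

df = pd.read_csv("raw_extract.csv")   # hypothetical raw table from one department

# Numeric fields: one compact summary instead of hundreds of individual charts
print(df.select_dtypes(include="number").describe().T)

# Categorical fields: cardinality and missing-value share, enough to flag
# columns that need a follow-up question for the SMEs
for col in df.select_dtypes(include="object").columns:
    print(f"{col}: {df[col].nunique()} levels, {df[col].isna().mean():.1%} missing")
```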

Don't forget to leave time for:

  • A round of questioning with SMEs during and after data exploration

  • Delays waiting for data to arrive a second time (if problems are found)

  • Documenting what you have found

  • Revisiting business understanding if you conclude that certain data is not available after all (or not available on schedule)

Data preparation

Where does all the time go? If the old adage that 70 to 90 percent of the project hours are spent in activities associated with data preparation is true, where are all of those hours being spent? The origin of this estimate is murky, and its source cannot easily be nailed down; yet even though most veteran data miners acknowledge that there is no hard evidence to back it up, most agree that the percentage is high. If your organization has super-clean data stored in the latest data warehousing technology, will your experience be different, or better? Does Big Data complicate and slow down data preparation?

The trick to understanding where the time goes is that data preparation comes in more than one flavor, and no data warehouse will eliminate all of it. CRISP-DM lists five generic tasks: select, clean, construct, integrate, and format. Data in a data warehouse is there to support routine reporting and the day-to-day operation of the business.

A really solid data warehouse may help with clean and format but will not likely help with select, construct, or integrate. Why? Let's investigate each in turn.

  • Select: Data is usually selected for reasons driven by the goals of the project, and as such will not have been anticipated by the teams in charge of long-term storage. Most data miners agree that sampling is a critical component of data mining; it is not a technology issue, and however vast the data grows, not all of it is relevant. Good modeling often requires balancing and always requires partitioning (training and test partitions). In short, the size of your data warehouse since its inception is not a good indicator of the size of your training data set. Assessing this and choosing the right data to model takes time, and that time must be allocated.

  • Clean: If you have done a good job of building your data warehouse, this step is simplified. The less messy the data, the less time it takes; but make no mistake: few data miners will ever encounter truly clean data in their careers. And even when the data is clean, a null in the data warehouse might need to be a zero in the model, or vice versa. Nonetheless, if the data has been thoroughly cleaned, this particular generic task might take less than a week. If it is not clean, there is almost no limit to the potential delays. One might even conclude that the data mining project must wait until the situation is addressed: fixing the overall problem first, and only then cleaning for the purposes of this particular data mining project.

  • Construct: At the risk of generalizing too much, this is the task that gets data miners into trouble; the time consumer they don't see coming, the destroyer of timelines, the endangerer of model quality. Quite simply, the best variables in one's model probably don't exist at the start of the project. A variable such as previous month spend/rolling 12 months spend is a good example: nothing like it is stored in the data warehouse, yet it can be terribly useful in predicting changes in behavior. The relevance of such variables has to be slowly ferreted out, their formulas crafted, often with much trial and error, and their efficacy demonstrated. Then they have to be paraded in front of the algorithms along with dozens or hundreds of their siblings, awaiting the verdict of dozens of modeling attempts. When the modeling is done, it is often these kinds of variables that populate the top-10 list (a rough sketch of deriving this kind of field follows this list).

  • Integrate: Most data miners anticipate this one: the combining of data tables. In Modeler, it takes the form of merging, appending, and aggregating. One might assume that the queries one needs already exist; probably not. The data miner frequently has to go all the way back to the billing detail to calculate something that was lost in the aggregation done for reporting. In other words, the work often has to be done all over again, because the needs of reporting and the needs of the data miner are almost always different. Doing it over takes time, usually a couple of weeks.

  • Format: How compatible are the formatting decisions you have already made with the needs of your data mining software? Did you anticipate all the variable declarations, data formats, file formats, naming conventions, and so on as they apply to Modeler? Frankly, it is unlikely. There are always little, and not-so-little, conflicts that pop up between the way departments store data and the way data is parsed in Modeler. Modeler is powerful software, but it is not immune to small formatting conflicts. Few data miners would describe this task as fun, but it is usually present at least to a degree. If the data warehouse is in good shape and your infrastructure is fairly current, this task might be no more than a bump in the road, taking relatively little time; one might get away with only a day or so. Data mining as a discipline has matured, and Modeler has matured with it, so this task is less and less of an issue. An experienced user of Modeler spots these issues early and can head most of them off before they happen. Twenty or thirty years ago business users dreaded installing a new printer; in much the same way, formatting compatibility becomes less of an issue with each new release of analytics software.
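To make the construct and integrate tasks above concrete, here is a minimal sketch of deriving the previous month spend/rolling 12 months spend field from billing detail, written in Python with pandas rather than as Modeler nodes. All file, table, and column names are hypothetical; in Modeler this would be a chain of Aggregate, Derive, and Merge nodes.

```python
# Minimal sketch (not the book's example): deriving a "previous month spend /
# rolling 12 months spend" field from billing detail. Names are hypothetical.
import pandas as pd

billing = pd.read_csv("billing_detail.csv", parse_dates=["bill_date"])

# Integrate: roll raw billing detail up to customer x month
monthly = (billing
           .assign(month=billing["bill_date"].dt.to_period("M"))
           .groupby(["customer_id", "month"], as_index=False)["amount"].sum())

last_month = monthly["month"].max()
trailing_12 = monthly[monthly["month"] > last_month - 12]

per_customer = (trailing_12.groupby("customer_id")["amount"]
                .sum().to_frame("rolling_12m_spend"))
prev = monthly[monthly["month"] == last_month].set_index("customer_id")["amount"]
per_customer["prev_month_spend"] = prev.reindex(per_customer.index).fillna(0)

# Construct: the kind of derived field that rarely exists in the warehouse
per_customer["spend_ratio"] = (
    per_customer["prev_month_spend"] / per_customer["rolling_12m_spend"]
)

# Merge the new field onto the customer-level modeling table (also hypothetical)
customers = pd.read_csv("customers.csv")
model_table = customers.merge(per_customer.reset_index(),
                              on="customer_id", how="left")
```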

Modeling

Modeling is the phase that gets all the attention. Many books are dedicated to it. You would think that you need to allocate the vast majority of time to this phase. Not quite!

Tom Khabaza's 8th Law of Data Mining, the Value Law, states:

"The value of data mining results is not determined by the accuracy or stability of predictive models".

In Dorian Pyle's clever article about rules not to follow, the fifth such rule is Find the best algorithms, in which he urges the reader to spend all of their available time on modeling.

Will Dwinell's take on this in a blog post, quoted later in this section, is a third data miner's voice in the chorus. Certainly all the authors of this cookbook would agree that poor workmanship in the earlier phases cannot be compensated for by spending extra time on modeling. So how do you know how much time to spend? One to two weeks. There you go: a rare direct recommendation. If you have more time to spend than that, consider spending it in another phase, particularly data understanding and data preparation.

How should the time be spent? Expect to build several dozen models: different algorithms, different settings, balanced and unbalanced data, and variations on the target variable. The details could fill a book, and have, but you will hit diminishing returns within that one- to two-week time frame, often in just the first week. By then, you will be fighting for the last 1 percent of accuracy. In time, with experience, you may even arrive at the modeling phase with a sense of relief, knowing that the toughest part is over.
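For readers who want to see what "several dozen models" looks like outside the Modeler canvas, the following is a minimal scikit-learn sketch of the same habit: several algorithms and settings scored on a held-out partition. The data is synthetic and the candidate list is illustrative; in Modeler, the Auto Classifier node automates much of this loop.

```python
# Minimal sketch (not from the book): trying several algorithms and settings
# and comparing them on a held-out partition. Synthetic data stands in for a
# real prepared table; candidate models and settings are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

candidates = {
    "logistic":        LogisticRegression(max_iter=1000),
    "tree_depth5":     DecisionTreeClassifier(max_depth=5),
    "tree_depth10":    DecisionTreeClassifier(max_depth=10),
    "forest":          RandomForestClassifier(n_estimators=200, random_state=0),
    "forest_balanced": RandomForestClassifier(n_estimators=200, random_state=0,
                                              class_weight="balanced"),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:16s} test AUC = {auc:.3f}")
```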

Evaluation

Will Dwinell wrote a blog post on this subject entitled Data Mining and Predictive Analytics:

Within the time allotted for any empirical modeling project, the analyst must decide how to allocate time for various aspects of the process. As is the case with any finite resource, more time spent on this means less time spent on that. I suspect that many modelers enjoy the actual modeling part of the job most. It is easy to try "one more" algorithm: Already tried logistic regression and a neural network? Try CART next.

Of course, more time spent on the modeling part of this means less time spent on other things. An important consideration for optimizing model performance, then, is: Which tasks deserve more time, and which less?

Experimenting with modeling algorithms at the end of a project will no doubt produce some improvements, and it is not argued here that such efforts be dropped. However, work done earlier in the project establishes an upper limit on model performance. I suggest emphasizing data clean-up (especially missing value imputation) and creative design of new features (ratios of raw features, etc.) as being much more likely to make the model's job easier and produce better performance.

A fair question is, what is the difference between the assessment task of modeling and the evaluation phase? Revisiting the 8th Law of Data Mining is in order. Measures such as lift, AUC, overall accuracy, and so on, are just that: tools. They help us sort through the dozens of model variations that due diligence requires us to examine during the modeling phase. When we get to the evaluation phase, however, we must remind ourselves that companies do not earn ROI on lift (certainly not directly, at least). What was the purpose of the data mining model in the first place? If it was to reduce marketing expense, then how much does the model actually reduce marketing expense? In other words, in this phase we must translate the data mining question back into the language of the business question and evaluate the model's efficacy in business terms.
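As a purely illustrative arithmetic sketch (all figures are invented, not from the book), translating a model's gains back into the marketing-expense question might look like this:

```python
# Minimal sketch (not from the book): turning model targeting into the business
# question "how much marketing expense does this save?". All figures are
# made-up assumptions for illustration only.
contact_cost   = 2.00      # cost per piece of direct mail
value_per_sale = 120.00    # margin per responder
base_response  = 0.02      # response rate when mailing everyone
population     = 100_000

# Mailing everyone
baseline_profit = (population * base_response * value_per_sale
                   - population * contact_cost)

# Mailing only the top 30% scored by the model, assumed to capture 70% of responders
targeted_n     = int(population * 0.30)
captured_share = 0.70
targeted_profit = (population * base_response * captured_share * value_per_sale
                   - targeted_n * contact_cost)

print(f"Mail everyone : {baseline_profit:12,.0f}")
print(f"Mail top 30%  : {targeted_profit:12,.0f}")
print(f"Expense saved : {(population - targeted_n) * contact_cost:12,.0f}")
```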

Time to get that internal customer back into the meeting room. Only with the help of management can we choose from among the semifinalists: the models that emerged from the modeling phase showing promise on technical criteria. It is fine to shortlist models on the basis of a criterion that is one generation removed from the business criteria, because full evaluation can be time-consuming and might require the analyst to collaborate with the marketing team, the finance team, or the operations team to get the numbers straight. One does not do that analysis on 50 or 100 models. But once the number of promising models has been narrowed to somewhere between three and eight, it is time for the model's actual beneficiaries within the business to help choose the final models. Although it is possible for this to take just a week or so, it might also take several weeks. Often evaluation takes the form of a "dress rehearsal": running the business on two parallel tracks for a month or more. The first track is the way things have always been done, and the second is informed by the chosen model.

Perhaps it is run in an experimental region or on a randomly chosen group of customers, but it is the ultimate test. How you decide to complete this phase depends on the stakes and the project's budget, so naturally the duration can vary widely. Most internal customers, however, will recognize that by the time you have entered this phase, the potential for savings or revenue gain has already arrived, and the analysis team can often breathe a sigh of relief at this point, surrendering part of the responsibility to other parties. If you are working in an organization that has access to IBM Collaboration and Deployment Services (C&DS) or IBM Decision Management, the role of those packages will wax as the role of Modeler begins to wane.

Deployment

Deployment truly deserves its own book, perhaps even its own cookbook. It is even more difficult to generalize about this phase than the others, because for the other phases we can assume that, as a reader of this book, you are using Modeler for data preparation and modeling. We cannot assume very much at all about how you plan to deploy. Will you deploy in real time or in batch? Will you have the IBM deployment resources mentioned at the end of the Evaluation section? If you are deploying directly from the Modeler client, a week is probably realistic.

What is involved? In theory, all you are doing is changing the source data and running the exact same stream on the current data; in practice, however, you will not want to do that. You will want to revisit your Modeler streams and ensure that only the transformations necessary at deployment (and there will be lots of them) are present. In other words, remove the Derive nodes that represent inputs that were considered but rejected. You might also discover some efficiencies that escaped your attention during modeling, when speed was not as important as accuracy. In deployment, accuracy has been established, but speed needs to be maximized.
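A minimal sketch of that stripped-down scoring run, expressed in Python rather than as a Modeler stream; the saved model, file names, and feature list are all hypothetical:

```python
# Minimal sketch (not from the book): batch scoring current data with only the
# transformations the final model needs. Model file, input files, and feature
# names are hypothetical; in Modeler this is the pared-down deployment stream.
import pandas as pd
import joblib

model = joblib.load("churn_model.pkl")              # hypothetical fitted pipeline
current = pd.read_csv("customers_current.csv")      # this month's data

# Recreate only the derived fields the final model actually uses; the rejected
# candidate inputs from the modeling phase have no place here
current["spend_ratio"] = current["prev_month_spend"] / current["rolling_12m_spend"]
features = ["spend_ratio", "tenure_months", "num_products"]   # illustrative

current["score"] = model.predict_proba(current[features])[:, 1]
current[["customer_id", "score"]].to_csv("scores_current.csv", index=False)
```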

You may have to call a meeting to discuss topics that might not sound like they are in the analyst's bailiwick, such as interface design or executive dashboards. This is all to the good, frankly, because it ensures that the model will be used. Isn't that the goal? If more elaborate forms of deployment are considered, such as real-time deployment, the deployment phase can equal in time and scope all the other phases combined. The estimates of data preparation being 70 to 90 percent of the work did not envision the intricacies of real-time deployment. It is almost like a second project, but it can absolutely be worth it, because the possibilities for ROI are compelling.

Deployment is a strong candidate for second prize, after business understanding, as the phase in which far too many data miners invest too little. No project can really achieve anything without either. Without business understanding, you run a real risk of solving the wrong problem; without this final phase, an impressive model that never rises above the status of a slide deck remains inferior to a competent model that is actually deployed.