IBM SPSS Modeler Essentials

By: Jesus Salcedo, Keith McCormick

Overview of this book

IBM SPSS Modeler allows you to quickly and efficiently use predictive analytics and gain insights from your data. With almost 25 years of history, Modeler is the most established and comprehensive Data Mining workbench available. Since it is popular in corporate settings, widely available in university settings, and highly compatible with all the latest technologies, it is the perfect way to start your Data Science and Machine Learning journey. This book takes a detailed, step-by-step approach to introducing data mining using the de facto standard process, CRISP-DM, and Modeler’s easy-to-learn “visual programming” style. You will learn how to read data into Modeler, assess data quality, prepare your data for modeling, find interesting patterns and relationships within your data, and export your predictions. Using a single case study throughout, this intentionally short and focused book sticks to the essentials. The authors have drawn upon their decades of experience teaching thousands of new users to choose those aspects of Modeler that you should learn first, so that you get off to a good start using proven best practices. This book provides an overview of various popular data modeling techniques and presents a detailed case study of how to use CHAID, a decision tree model. Assessing a model’s performance is as important as building it; this book will also show you how to do that. Finally, you will see how you can score new data and export your predictions. By the end of this book, you will have a firm understanding of the basics of data mining and how to effectively use Modeler to build predictive models.

Preface

We are proud to present this intentionally short book on the essentials of using IBM SPSS Modeler. Data science and predictive analytics are hot topics right now, and this book might be perceived as being inspired by these new and exciting trends. While we certainly hope to attract a variety of readers, including young practitioners who are new to the field, in actuality the contents of this book have been shaped by a variety of forces that have been unfolding over a period of approximately 25 years.

In 1992, Colin Shearer and his colleagues, then at ISL, were finding, as Colin himself described it, that data mining projects involved a lot of hard work, and that most of that work was boring. Specifically, to get to the rewarding tasks of finding patterns using the modeling algorithms you had to do a lot of repetitive preparatory work. It was this observation—that virtually all data mining projects share some of the same routine operations—that gave birth to the idea of the first data mining workbench, Clementine (now called IBM SPSS Modeler). The software was designed to make the repetitive tasks as quick and easy as possible. It is that same observation that is at the heart of this book. We have carefully chosen those tasks that apply to nearly all Modeler projects. For that reason, this book is decidedly not encyclopedic, and we sincerely hope that you can outgrow this book in short order and can then move on to more advanced features of Modeler and explore its powerful collection of features.

Another inspiration for this book is the history of Clementine documentation and training from the early 1990s to the present. Given the motivation behind the software, early documentation often focused on short, simple examples that could be carefully followed and then imitated in real-world examples, even though the real-world applications were always much more complex. Some of the earliest examples of this were the original Clementine Application Templates (ISL CATs) from the 1990s, which have evolved so much as to be unrecognizable.

The two of us first encountered Modeler as members of the SPSS community in the period between SPSS's acquisition of ISL (1998) and IBM's acquisition of SPSS (2009). We were both extensively involved in Modeler training for SPSS. Jesus was the training curriculum lead for IBM SPSS at one point after the acquisition. It soon became clear that training in Modeler was going to evolve after the acquisition and more and more entities were going to be involved in training. Some years later, we found ourselves working together at an IBM partner and built a complete SPSS Statistics and SPSS Modeler curriculum for that company. We have spent hundreds of hours discussing Modeler training and thousands of hours conducting Modeler training. We are passionate about how to create the ideal experience for new users of Modeler. We anticipate that the readers of this book will be brand new users engaged in self-study, students in classes that use Modeler, or participants in short courses and seminars such as the ones that we have taught for years.

In 2010, also in response to the changing marketplace after the IBM acquisition, Tom Khabaza (data mining pioneer and one of the earliest members of the ISL/Clementine team) and Keith started a dialog about a possible rookie book about SPSS Modeler. We knew that Modeler might be reaching new audiences. We had spirited discussions and produced a detailed outline, but the project never quite got off the ground. In 2011, without any knowledge of our beginner's guide concept, Packt reached out to Keith and wanted him to recruit others to write a more advanced Modeler book in a cookbook format. At first, Tom and Keith resisted because we thought that a beginner's guide was badly needed and we had an existing plan. However, it all worked out in the end. We combined forces with almost a dozen Modeler experts, including Colin Shearer, who kindly wrote the foreword. Jesus and other experts we knew joined as either co-authors or technical reviewers. The success of the IBM SPSS Modeler Cookbook (2013) demonstrated that more advanced content was also needed.

This book would have been completely different if it had been written before the cookbook. Knowing that the cookbook exists has allowed us to stick to our goal of writing a quick and easy read with only the absolute essentials. It has been designed to dovetail nicely with the cookbook and serve as a kind of prequel. In designing this book, we were quite consciously aware that many people who read this book might use our IBM SPSS Modeler Essentials Packt video course as a companion. Since we tried to prioritize the absolute essentials in both, they necessarily cover similar ground. However, we chose different case study datasets for each, precisely to support the kind of learning that would come from working through both. We truly believe that they complement each other.

In that spirit, we have chosen a single case study to use throughout the book. It is just complex enough to suit our purposes, but clearly falls short of the complexity of a real-world example. This is a conscious decision. Work through this book. It is designed to be an experience, and not just a read, so follow it step by step from cover to cover. While we hope this book may also be useful to refer to later, we are trying to craft a positive (and easy) first-time experience with Modeler. Also, although we offer a sufficiently complex dataset to show the essentials, we do not attempt to fashion an elaborate scenario to place the dataset into a business context. This is also a conscious decision. We felt that a book on the essentials of Modeler should be a much more point-and-click book than a theory book. So if you want a book that emphasizes theory over practice, this may not be the best choice to begin your journey. We do rehearse the basic steps behind how modeling works in Modeler, but given the book's length, there is simply no room to discuss all the algorithms and the theory behind them in this book. We spend virtually all of the book pages on Data Understanding, Data Preparation, Modeling, and Model Assessment, and spend virtually no pages on Business Understanding, Business Evaluation, and Deployment. Having said that, we care deeply about helping the reader understand why they are performing each step, and will always place the point-and-click steps in a proper context. That is why we are so carefully selective about how many steps, and which steps, we include in this short book.

IBM SPSS Modeler enables you to explore data, identify important relationships that you can leverage, and build predictive models quickly, allowing your organization to base its decisions purely on the insights obtained from your data. It is our hope that you enjoy mining your data with Modeler and that this book serves as your guide to get you started on this journey. We sincerely hope that you enjoy learning from this book as much as we have enjoyed teaching its content.

What this book covers

Chapter 1, Introduction to Data Mining, introduces the notion of data mining and the CRISP-DM process model. You will learn what data mining is, why you would want to use it, and some of the types of questions you could answer with data mining.

Chapter 2, The Basics of Using IBM SPSS Modeler, introduces the Modeler graphical user interface. You will learn where different components of the program are located, how to work with nodes and create streams, and how to use various help options.

Chapter 3, Importing Data into Modeler, introduces the general data structure that is used in Modeler. You will learn how to read and display data, and you will be introduced to the concepts of measurement level and field roles.

Chapter 4, Data Quality and Exploration, focuses on the Data Understanding phase of data mining. We will spend some time exploring our data and assessing its quality. This chapter introduces the Data Audit node, which is used to explore and assess data. You will see this node's options and learn how to look over its results. You will also be introduced to the concept of missing data and will be shown ways to address it.

Chapter 5, Cleaning and Selecting Data, introduces the Data Preparation phase, so we can fix some of the problems that were previously identified during the Data Understanding phase. You will be shown how to select the appropriate cases for analysis, how to sort cases to get a better feel for the data, how to identify and remove duplicate cases, and how to reclassify categorical values to address various types of issues.

Chapter 6, Combining Data Files, continues with the Data Preparation phase of data mining by filtering fields and combining different types of data files.

Chapter 7, Deriving New Fields, introduces the Derive node. The Derive node can perform different types of calculations so that users can extract more information from the data. These additional fields can then provide insights that may not have been apparent. In this chapter, you will learn that the Derive node can create fields as formulas, flags, nominals, or conditionals.

Chapter 8, Looking for Relationships between Fields, focuses on discovering simple relationships between an outcome variable and a predictor variable. You will learn how to use several statistical and graphing nodes to determine which fields are related to each other. Specifically, you will learn to use the Distribution and Matrix nodes to assess the relationship between two categorical variables. You will also learn how to use the Histogram and Means nodes to identify the relationship between categorical and continuous fields. Finally, you will be introduced to the Plot and Statistics nodes to investigate relationships between continuous fields.

Chapter 9, Introduction to Modeling Options in IBM SPSS Modeler, introduces the different types of models available in Modeler and then provides an overview of the predictive models. Readers will also be introduced to the Partition node so that they can create Training and Testing datasets.

Chapter 10, Decision Tree Models, introduces readers to decision tree theory. It then provides an overview of the CHAID model so that readers become familiar with the theory, dialogs, and results of this model.

Chapter 11, Model Assessment and Scoring, covers what to do once a model has been built. This chapter discusses different ways of assessing the results of a model. Readers will also learn how to score new data and how to export these predictions.
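
For readers who think more easily in code than in dialogs, the following is a rough plain-Python analogy for three of the operations named in the chapter summaries above: removing duplicate cases (Chapter 5), deriving formula, flag, and conditional fields (Chapter 7), and partitioning cases into Training and Testing sets (Chapter 9). It is only a sketch under stated assumptions. It does not use Modeler or any Modeler API, and the records, field names, thresholds, and function names are all hypothetical and invented for illustration.

```python
import random

# Hypothetical toy records; field names and values are invented for illustration.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 61, "income": 48000},
    {"id": 2, "age": 61, "income": 48000},  # exact duplicate of the previous case
    {"id": 3, "age": 25, "income": 31000},
    {"id": 4, "age": 47, "income": 75000},
]


def remove_duplicates(rows):
    """Keep the first occurrence of each id (compare Chapter 5)."""
    seen = set()
    unique = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(row)
    return unique


def derive_fields(row):
    """Add a formula, a flag, and a conditional field (compare Chapter 7)."""
    out = dict(row)
    out["income_k"] = row["income"] / 1000  # formula: computed from another field
    out["senior"] = row["age"] >= 60  # flag: true/false (assumed threshold)
    out["age_band"] = "under 40" if row["age"] < 40 else "40 and over"  # conditional
    return out


def partition(rows, train_share=0.5, seed=42):
    """Randomly label each case as Training or Testing (compare Chapter 9)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    labelled = []
    for row in rows:
        label = "Training" if rng.random() < train_share else "Testing"
        labelled.append(dict(row, partition=label))
    return labelled


prepared = partition([derive_fields(r) for r in remove_duplicates(records)])
for row in prepared:
    print(row)
```

In Modeler itself, each of these steps is a node that you configure through dialogs and connect in a stream, so no code like this is required to follow the book.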

What you need for this book

This book introduces students to the steps of data analysis. Students do not need to be experienced in analyzing data; however, an introductory statistics or data mining course would be helpful, since this book emphasizes the point-and-click operations in Modeler rather than statistical or data mining theory. We will carefully cover theory as needed to help you understand why we are performing each of the steps in the case study, so you can safely start with this book as your very first book. However, a single case study will not provide a complete theoretical context for data mining and we will only use a single modeling algorithm in any detail.

Software demonstrations will be performed on IBM SPSS Modeler; thus, having access to Modeler is critical to enable you to follow along with step-by-step instructions. While we recognize that you might make a first pass at this content away from your computer, you should try each and every step in the book in Modeler. We have carefully narrowed the material down to the essentials. You should find that every step will serve you well when you apply what you've learned to your own data and your own situation. Since we've kept it to the basics, you should have no problem completing the entire book during the time period of a trial license of Modeler, if you do not have permanent access to Modeler. If you encounter this material in a university setting, you may be eligible for a student version.

If you don't have Modeler yet, you might want to consider watching the Packt IBM SPSS Modeler Essentials Video first, then installing the trial version, and then working through the book step by step. Since the two case studies are different, this will provide excellent reinforcement of the material. You will see virtually every concept twice, but with different datasets.

Who this book is for

This book is ideal for those who are new to SPSS Modeler and want to start using it as quickly as possible, without going into too much detail. An understanding of basic data mining concepts will be helpful to get the best out of the book.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "We can include other contexts through the use of the include directive".

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "In order to download new modules, we will go to Files | Settings | Project Name | Project Interpreter."

Note

Warnings or important notes appear like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply email [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files emailed directly to you. You can download the code files by following these steps:

  1. Log in or register to our website using your email address and password.
  2. Hover the mouse pointer on the SUPPORT tab at the top.
  3. Click on Code Downloads & Errata.
  4. Enter the name of the book in the Search box.
  5. Select the book for which you're looking to download the code files.
  6. Choose from the drop-down menu where you purchased this book.
  7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

  • WinRAR / 7-Zip for Windows
  • Zipeg / iZip / UnRarX for Mac
  • 7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/IBM-SPSS-Modeler-Essentials. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book

We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/IBMSPSSModelerEssentials_ColorImages.pdf.

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.

Questions

If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.