Data Science Using Python and R

By: Chantal D. Larose, Daniel T. Larose

Overview of this book

Data science is hot. Bloomberg named data scientist the ‘hottest job in America’. Python and R are the top two open-source data science tools, and with them you can produce hands-on solutions to real-world business problems using state-of-the-art techniques. Each chapter in the book presents step-by-step instructions and walkthroughs for solving data science problems using Python and R. You’ll learn how to prepare data, perform exploratory data analysis, and prepare to model the data. As you progress, you’ll explore what decision trees are and how to use them. You’ll also learn about model evaluation, misclassification costs, naïve Bayes classification, and neural networks. The later chapters provide comprehensive information about clustering, regression modeling, dimension reduction, and association rule mining. The book also sheds light on exciting new topics, such as random forests and general linear models. It emphasizes data-driven error costs to enhance profitability, helping you avoid the common pitfalls that may cost a company millions of dollars. By the end of this book, you’ll have enough knowledge and confidence to start providing solutions to data science problems using R and Python.

DATA SCIENCE USING PYTHON AND R

Why this Book is Needed

Reason 1. Data Science is Hot. Really hot. Bloomberg called data scientist “the hottest job in America.”1 Business Insider called it “The best job in America right now.”2 Glassdoor.com rated it the best job in the world in 2018 for the third year in a row.3 The Harvard Business Review called data scientist “The sexiest job in the 21st century.”4

Reason 2. Top Two Open‐source Tools. Python and R are the top two open‐source data science tools in the world.5 Analysts and coders from around the world work hard to build analytic packages that Python and R users can then apply, free of charge.

Data Science Using Python and R will awaken your expertise in this cutting‐edge field using the most widespread open‐source analytics tools in the world. In Data Science Using Python and R, you will find step‐by‐step hands‐on solutions of real‐world business problems, using state‐of‐the‐art techniques. In short, you will learn data science by doing data science.

Written for Beginners and Non‐Beginners Alike

Data Science Using Python and R is written for the general reader, with no previous analytics or programming experience. We know that the information‐age economy is making many English majors and History majors retool to take advantage of the great demand for data scientists.6 This is why we provide the following materials to help those who are new to the field hit the ground running.

  • An entire chapter dedicated to learning the basics of using Python and R, for beginners. Which platform to use. Which packages to download. Everything you need to get started.
  • An appendix dedicated to filling in any holes you might have in your introductory data analysis knowledge, called Data Summarization and Visualization.
  • Step‐by‐step instructions throughout. Every instruction for every action.
  • Every chapter has Exercises, where you may check your understanding and progress.

Those with analytics or programming experience will enjoy having a one‐stop‐shop for learning how to do data science using both Python and R. Managers, CIOs, CEOs, and CFOs will enjoy being able to communicate better with their data analysts and database analysts. The emphasis in this book on accurately accounting for model costs will help everyone uncover the most profitable nuggets of knowledge from the data, while avoiding the potential pitfalls that may cost your company millions of dollars.

Data Science Using Python and R covers exciting new topics, such as the following:

  • Random Forests,
  • General Linear Models, and
  • Data‐driven error costs to enhance profitability.

All of the many data sets used in the book are freely available on the book series website: DataMiningConsultant.com.

Data Science Using Python and R as a Textbook

Data Science Using Python and R naturally fits the role of textbook for a one‐semester course or two‐semester sequence of courses in introductory and intermediate data science. Faculty instructors will appreciate the exercises at the end of every chapter, totaling over 500 exercises in the book. There are three categories of exercises, from testing basic understanding toward more hands‐on analysis of new and challenging applications.

  • Clarifying the Concepts. These exercises test the students' basic understanding of the material, to make sure the students have absorbed what they have read.
  • Working with the Data. These applied exercises ask the student to work in Python and R, following the step‐by‐step instructions that were presented in the chapter.
  • Hands‐on Analysis. Here is the real meat of the learning process for the students, where they apply their newly found knowledge and skills to uncover patterns and trends in new data sets. Here is where the students' expertise is challenged, in near real‐world conditions. More than half of the exercises in the book consist of Hands‐on Analysis.

The following supporting materials are also available to faculty adopters of the book at no cost.

  • Full solutions manual, providing not just the answers, but how to arrive at the answers.
  • PowerPoint presentations of each chapter, so that you may help the students understand the material, rather than just assigning them to read it.

To obtain access to these materials, contact your local Wiley representative and ask them to email the authors confirming that you have adopted the book for your course.

Data Science Using Python and R is appropriate for advanced undergraduate or graduate‐level courses. No previous statistics, computer programming, or database expertise is required. What is required is a desire to learn.

How the Book is Structured

Data Science Using Python and R is structured around the Data Science Methodology.

The Data Science Methodology is a phased, adaptive, iterative approach to the analysis of data, within a scientific framework.

  1. Problem Understanding Phase. First, clearly enunciate the project objectives. Then, translate these objectives into the formulation of a problem that can be solved using data science.
  2. Data Preparation Phase. Data cleaning/preparation is probably the most labor‐intensive phase of the entire data science process.
    • Covered in Chapter 3: Data Preparation.
  3. Exploratory Data Analysis Phase. Gain insights into your data through graphical exploration.
    • Covered in Chapter 4: Exploratory Data Analysis.
  4. Setup Phase. Establish baseline model performance. Partition the data. Balance the data, if needed.
    • Covered in Chapter 5: Preparing to Model the Data.
  5. Modeling Phase. The core of the data science process. Apply state‐of‐the‐art algorithms to uncover some seriously profitable relationships lying hidden in the data.
    • Covered in Chapters 6 and 8–14.
  6. Evaluation Phase. Determine whether your models are any good. Select the best‐performing model from a set of competing models.
    • Covered in Chapter 7: Model Evaluation.
  7. Deployment Phase. Interface with management to adapt your models for real‐world deployment.
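
To make the Setup, Modeling, and Evaluation phases above a little more concrete, here is a minimal Python sketch using only the standard library. It is not taken from the book: the toy data, the function names (partition, baseline_accuracy, fit_threshold, accuracy), and the one-variable threshold rule are all assumptions chosen for illustration, and the book's own chapters rely on dedicated analytics packages instead.

```python
import random
from collections import Counter


def partition(records, test_fraction=0.25, seed=7):
    """Setup Phase: shuffle a copy of the records, then split into train and test."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def baseline_accuracy(labels):
    """Setup Phase: accuracy obtained by always predicting the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def fit_threshold(train):
    """Modeling Phase (illustrative only): learn a single income cutoff."""
    highest_no = max(x for x, y in train if not y)
    lowest_yes = min(x for x, y in train if y)
    cut = (highest_no + lowest_yes) / 2
    return lambda x: x >= cut


def accuracy(model, records):
    """Evaluation Phase: fraction of records the model classifies correctly."""
    correct = sum(1 for x, y in records if model(x) == y)
    return correct / len(records)


# Hypothetical toy data: (income in $1000s, responded to the offer?)
data = [(income, income >= 70) for income in range(20, 100, 2)]

train, test = partition(data)
model = fit_threshold(train)

print("Baseline accuracy on training labels:", baseline_accuracy([y for _, y in train]))
print("Model accuracy on held-out test set:", accuracy(model, test))
```

The point of the sketch is the ordering: the baseline and the train/test partition are fixed before any model is fit, so that the Evaluation Phase compares the model against the baseline on data it has not seen.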

Notes