Book Image

Practical Data Science Cookbook

By : Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta
Book Image

Practical Data Science Cookbook

By: Tony Ojeda, Sean Patrick Murphy, Benjamin Bengfort, Abhijit Dasgupta

Overview of this book

<p>As increasing amounts of data is generated each year, the need to analyze and operationalize it is more important than ever. Companies that know what to do with their data will have a competitive advantage over companies that don't, and this will drive a higher demand for knowledgeable and competent data professionals.</p> <p>Starting with the basics, this book will cover how to set up your numerical programming environment, introduce you to the data science pipeline (an iterative process by which data science projects are completed), and guide you through several data projects in a step-by-step format. By sequentially working through the steps in each chapter, you will quickly familiarize yourself with the process and learn how to apply it to a variety of situations with examples in the two most popular programming languages for data analysis—R and Python.</p>
Table of Contents (18 chapters)
Practical Data Science Cookbook
Credits
About the Authors
About the Reviewers
www.PacktPub.com
Preface
Index

Understanding the data


Understanding your data is critical to all data-related work. In this recipe, we acquire and take a first look at the data that we will be using to build our recommendation engine.

Getting ready

To prepare for this recipe, and the rest of the chapter, download the MovieLens data from the GroupLens website of the University of Minnesota. You can find the data at http://grouplens.org/datasets/movielens/.

In this chapter, we will use the smaller MoveLens 100k dataset (4.7 MB in size) in order to load the entire model into the memory with ease.

How to do it…

Perform the following steps to better understand the data that we will be working with throughout this chapter:

  1. Download the data from http://grouplens.org/datasets/movielens/. The 100K dataset is the one that you want (ml-100k.zip).

  2. Unzip the downloaded data into the directory of your choice.

  3. The two files that we are mainly concerned with are u.data, which contains the user movie ratings, and u.item, which contains movie...