Welcome to the second edition of Practical Data Science Cookbook. The positive feedback the first edition received, and the usefulness readers found in it, made a second edition possible. When Packt asked me to co-author the second edition, I looked through some of its reviews across the web and quickly saw both the reasons for the book's popularity and its few weaknesses. The current edition therefore retains what readers liked and removes the pain points as far as possible. Two new chapters, Chapter 10, Forecasting New Zealand Overseas Visitors, and Chapter 11, German Credit Data Analysis, have been added to enhance the usefulness of the book.
We live in the age of data. As increasing amounts are generated each year, the need to analyze and create value from this asset is more important than ever. Companies that know what to do with their data and how to do it well will have a competitive advantage over companies that don't. Due to this, there will be an increasing demand for people who possess both the analytical and technical abilities to extract valuable insights from data and the business acumen to create valuable and pragmatic solutions that put these insights to use. This book provides multiple opportunities to learn how to create value from data through a variety of projects that run the spectrum of types of contemporary data science projects. Each chapter stands on its own, with step-by-step instructions that include screenshots, code snippets, and more detailed explanations where necessary and with a focus on process and practical application. The goal of this book is to introduce the data science pipeline, show you how it applies to a variety of different data science projects, and get you comfortable enough to apply it in future to projects of your own. Along the way, you'll learn different analytical and programming lessons, and the fact that you are working through an actual project while learning will help cement these concepts and facilitate your understanding of them.
Chapter 1, Preparing Your Data Science Environment, introduces the data science pipeline and helps you get your data science environment properly set up with instructions for the Mac, Windows, and Linux operating systems. This chapter is a guideline for setting up the environment for R and Python on the preceding platforms.
Chapter 2, Driving Visual Analysis with Automobile Data with R, takes you through the process of analyzing and visualizing automobile data to identify trends and patterns in fuel efficiency over time. The chapter will give you a taste of acquisition, exploration, munging, analysis, and communication. The concepts will be implemented in R.
Chapter 3, Creating Application-Oriented Analyses Using Tax Data and Python, shows you how to use Python to transition your analyses from one-off, custom efforts to reproducible and production-ready code using income distribution data as the base for the project.
Chapter 4, Modeling Stock Market Data, shows you how to build your own stock screener and use moving averages to analyze historical stock prices. You will learn how to acquire, summarize, clean, and generate relative evaluations of data.
Chapter 5, Visually Exploring Employment Data, shows you how to obtain employment and earnings data from the Bureau of Labor Statistics and conduct geospatial analysis at different levels with R. The same will be implemented using Python. The focus of this chapter is on the transformation, manipulation, and visualization of data.
Chapter 6, Driving Visual Analyses with Automobile Data, mirrors the automobile data analyses and visualizations in Chapter 2, Driving Visual Analysis with Automobile Data with R, but does so using the powerful programming language, Python. It focuses on the implementation of the analysis model using Python.
Chapter 7, Working with Social Graphs, shows you how to build, visualize, and analyze a social network that consists of comic book character relationships. You will also see the R and Python implementation.
Chapter 8, Recommending Movies at Scale (Python), walks you through building a movie recommender system with Python. You will also learn how to use collaborative filtering to implement a predictive model, with code in R and Python.
Chapter 9, Harvesting and Geolocating Twitter Data (Python), shows you how to connect to the Twitter API and plot the geographic information contained in profiles. You will also learn how to use RESTful APIs for text mining.
Chapter 10, Forecasting New Zealand Overseas Visitors, explains how to create time series objects and describes various methods to visualize time series data. You will also learn how to build an appropriate model for the data and identify if the data has any trends and seasonal components.
Chapter 11, German Credit Data Analysis, demonstrates Exploratory Data Analysis (EDA) along with a few basic tree methods and random forests. You will learn how to apply EDA, tree-based methods, and random forests to a particular dataset.
For this book, you will need a computer with access to the Internet and the ability to install the open source software needed for the projects. The primary software we will be using consists of the R and Python programming languages, with a myriad of freely available packages and libraries. Installation instructions are in the first chapter.
This book is intended for aspiring data scientists who want to learn data science and numerical programming concepts through hands-on, real-world projects. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of real-world data science projects and the programming examples in R and Python.
In this book, you will find several headings that appear frequently (Getting ready, How to do it, How it works, There's more, and See also). To give clear instructions on how to complete a recipe, we use these sections as follows.
Getting ready: This section tells you what to expect in the recipe, and describes how to set up any software or any preliminary settings required for the recipe.
How it works: This section usually consists of a detailed explanation of what happened in the previous section (How to do it).
There's more: This section consists of additional information about the recipe that makes the reader more knowledgeable about it.
In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Create a new user for JIRA in the database and grant the user access to the jiradb database we just created using the following command:"
A block of code is set as follows:
<Context path="/jira" docBase="${catalina.home}/atlassian-jira" reloadable="false" useHttpOnly="true">
Any command-line input or output is written as follows:
mysql -u root -p
New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "Select System info from the Administration panel."
Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail [email protected], and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.
Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.
You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. You can download the code files by following these steps:
- Log in or register to our website using your e-mail address and password.
- Hover the mouse pointer on the SUPPORT tab at the top.
- Click on Code Downloads & Errata.
- Enter the name of the book in the Search box.
- Select the book for which you're looking to download the code files.
- Choose from the drop-down menu where you purchased this book from.
- Click on Code Download.
You can also download the code files by clicking on the Code Files button on the book's webpage at the Packt Publishing website. This page can be accessed by entering the book's name in the Search box. Please note that you need to be logged in to your Packt account. Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:
- WinRAR / 7-Zip for Windows
- Zipeg / iZip / UnRarX for Mac
- 7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Practical-Data-Science-Cookbook-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
We also provide you with a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from https://www.packtpub.com/sites/default/files/downloads/PracticalDataScienceCookbookSecondEditon_ColorImages.pdf.
Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy. Please contact us at [email protected] with a link to the suspected pirated material. We appreciate your help in protecting our authors and our ability to bring you valuable content.
If you have a problem with any aspect of this book, you can contact us at [email protected], and we will do our best to address the problem.