Statistics for Data Science

Overview of this book

Data science is an ever-evolving field that is growing in popularity at an exponential rate. It draws on techniques and theories from statistics, computer science, and, most importantly, machine learning, databases, data visualization, and so on. This book takes you through an entire journey of statistics, from knowing very little to becoming comfortable using various statistical methods for data science tasks. It starts off with simple statistics and then moves on to the statistical methods used in data science algorithms. The R programs for statistical computation are clearly explained along with their logic. You will come across various mathematical concepts, such as variance, standard deviation, probability, matrix calculations, and more. You will learn only what is required to implement statistics in data science tasks such as data cleaning, mining, and analysis. You will learn the statistical techniques required to perform tasks such as linear regression, regularization, model assessment, boosting, SVMs, and working with neural networks. By the end of the book, you will be comfortable performing various statistical computations for data science programmatically.

Advantages of thinking like a data scientist


So why should you, a data developer, endeavor to think like (or more like) a data scientist? What is the significance of gaining an understanding of the whys and hows of statistics? Specifically, what might be the advantages of thinking like a data scientist?

The following are just a few notions supporting the effort of making the move into data science:

  • Developing a better approach to understanding data
  • Using statistical thinking during the process of program or database designing
  • Adding to your personal toolbox
  • Increased marketability
  • Perpetual learning
  • Seeing the future

Developing a better approach to understanding data

Whether you are a data developer, systems analyst, programmer/developer, data scientist, or other business or technology professional, you need to be able to develop a comprehensive relationship with the data you are working with or designing an application or database schema for.

Some might rely on the data specifications provided as part of the overall project plan or requirements, while others (usually those with more experience) may supplement their understanding by performing some generic queries on the data. Either way, this is seldom enough.

In fact, in industry case studies, unclear, misunderstood, or incomplete requirements or specifications consistently rank among the top five reasons for project failure or added risk.

Profiling data is a process, characteristic of data science, aimed at establishing data intimacy (a clearer and more concise grasp of the data and its internal relationships). Profiling data also establishes context, and there are several general contextual categories that can be used to augment or increase the value and understanding of data for any purpose or project.

These categories include the following:

  • Definitions and explanations: These help you gain additional information or attributes about data points within your data
  • Comparisons: These help add a comparable value to a data point within your data
  • Contrasts: These help add an opposite to a data point to see whether it suggests a different perspective
  • Tendencies: These are typical mathematical calculations, summaries, or aggregations
  • Dispersion: This includes mathematical calculations (or summaries) such as range, variance, and standard deviation, describing the spread of a dataset (or a group within the data) around its average (see the sketch after this list)
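
As a rough illustration, the following base R sketch profiles the built-in mtcars dataset (a stand-in for whatever data you are working with); the columns used are only examples, but the same calls apply to any numeric field.

```r
# Profiling sketch using base R and the built-in mtcars dataset
# (a placeholder for your own data).
data(mtcars)

# Tendencies: typical calculations, summaries, or aggregations
mean(mtcars$mpg)                 # average miles per gallon
median(mtcars$mpg)               # middle value
summary(mtcars$mpg)              # min, quartiles, mean, and max in one call

# Dispersion: how widely the values spread around the average
range(mtcars$mpg)                # smallest and largest values
var(mtcars$mpg)                  # variance
sd(mtcars$mpg)                   # standard deviation

# Grouped tendency: average mpg by number of cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```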

Note

Think of data profiling as the process you may have used for examining data in a data file and collecting statistics and information about that data. Those statistics most likely drove the logic implemented in a program or how you related data in tables of a database.

Using statistical thinking during program or database designing

The process of creating a database design commonly involves several tasks that will be carried out by the database designer (or data developer). Usually, the designer will perform the following:

  1. Identify what data will be kept in the database.
  2. Establish the relationships between the different data points.
  3. Create a logical data structure to be used on the basis of steps 1 and 2.

Even during application program design, a thorough understanding of how the data works is essential. Without an understanding of average or default values, the relationships between data points, groupings, and so on, the resulting application is at risk of failing.

One idea for applying statistical thinking to data design is the case where limited real data is available. If enough data cannot be collected, you can create sample (test) data using a variety of sampling methods, such as probability sampling.

Note

A probability-based sample is created by constructing a list of the target population values, called a sample frame, and then applying a randomized process for selecting records from that frame, called a selection procedure. Think of this as creating a script to generate records of sample data based on your knowledge of the actual data, as well as some statistical logic, to be used for testing your designs.
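
A minimal sketch of such a script in base R might look like the following; the sample frame, field names, and distributions are hypothetical placeholders that you would replace with your knowledge of the actual data.

```r
# Hypothetical probability-based sampling script (base R).
set.seed(42)                                   # make the sample reproducible

# 1. Sample frame: a list of the target population values
sample_frame <- data.frame(
  customer_id = 1:10000,
  region      = sample(c("North", "South", "East", "West"), 10000, replace = TRUE),
  balance     = round(rnorm(10000, mean = 2500, sd = 800), 2)
)

# 2. Selection procedure: a randomized process for selecting records from the frame
selected  <- sample(nrow(sample_frame), size = 500)   # simple random sample
test_data <- sample_frame[selected, ]

summary(test_data$balance)                     # quick check that the sample looks sensible
```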

Finally, approach any problem with scientific or statistical methods, and odds are you'll produce better results.

Adding to your personal toolbox

In my experience, most data developers tend to lock on to a technology or tool based upon a variety of factors (some of which we mentioned earlier in this chapter), becoming increasingly familiar with and (hopefully) more proficient in the product, tool, or technology, even across continuously released newer versions. One might suspect (and probably would be correct) that the more a developer uses the tool, the higher the skill level he or she establishes. Data scientists, however, seem to lock onto methodologies, practices, or concepts more than the actual tools and technologies they use to implement them.

This turning of focus (from tool to technique) changes one's mindset to thinking about what tool best serves my objective rather than how this tool serves my objective.

Note

The more tools you are exposed to, the broader your thinking will become as a developer or data scientist. The open source community provides outstanding tools that you can download, learn, and use freely. One should adopt a mindset of what's next or new to learn, even if it's only to compare the features and functions of a new tool with those of your preferred tool. We'll talk more about this in the Perpetual learning section of this chapter.

An exciting example of a currently popular data development or data-enabling tool is MarkLogic (http://www.marklogic.com/). This is an operational and transactional enterprise NoSQL database designed to integrate, store, manage, and search more data than ever before. MarkLogic received the 2017 DAVIES Award for best Data Development Tools. R and Python seem to be the top options for data scientists.

Note

It would not be appropriate to end this section without the mention of IBM Watson Analytics (https://www.ibm.com/watson/), currently transforming the way the industry thinks about statistical or cognitive thinking.

Increased marketability

Data science is clearly an ever-evolving field with exponentially growing popularity. In fact, I'd guess that if you ask a dozen professionals, you'll most likely receive a dozen different definitions of what a data scientist is (and of their place within a project or organization); most would agree, however, on their importance and on the vast number of opportunities that exist within the industry and the world today.

"Data scientists face an unprecedented demand for more models, more insights... there's only one way to do that: they have to dramatically speed up the insights to action. In the future, data scientists must become more productive. That's the only way they're going to get more value from the data."

- Gualtieri, https://www.datanami.com/2015/09/18/the-future-of-data-science/

Data scientists are relatively hard to find today. If you do your research, you will find that today's data scientists may have a mixed background consisting of mathematics, programming and software design, experimental design, engineering, communication, and management skills. In practice, you'll see that most data scientists aren't specialists in any one aspect; rather, they possess varying levels of proficiency in several areas or backgrounds.

"The role of the data scientist has unequivocally evolved since the field of statistics emerged over 1,200 years ago. Despite the term only existing since the turn of this century, it has already been labeled The Sexiest Job of the 21st Century, which, understandably, has created a queue of applicants stretched around the block."

- Pearson, https://www.linkedin.com/pulse/evolution-data-scientist-chris-pearson

Note

Currently, there is no official data scientist job description (or prerequisite list for that matter). This presents you with the opportunity to create your own flavour of the data scientist, delivering value in new ways to your organization.

Perpetual learning

The idea of continued assessment or perpetual learning is an important statistical concept to grasp. Consider learning enhanced skills of perception as a working definition. For example, in statistics, we can refer to the idea of cross-validation. This is a statistical approach for measuring (assessing) a statistical model's performance. The practice involves identifying a set of validation values and then running a model for a set number of rounds (continuously) on sample datasets, then averaging the results of each round to ultimately see how good a model (or approach) might be at solving a particular problem or meeting an objective.

The expectation here is that, given the performance results, adjustments can be made to tweak the model so that it is able to identify insights when used with a real or full population of data. Not only is this concept a practice the data developer should use for refining or fine-tuning a data design or data-driven application process, but it is also great life advice in the form of try, learn, adjust, and repeat.
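
As a hedged sketch of the idea, the following base R code runs a simple k-fold cross-validation of a linear model on the built-in mtcars data; the model formula, the number of folds, and the error measure (RMSE) are only assumptions chosen for illustration.

```r
# k-fold cross-validation sketch in base R, using mtcars and a linear model
# as stand-ins for your own data and model.
set.seed(123)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))    # assign each row to a fold

rmse_per_round <- numeric(k)
for (i in 1:k) {
  train <- mtcars[folds != i, ]                          # fit on k - 1 folds
  test  <- mtcars[folds == i, ]                          # validate on the held-out fold
  fit   <- lm(mpg ~ wt + hp, data = train)
  preds <- predict(fit, newdata = test)
  rmse_per_round[i] <- sqrt(mean((test$mpg - preds)^2))  # error for this round
}

mean(rmse_per_round)   # average the rounds to assess how good the approach is
```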

Note

The idea of model assessment is not unique to statistics. Data developers might consider this similar to the act of predicting SQL performance or perhaps the practice of an application walkthrough where an application is validated against the intent and purpose stated within its documented requirements.

Seeing the future

Predictive modeling uses the statistics of data science to predict or foresee a result (actually, a probable result). This may sound a lot like fortune telling, but it is more about using cognitive reasoning to interpret information (mined from data) and draw a conclusion. In the same way that a scientist might be described as someone who acts methodically in an attempt to obtain knowledge or to learn, a data scientist might be thought of as someone who tries to make predictions using statistics and (machine) learning.
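
To make this concrete, here is a minimal sketch in base R: a simple linear model is fitted to historical observations (mtcars again, purely as a placeholder) and used to predict a probable result for new cases, with a prediction interval to underline that the output is a likelihood, not a certainty.

```r
# Fit a simple model on "historical" data and predict probable results
# for new cases (mtcars is only a placeholder dataset).
fit <- lm(mpg ~ wt + hp, data = mtcars)

new_cases <- data.frame(wt = c(2.5, 3.5), hp = c(110, 175))
predict(fit, newdata = new_cases, interval = "prediction")
# The fitted value and its lower/upper bounds express a probable result.
```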

Note

When we talk about predicting a result, it's really all about the probability of seeing a certain result. Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.

If you are a data developer who has worked on projects serving an organization's office of finance, you may understand why a business leader would find value not just in reporting on financial results (even the most accurate results are still historical events) but also in being able to make educated assumptions about future performance.

Perhaps you can see that, if you have a background in and are responsible for financial reporting, you can now take the step towards adding statistical predictions to those reports!

Note

Statistical modeling techniques can also be applied to any type of unknown event, regardless of when it occurred, such as in the case of crime detection and suspect identification.