Spark for Data Science

By: Srinivas Duvvuri, Bikramaditya Singhal

Overview of this book

This is the era of Big Data. The term ‘Big Data’ implies big innovation and gives businesses a competitive advantage. Apache Spark was designed to perform Big Data analytics at scale, so it is equipped with the necessary algorithms and supports multiple programming languages. Whether you are a technologist, a data scientist, or a beginner to Big Data analytics, this book will provide you with the skills necessary to perform statistical data analysis, data visualization, and predictive modeling, and to build scalable data products or solutions using Python, Scala, and R. With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

Inferential statistics


We saw that descriptive statistics are extremely useful for describing and presenting data, but they do not provide a way to use sample statistics to infer population parameters or to validate a hypothesis we might have made. The techniques of inferential statistics address these requirements. Some of the important uses of inferential statistics are:

  • Estimation of population parameters

  • Hypothesis testing

Please note that a sample can never represent a population perfectly: every time we sample, we incur some sampling error, hence the need for inferential statistics. Let us spend some time understanding the various types of probability distributions that can help infer the population parameters.
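To make the idea of sampling error concrete, here is a minimal sketch in plain Python (standard library only, not Spark). It is an illustration under stated assumptions rather than an example from the book: the population is synthetic, and the function name sampling_errors, the population parameters (mean 50, standard deviation 10), and the seeds are all hypothetical choices. It draws repeated random samples and records how far each sample mean lands from the population mean.

```python
import random
import statistics


def sampling_errors(population, sample_size, n_samples, seed=42):
    """Return (sample mean - population mean) for repeated random samples.

    Hypothetical helper for illustration; samples are drawn without replacement.
    """
    rng = random.Random(seed)
    pop_mean = statistics.fmean(population)
    errors = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)
        errors.append(statistics.fmean(sample) - pop_mean)
    return errors


# Assumed synthetic population: 10,000 values from a normal distribution.
pop_rng = random.Random(0)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

# 200 samples of 100 observations each: every sample mean misses the
# population mean by some amount, and that gap is the sampling error.
errors = sampling_errors(population, sample_size=100, n_samples=200)
print("largest absolute sampling error:", max(abs(e) for e in errors))
print("spread of sampling errors:", statistics.pstdev(errors))
```

Calling the same function with a larger sample_size should shrink the spread of the errors, which is the intuition behind estimating a population parameter from a sample and attaching a margin of error to the estimate.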

Discrete probability distributions

Discrete probability distributions are used to model data that is discrete in nature, which means that data can only take on certain values, such as integers. Unlike categorical variables, discrete variables...