Associations and Correlations

By: Lee Baker

Overview of this book

Associations and correlations are ways of describing how a pair of variables change together as a result of their connection. By knowing the various available techniques, you can easily and accurately discover and visualize the relationships in your data. This book begins by showing you how to classify your data into the four distinct types that you are likely to have in your dataset. Then, with easy-to-understand examples, you’ll learn when to use the various univariate and multivariate statistical tests. You’ll also discover what to do when your univariate and multivariate results do not match. As the book progresses, it describes why univariate and multivariate techniques should be used as a tag team, and also introduces you to the techniques of visualizing the story of your data. By the end of the book, you’ll know exactly how to select the most appropriate univariate and multivariate tests, and be able to use a single strategic framework to discover the true story of your data.

Data Collection and Cleaning

Data Collection

Data Cleaning

Data Classification

Quantitative and Qualitative Data

Introduction to Associations and Correlations

Univariate Statistics

Correlations

Associations

Survival Analysis

Multivariate Statistics

Types of Multivariate Analysis

Using Multivariate Tests as Univariate Tests

Univariate versus Multivariate Analyses

Limitations and Assumptions of Multivariate Analysis

Creating Predictive Models with the Results of Multivariate Analyses

Visualizing Your Relationships

A Holistic Strategy to Discover Independent Relationships

Visualizing the Story of Your Data

Bonus: Automating Associations and Correlations

What is the Problem?

CorrelViz

Appendix

Data Collection

The first question you should be asking before starting any project is "What is my question?" If you don't know your question, then you won't know how to get an answer. In science and statistics, this is called having a hypothesis. Typical hypotheses might be:

Is smoking related to lung cancer?
Is there an association between sales of ice cream and haemorrhoid cream?
Is there a correlation between coffee consumption and insomnia?

It's important to start with a question, because this will help you decide what data you should collect (and what data you shouldn't).

It's unusual to be able to answer these types of questions by collecting data on just those variables. It's much more likely that other factors will influence the answer, and all of these factors must be taken into account. If you want to answer the question "Is smoking related to lung cancer?", then you'll typically also collect data on age, height, weight, family history, genetic factors, and environmental factors, and your dataset will start to become quite large in comparison with your hypothesis.

So, what data should you collect? Well, that depends on your hypothesis, the received wisdom of current thinking, and any previous research carried out. Ultimately, if you collect data sensibly, you will likely get sensible results, and vice versa, so it's a good idea to take some time to think it through carefully before you start.

I'm not going to go into the finer points of data collection and cleaning here, but it's important that your dataset conforms to a few simple standards before you can start analyzing it.

By the way, if you want a copy of my book Practical Data Cleaning, you can get a free copy of it by following the instructions in the tiny little advert for it at the end of this section…

Dataset Checklist

OK, so here we go. Here are the essential features of a ready-to-go dataset for association and correlation analysis.

Your dataset is a rectangular matrix of data. If your data is spread across different spreadsheets or tables, then it's not a dataset, it's a database, and it's not ready for analysis:

Each column of data is a single variable corresponding to a single piece of information (such as age, height, or weight, in this case).
Column 1 is a list of unique consecutive numbers starting from one. This allows you to uniquely identify any given row and recover the original order of your dataset with a single sort command.
Row 1 contains the names of the variables. If you use rows 2, 3, 4, and so on as the variable names, you won't be able to enter your dataset into a statistics program.
Each row contains the details for a single sample (patient, case, test tube, and so on).
Each cell should contain a single piece of information. If you have entered more than one piece of information in a cell (such as a patient's date of birth and age), then you should separate the column into two or more columns (one for date of birth, another for age).
Don't enter the number zero into a cell unless what has been measured, counted, or calculated results in the answer zero. Don't use the number zero as a code to signify "No Data".

By now, you should have a well-formed dataset that is stored in a single Excel worksheet. Each column should be a single variable, with row 1 containing the names of the variables, and below this, each row should be a distinct sample or patient. It should look something like Figure 1.1.

Figure 1.1: A typical dataset used in association and correlation analysis
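
The layout rules in the checklist can also be checked mechanically. What follows is only a minimal sketch in Python, not a method taken from the book: the function name check_dataset, the column names, and the sample values are all hypothetical, and it assumes the worksheet has already been read into a list of rows with the variable names in the first row.

```python
def check_dataset(rows):
    """Return a list of checklist problems found in `rows`.

    `rows` is a list of lists; the first inner list is row 1 (variable names).
    Hypothetical helper for illustration only.
    """
    if not rows:
        return ["dataset is empty"]
    problems = []
    header, body = rows[0], rows[1:]

    # Row 1 must hold a non-blank, unique name for every variable.
    names = [str(name).strip() for name in header]
    if any(name == "" for name in names):
        problems.append("row 1 has a blank variable name")
    if len(set(names)) != len(names):
        problems.append("row 1 has duplicate variable names")

    # Rectangular matrix: every row has as many cells as row 1.
    for number, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number} has {len(row)} cells, expected {len(header)}"
            )

    # Column 1 must be the consecutive numbers 1, 2, 3, ... in order.
    ids = [str(row[0]).strip() if row else "" for row in body]
    expected = [str(n) for n in range(1, len(body) + 1)]
    if ids != expected:
        problems.append("column 1 is not the consecutive numbers 1, 2, 3, ...")

    return problems


# Made-up example data, not taken from Figure 1.1.
good = [
    ["ID", "Age", "Height", "Weight"],
    [1, 34, 172, 70],
    [2, 51, 165, 82],
    [3, 29, 180, 77],
]
bad = [
    ["ID", "Age", "Age"],   # duplicate variable name
    [1, 34, 172],
    [3, 51],                # short row, and the ID skips 2
]

print(check_dataset(good))
for problem in check_dataset(bad):
    print(problem)
```

An empty list means the layout passes these three checks. The sketch cannot tell whether a cell holds more than one piece of information or whether a zero is a real measurement, so those points still need a human eye.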

For the rest of this book, this is how I assume your dataset is laid out, so I may use the terms variable and column interchangeably, and likewise the terms row, sample, and patient.
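
To see why the checklist warns against using zero as a code for "No Data", consider a small worked example. This is an illustrative sketch only: the weights are invented, and the helper mean_of_recorded is a hypothetical name, not something from the book or from any statistics package.

```python
from statistics import mean

# Hypothetical weights in kg; the third patient was never weighed.
weights_zero_coded = [70, 82, 0, 77]     # zero wrongly used to mean "No Data"
weights_blank = [70, 82, None, 77]       # missing value left empty instead


def mean_of_recorded(values):
    """Average only the cells that actually hold a measurement."""
    return mean([v for v in values if v is not None])


# The zero is treated as a real weight and drags the average down.
print(mean(weights_zero_coded))

# Leaving the cell empty keeps the average based on the three real weights.
print(mean_of_recorded(weights_blank))
```

With the zero code, the average is 229 / 4 = 57.25 kg; with the cell left empty, it is 229 / 3, or about 76.3 kg. A statistics program has no way of knowing that the zero was meant as a placeholder, so it will silently produce the first figure.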
