Data Collection
The first question you should be asking before starting any project is "What is my question?" If you don't know your question, then you won't know how to get an answer. In science and statistics, this is called having a hypothesis. Typical hypotheses might be:
- Is smoking related to lung cancer?
- Is there an association between sales of ice cream and haemorrhoid cream?
- Is there a correlation between coffee consumption and insomnia?
It's important to start with a question, because this will help you decide what data you should collect (and what data you shouldn't).
It's not usual that you can answer these types of question by collecting data on just those variables. It's much more likely that there will be other factors that may have an influence on the answer and all of these factors must be taken into account. If you want to answer the question is smoking related to lung cancer? then you'll typically also collect data on age, height, weight, family history, genetic factors, and environmental factors, and your dataset will start to become quite large in comparison with your hypothesis.
So, what data should you collect? Well, that depends on your hypothesis, the perceived wisdom of current thinking, and any previous research carried out, but ultimately, if you collect data sensibly, you will likely get sensible results and vice versa, so it's a good idea to take some time to think it through carefully before you start.
I'm not going to go into the finer points of data collection and cleaning here, but it's important that your dataset conforms to a few simple standards before you can start analyzing it.
By the way, if you want a copy of my book Practical Data Cleaning, you can get a free copy of it by following the instructions in the tiny little advert for it at the end of this section…
Dataset Checklist
OK, so here we go. Here are the essential features of a ready-to-go dataset for association and correlation analysis.
Your dataset is a rectangular matrix of data. If your data is spread across different spreadsheets or tables, then it's not a dataset, it's a database, and it's not ready for analysis:
- Each column of data is a single variable corresponding to a single piece of information (such as age, height, or weight, in this case).
- Column 1 is a list of unique consecutive numbers starting from one. This allows you to uniquely identify any given row and recover the original order of your dataset with a single sort command.
- Row 1 contains the names of the variables. If you use rows 2, 3, 4, and so on as the variable names, you won't be able to enter your dataset into a statistics program.
- Each row contains the details for a single sample (patient, case, test tube, and so on).
- Each cell should contain a single piece of information. If you have entered more than one piece of information in a cell (such as date of birth and their age), then you should separate the column into two or more columns (one for date of birth, another for age).
- Don't enter the number zero into a cell unless what has been measured, counted, or calculated results in the answer zero. Don't use the number zero as a code to signify "No Data". By now, you should have a well-formed dataset that is stored in a single Excel worksheet. Each column should be a single variable, with row 1 containing the names of the variables, and below this, each row should be a distinct sample or patient. It should look something like Figure 1.1.
Figure 1.1: A typical dataset used in association and correlation analysis
For the rest of this book, this is how I assume your dataset is laid out, so I might use the terms variable and column interchangeably, the same going for the terms row, sample, and patient.