#### Overview of this book

Data Analysis with Stata

## Variables and data types

There are different types of variables and data types, which we are going to see in this section.

### Indicators or data variables

The `browse` and `edit` commands let you inspect the data before drawing any conclusions from it. Data variables store the substantive measurements: in the following table, the income data for different nations is stored in the `Cccgdp` variable and the population data is stored in the `Pops` variable. To know which country and year each value refers to, indicator variables are also needed. In the following case, `Countrycode` and `Yr` identify the country and the year to which each GDP (`Cccgdp`) and population (`Pops`) value belongs. The data might be as follows:

| Country | Countrycode | Yr | Pops | Cccgdp | Openss |
|---|---|---|---|---|---|
| India | IND | 2010 | 23452.9 | 10897.23 | 23.11111 |
| U.S. | USA | 2010 | 22222.1 | 23987.23 | 90.42231 |
| Pakistan | PAK | 2010 | 11111.2 | 23675.21 | 10.22291 |
| China | CHN | 2010 | 98765 | 97654.94 | 30.98765 |
| Russia | RUS | 2010 | 19876 | 65745.11 | 43.34343 |
| Germany | GER | 2010 | 23467 | 23874.35 | 23.74747 |

After importing the data in Stata, it is always a good practice to examine the data. It gives you an advantage in any modeling or visualization exercise.
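As a minimal sketch, the sample table can be typed into Stata directly with `input`, which is handy for trying out the commands in this section. The lowercase variable names and the storage types (`str10`, `int`, `double`) are assumptions made for illustration; the values are copied from the table.

```stata
* Sketch: enter the six sample rows by hand.
* Variable names and storage types are illustrative assumptions.
clear
input str10 country str3 countrycode int yr double pops double cccgdp double openss
"India"    "IND" 2010 23452.9 10897.23 23.11111
"U.S."     "USA" 2010 22222.1 23987.23 90.42231
"Pakistan" "PAK" 2010 11111.2 23675.21 10.22291
"China"    "CHN" 2010 98765   97654.94 30.98765
"Russia"   "RUS" 2010 19876   65745.11 43.34343
"Germany"  "GER" 2010 23467   23874.35 23.74747
end

* Show variable names, storage types, and display formats
describe

* Print every observation without table borders
list, clean
```

Stata variable names are case-sensitive, so `Pops` and `pops` would be two different variables.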

### Examining the data

When you first read data into Stata, check that all the variables and observations are present and in the correct format.

While the `browse` and `edit` commands show the raw data in a spreadsheet-like window, the `list` command prints the values in the Results window. This works well for small datasets. For bigger datasets, restrict the listing to selected variables or observations. An example is shown as follows:

```
list country* yr pops
```
```
country       countrycode     yr        pops
India         IND             2010      23452.9
U.S.          USA             2010      22222.1
Pakistan      PAK             2010      11111.2
China         CHN             2010      98765
Russia        RUS             2010      19876
Germany       GER             2010      23467
```

In the preceding command, the asterisk is a wildcard: `country*` tells Stata to include every variable whose name begins with `country`. Alternatively, we could keep all the variables but list only a limited number of observations, for example, the 14th to the 19th observation with `list in 14/19`.

The resulting listing contains every variable and has the following layout:

| Country | Countrycode | Yr | Popscon | Cccgdps | kOpenss |
|---|---|---|---|---|---|
| India | IND | 2010 | 23452.9 | 10897.23 | 23.11111 |
| U.S. | USA | 2010 | 22222.1 | 23987.23 | 90.42231 |
| Pakistan | PAK | 2010 | 11111.2 | 23675.21 | 10.22291 |
| China | CHN | 2010 | 98765 | 97654.94 | 30.98765 |
| Russia | RUS | 2010 | 19876 | 65745.11 | 43.34343 |
| Germany | GER | 2010 | 23467 | 23874.35 | 23.74747 |
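To see what a wildcard in a variable list actually matches, a small sketch such as the following can help. The two-row dataset and the macro name `matched` are illustrative assumptions; `unab` is Stata's command for expanding a variable-list abbreviation into a local macro.

```stata
* Sketch: check which variables the country* wildcard selects.
clear
input str10 country str3 countrycode int yr double pops
"India" "IND" 2010 23452.9
"China" "CHN" 2010 98765
end

* country* selects both country and countrycode
list country* yr pops

* Expand the abbreviation and store the matching names in a local macro
unab matched : country*
display "`matched'"
```

Checking the expansion first is a cheap way to avoid listing or dropping more variables than intended.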

#### How to subset the data file using in and if

The previous example used the `in` qualifier, which restricts a command to a range of observation numbers. For example:

• `list in 14/19`

• `list in 90/l`

• `list in 30/l`

These three commands behave as follows:

• The first command lists observations 14 through 19

• The second command lists observations from 90 through the last observation (`l`, the lowercase letter L, stands for the last observation)

• The third command lists observations from 30 through the last observation
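The ranges above can be tried on a synthetic dataset. This is a sketch that assumes 100 observations and a hypothetical `id` variable; `count` accepts the same `in` qualifier as `list`, which makes it easy to confirm how many observations a range covers.

```stata
* Sketch: explore in ranges on 100 synthetic observations.
clear
set obs 100
generate id = _n

* Observations 14 through 19
list in 14/19

* From observation 90 through the last one (l = last)
count in 90/l
local n_from90 = r(N)

* From observation 30 through the last one
count in 30/l
local n_from30 = r(N)

* Negative numbers count from the end: the last three observations
list in -3/l

display "From 90: `n_from90'   From 30: `n_from30'"
```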

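The `if` qualifier is the other way of subsetting data. It takes a logical expression that is evaluated as true or false for each observation, and the command applies only to the observations for which it is true. The following example lists the observations for the year 2010, where the variable name is `yr`:

```
list if yr == 2010
```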
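Conditions can be combined with `&` (and) and `|` (or), and `if` can be used together with `in`. The following is a sketch on a few rows; the 2009 row is a hypothetical value added so that the year filter has something to exclude.

```stata
* Sketch: combining conditions in an if qualifier.
clear
input str3 countrycode int yr double pops
"IND" 2009 23000
"IND" 2010 23452.9
"USA" 2010 22222.1
"PAK" 2010 11111.2
end

* Equality tests use ==, not =
list if yr == 2010

* Both conditions must hold
list if yr == 2010 & pops > 20000

* Either condition may hold; string values go in double quotes
list if countrycode == "IND" | countrycode == "PAK"

* if and in together: 2010 rows among the first three observations
list if yr == 2010 in 1/3

* Count the matching observations instead of listing them
count if yr == 2010 & pops > 20000
local n_match = r(N)
```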

The `browse` window is used to examine the raw data. In big datasets, however, it is often impractical to view every variable at once. In that case, give `browse` the list of variables you want to examine:

```
browse country yr popscon
```

Note that the `edit` command, unlike `browse`, lets you change the dataset manually. When the dataset is large (big data, as it is called in today's world), checking individual values through `browse` or `edit` becomes difficult. In this case, the `assert` command is helpful: it checks whether a statement about the data is true for every observation. For example, in the case of the population of the country (`popscon`), it can verify that all the values are positive:

```
* Passes silently when every popscon value is positive
assert popscon > 0
* Fails with an error in that case, because no value is negative
assert popscon < 0
```

If the asserted expression is true for every observation, then `assert` does not give any output. However, if it is false for any observation, Stata reports the number of contradictions and stops with an error message.
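Because a failed `assert` stops a do-file, it is sometimes useful to trap the error with `capture` and inspect the return code `_rc`. The sketch below assumes a three-row dataset in which the third population value is deliberately invalid.

```stata
* Sketch: trap a failing assert instead of halting.
clear
input str3 countrycode double popscon
"IND" 23452.9
"USA" 22222.1
"PAK" -1
end

* The -1 is a deliberately bad, hypothetical value.
* capture noisily shows the message but lets execution continue.
capture noisily assert popscon > 0
local rc_bad = _rc

* The same check restricted to the valid rows succeeds
capture assert popscon > 0 if countrycode != "PAK"
local rc_ok = _rc

display "Return code for failing assert: `rc_bad'"
display "Return code for passing assert: `rc_ok'"
```

A return code of 0 means the assertion held; a nonzero code means at least one observation contradicted it.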

The `describe` command reports basic information about a dataset and its variables, such as the number of observations, the number of variables, the size of the dataset, and the storage type and display format of each variable. Typed on its own, `describe` covers the data in memory. With `using`, it can also be applied to a file that has not been read into Stata. An example is given as follows:

```
describe using "E:\Ind-Health-sample.dta"
```

The `codebook` command gives more detailed information on the variables in the dataset. Without a list of variables, it covers all of them. To restrict it to one variable, name it, for example, `codebook country`.
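The difference between describing a file on disk and describing the data in memory can be sketched as follows. The temporary file stands in for a real `.dta` file, and the two-row dataset is an illustrative assumption.

```stata
* Sketch: describe a file without loading it, then load and inspect it.
clear
input str10 country int yr
"India" 2010
"China" 2010
end

* Save to a temporary file (a stand-in for a real .dta path)
tempfile sample
save "`sample'"

* Empty the memory, then describe the file on disk
clear
describe using "`sample'"

* Load the file and describe the data now in memory
use "`sample'", clear
describe

* Detailed report for a single variable
codebook country
```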

The `summarize` command delivers summary statistics: the number of observations, the mean, the standard deviation, the minimum, and the maximum. The following output illustrates this:

```
summarize

Variable         Obs      Mean        Std. Dev.    Min          Max
Cntry              0
countrycode        0
Yr                97      2000        2.156        1990         2010
Popscon           97      87634.46    8374.33      29383.9      93830
ccCgdps           97      67544.23    4100.682     15890.71     98739.67
kOpenss           97      34          4            13           50
Chi-ppl           97      23.6        3.56         10.456       40.8796
Fdhsa             97      19.56       9.567        12.456       34.98765
Gdkliyu           97      1.987456    1.2          -3.238917    6.46896
```

As we can see in the preceding output, string variables such as `Cntry` and `countrycode` do not hold numbers, so they show zero observations and no summary statistics. `Yr` is a numeric variable, so its summary statistics are reported. For more details, such as percentiles, skewness, and kurtosis, use the `detail` option: `summarize, detail`.
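`summarize` also leaves its results in `r()`, where they can be reused in later commands. The following sketch uses a hypothetical variable `x` holding the values 1 to 5 and shows the string-variable case as well.

```stata
* Sketch: reuse the results that summarize stores in r().
clear
input str3 countrycode double x
"AAA" 1
"BBB" 2
"CCC" 3
"DDD" 4
"EEE" 5
end

* Basic summary; the mean is then available as r(mean)
summarize x
local mean_x = r(mean)

* The detail option adds percentiles, including the median r(p50)
summarize x, detail
local median_x = r(p50)

* A string variable has no numeric observations to summarize
summarize countrycode
local n_string = r(N)

display "Mean: `mean_x'   Median: `median_x'   String obs: `n_string'"
```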

Stata offers a wide range of graphs. Typing `help graph` in Stata shows the available graph types and their options. A histogram can be created through the following command:

```
graph twoway histogram cccgdps
```

For a scatter plot, use the following command:

```
graph twoway scatter cccgdps popscon
```

Stata's advanced graphs are richly featured, but they can be slow to draw. When a graph is needed only to inspect the data, and not for papers or presentations, the older version 7 graphics can be the better choice. This can be seen as follows:

```
graph7 cccgdps popscon
```

Saving the dataset takes a single command, which is represented as follows:

```
save "E:\Stata1\t1 less India pwt 80-2010.dta", replace
```

If a file with the same name already exists, the `replace` option is needed: it overwrites the previous version with the data currently in memory. If the old version is to be kept for some reason, save the data under a different name. Keep in mind that saving a revised dataset over the original file changes the original permanently. As long as the revised data has not been saved over it, you can discard the changes and start again by reopening the original file.
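The save-and-reopen workflow can be sketched as follows. Temporary files stand in for real file paths, and the three-row dataset is an illustrative assumption.

```stata
* Sketch: keep the original file, save a revision separately, then start over.
clear
input str3 countrycode double pops
"IND" 23452.9
"USA" 22222.1
"PAK" 11111.2
end

* Temporary files stand in for real .dta paths
tempfile original revised
save "`original'"

* Revise the data and save it under a different name
drop if countrycode == "PAK"
save "`revised'"

* Saving over an existing file without replace is refused
capture save "`revised'"
local rc_exists = _rc

* With replace, the existing file is overwritten
save "`revised'", replace

* Discard the revision in memory by reopening the original
use "`original'", clear
```

Saving the revision under its own name leaves the original file untouched, so reopening it always restores the starting point.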

There are two ways to protect the data while you experiment with it. One option is to save the current data before revising it and later, if you don't want to keep the changes, reopen the saved version with `use`. Another option is to use the `preserve` and `restore` commands: `preserve` takes a snapshot of the data in memory, and the data returns to that state when you type `restore`.
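A minimal sketch of the `preserve` and `restore` pair, assuming ten synthetic observations and a hypothetical `id` variable:

```stata
* Sketch: make a temporary change and undo it with restore.
clear
set obs 10
generate id = _n

* Take a snapshot of the data in memory
preserve

* Work on a subset; only three observations remain
drop if id > 3
local n_inside = _N

* Bring back the snapshot
restore
local n_after = _N

display "Inside preserve: `n_inside'   After restore: `n_after'"
```

When `preserve` is used inside a do-file, Stata restores the data automatically when the do-file ends, even if `restore` was never typed.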