Data Analysis with STATA

Variables and data types


This section introduces the kinds of variables found in a Stata dataset and the commands used to inspect them.

Indicators or data variables

The browse and edit commands are helpful for looking through the data and drawing first conclusions from it. Data variables store the fundamental data. As shown in the following table, the income data for different nations is stored in the Cccgdp variable and the population data is stored in the Pops variable. To know which country and year each value belongs to, indicator variables are needed. In the following case, Countrycode and Yr identify the country and the year for the country's GDP (Cccgdp) and population (Pops) data. The data might be as follows:

Country     Countrycode    Yr      Pops       Cccgdp      Openss
India       IND            2010    23452.9    10897.23    23.11111
U.S.        USA            2010    22222.1    23987.23    90.42231
Pakistan    PAK            2010    11111.2    23675.21    10.22291
China       CHN            2010    98765      97654.94    30.98765
Russia      RUS            2010    19876      65745.11    43.34343
Germany     GER            2010    23467      23874.35    23.74747

After importing the data in Stata, it is always a good practice to examine the data. It gives you an advantage in any modeling or visualization exercise.
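To experiment with the commands in this chapter, the sample table above could be typed into Stata by hand. The following is a minimal sketch, assuming lowercase variable names that mirror the table's column headings; the string widths (str10, str3) are choices made for this illustration and are not taken from the original data file.

```stata
* Enter the six sample rows by hand (illustrative only)
clear
input str10 country str3 countrycode yr pops cccgdp openss
"India"    "IND" 2010 23452.9 10897.23 23.11111
"U.S."     "USA" 2010 22222.1 23987.23 90.42231
"Pakistan" "PAK" 2010 11111.2 23675.21 10.22291
"China"    "CHN" 2010 98765   97654.94 30.98765
"Russia"   "RUS" 2010 19876   65745.11 43.34343
"Germany"  "GER" 2010 23467   23874.35 23.74747
end

* Check that all six observations and six variables arrived
describe
list
```

Because Stata variable names are case-sensitive, sticking to lowercase names avoids mismatches between the data and the commands typed later.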

Examining the data

It is a good idea to examine your data when you first read it into Stata: check that all the variables and observations are present and in the correct format.

While the browse and edit commands show the raw data in a spreadsheet-like window, the list command prints the data in the Results window. Small datasets can be listed in full. For bigger datasets, restrict the listing to selected variables or observations. An example is shown as follows:

list country* yr pops
country       countrycode     yr        pops
India         IND             2010      23452.9
U.S.          USA             2010      22222.1
Pakistan      PAK             2010      11111.2
China         CHN             2010      98765
Russia        RUS             2010      19876
Germany       GER             2010      23467

In the preceding command, the star (*) is a wildcard: it instructs Stata to include every variable whose name begins with country. Alternatively, we could keep all variables but list only a limited number of observations, for example, observations 14 through 19:

The following table shows the output of list in 14/19, with all variables for observations 14 to 19:

Country     Countrycode    Yr      Popscon    Cccgdps     kOpenss
India       IND            2010    23452.9    10897.23    23.11111
U.S.        USA            2010    22222.1    23987.23    90.42231
Pakistan    PAK            2010    11111.2    23675.21    10.22291
China       CHN            2010    98765      97654.94    30.98765
Russia      RUS            2010    19876      65745.11    43.34343
Germany     GER            2010    23467      23874.35    23.74747

How to subset the data file using IN and IF

In the previous part, the in qualifier was used; it restricts a command to a range of observation numbers. For example:

  • list in 14/19

  • list in 90/l

  • list in 30/l

These three commands do the following:

  • The first command lists observations from 14 to 19

  • The second command lists observations from 90 till the last observation

  • The third command lists observations from 30 till the last observation

Here, l is the lowercase letter L (not the digit 1) and stands for the last observation.
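The in ranges above can be tried on any dataset. The sketch below assumes Stata's bundled auto example dataset (loaded with sysuse auto), which is not part of this chapter's data; the variables make, price, and mpg belong to that dataset.

```stata
* Load Stata's bundled example dataset (74 observations)
sysuse auto, clear

* Observations 14 through 19, selected variables only
list make price in 14/19

* From observation 70 to the last one (l is the letter, short for "last")
list make price in 70/l

* A negative number counts from the end: the last three observations
list make price in -3/l

* The wildcard m* matches every variable whose name starts with m
list m* in 1/3
```

Restricting both the variables and the observation range keeps the listing readable on large datasets.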

The if qualifier is the other way of subsetting data; it restricts a command to the observations for which a logical expression is true. The following example lists the observations for the year 2010, where the variable name is yr:

list if yr == 2010
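Conditions can be combined with & (and), | (or), and ! (not), and if can be used together with in. The following sketch assumes Stata's bundled auto example dataset rather than the chapter's data, so the variable names (make, price, mpg, rep78, foreign) come from that dataset.

```stata
sysuse auto, clear

* Foreign cars that cost more than 10,000
list make price if foreign == 1 & price > 10000

* if and in together: within the first 20 observations,
* list the cars with mpg of at least 25
list make mpg if mpg >= 25 in 1/20

* Numeric missing values (.) compare as larger than any number,
* so exclude them explicitly when using > or >=
list make rep78 if rep78 >= 5 & !missing(rep78)

* count accepts the same qualifiers and reports how many observations match
count if foreign == 1
```

Note that equality is tested with a double equals sign (==); a single = is used only for assignment.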

In order to examine the raw data, the browse window is used. However, in big datasets the window becomes unwieldy when only a few variables are of interest. In that case, name the variables you want to examine after the browse command:

browse country yr popscon

Note that the edit command, unlike browse, lets you change the dataset manually. The assert command lets Stata check the observations for you. With bigger data (or big data, as it is called in today's world), checking individual values through the browse or edit commands becomes difficult, and assert is helpful: it verifies whether a statement about the data is true or false for every observation. For example, for the population of the country (popscon), the first command below checks that all values are positive, while the second makes the opposite claim:

assert popscon > 0
assert popscon < 0

If the asserted condition is true for every observation, assert does not give any output. However, if it is false for at least one observation, Stata reports the number of contradictions and stops with an error message.
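The behavior of assert can be sketched as follows. The example assumes Stata's bundled auto dataset, not the chapter's data; capture is used here so that a failing assertion does not stop a do-file.

```stata
sysuse auto, clear

* True for every observation: assert prints nothing
assert price > 0
assert !missing(price)

* rep78 contains missing values, so this assertion fails.
* capture suppresses the error; the return code is then
* available in _rc (nonzero means the assertion failed)
capture assert !missing(rep78)
display "return code: " _rc

* Inspect the offending observations
list make rep78 if missing(rep78)
```

Placing a few assert lines at the top of a do-file is a cheap way to catch data problems before any analysis runs.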

The describe command reports fundamental information about a dataset and its variables, such as the size of the dataset, the number of observations and variables, and the storage type and display format of each variable. Typed on its own, describe reports on the data in memory. With using, it can also be applied to a file that has not been read into Stata. An example is given as follows:

describe using "E:\Ind-Health-sample.dta"

The codebook command gives detailed information on the variables in the dataset. Without a list of variables, it covers every variable; with a list, it covers only those named, for example, codebook country.
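The following sketch shows describe and codebook side by side. It assumes Stata's bundled auto dataset and a temporary file created only for this illustration; because it uses a local macro, it is best run as a do-file.

```stata
sysuse auto, clear

* Describe the data currently in memory
describe

* Only the header information, without the variable table
describe, short

* Describe a file on disk without loading it.
* tempfile stores a temporary file name in the local macro mycopy
tempfile mycopy
save "`mycopy'"
describe using "`mycopy'"

* codebook for a single variable, then a compact overview of all variables
codebook make
codebook, compact
```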

The summarize command delivers summary statistics: the number of observations, mean, standard deviation, minimum, and maximum. The following table shows its output:

summarize
Variable       Obs    Mean        Std. Dev.    Min          Max
Cntry          0
countrycode    0
Yr             97     2000        2.156        1990         2010
Popscon        97     87634.46    8374.33      29383.9      93830
ccCgdps        97     67544.23    4100.682     15890.71     98739.67
kOpenss        97     34          4            13           50
Chi-ppl        97     23.6        3.56         10.456       40.8796
Fdhsa          97     19.56       9.567        12.456       34.98765
Gdkliyu        97     1.987456    1.2          -3.238917    6.46896

As we can see in the preceding table, string variables such as Cntry and countrycode contain text rather than numbers, so Stata reports zero observations and no summary statistics for them. Yr is a numeric variable; therefore, it has summary statistics. For more details, such as percentiles, the detail option can be used: summarize, detail.
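As a hedged illustration of summarize and its detail option, the sketch below uses Stata's bundled auto dataset (an assumption; it is not the chapter's data) and shows how the stored results can be reused.

```stata
sysuse auto, clear

* Basic summary of two numeric variables
summarize price mpg

* Percentiles, skewness, and kurtosis in addition to the basics
summarize price, detail

* summarize leaves its results in r(); list them and reuse them
return list
display "median price: " r(p50)
display "mean price:   " r(mean)

* A string variable is reported with 0 observations
summarize make
```

The stored results are overwritten by the next command that returns results in r(), so copy any value you need into a macro or scalar first.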

Stata offers a wide range of graph types, and documentation for them is available by typing help graph. A histogram can be created through the following command:

graph twoway histogram cccgdps

For a scatter plot, use the following command, where the first variable is plotted on the y axis and the second on the x axis:

graph twoway scatter cccgdps popscon
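The histogram and scatter plot commands can be customized with options. The following sketch assumes Stata's bundled auto dataset; the bin count and the title text are arbitrary choices made for illustration.

```stata
sysuse auto, clear

* Histogram with 10 bins, showing frequencies instead of densities
histogram price, bin(10) frequency

* Scatter plot: the first variable goes on the y axis, the second on the x axis
twoway scatter price mpg, title("Price versus mileage")

* Overlay a linear fit on the scatter plot
twoway (scatter price mpg) (lfit price mpg)
```

The shorter forms histogram and twoway are equivalent to graph twoway histogram and graph twoway for these plots.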

Stata's newer graphics are more capable, but they can be slower to draw. For a quick look at the data, when the graph is not meant for papers or presentations, the older version 7 graphics may be enough. They are invoked as follows:

graph7 cccgdps popscon

The dataset is saved with the save command, as follows:

save "E:\Stata1\t1 less India pwt 80-2010.dta", replace

If a file with the same name already exists, the replace option is required; it overwrites the previous version with the data in memory. If the old version is to be kept for some reason, save the data under a different name. Keep in mind that saving revised data over the original file changes the original file's content. To discard unsaved changes and start again, just reopen the saved file.
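A minimal sketch of this save-and-reopen workflow follows. It assumes Stata's bundled auto dataset, and the file name auto_copy.dta is a hypothetical example written to the current working directory.

```stata
sysuse auto, clear

* Work on a copy so that the original file stays untouched;
* replace overwrites auto_copy.dta if it already exists
save "auto_copy.dta", replace

* Make a change to the data in memory
drop if foreign == 1
count

* Changed your mind? Reload the saved copy and discard the change
use "auto_copy.dta", clear
count
```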

There are two ways to make temporary changes and then return to the original data. One option is to save the current data, revise it, and later, if you don't want to keep the changes, reopen the saved version. Another option is to use the preserve and restore commands: preserve takes a snapshot of the data, and restore brings that snapshot back.
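The preserve and restore approach can be sketched as follows, again assuming Stata's bundled auto dataset rather than the chapter's data.

```stata
sysuse auto, clear
count

* Take a snapshot of the data in memory
preserve

* Temporary change: keep only domestic cars and summarize them
drop if foreign == 1
count
summarize price

* Bring the snapshot back; all observations are available again
restore
count
```

When preserve is used inside a do-file, Stata restores the data automatically when the do-file ends, unless restore has already been issued.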