Descriptive statistics are the first thing to explore when analyzing data in Stata. They provide an overview of the data. Stata has several commands for descriptive statistics. This guide explains the main ones using an example dataset, for both quantitative and qualitative variables.
I have used the life expectancy example dataset that ships with Stata. The data contains life expectancy, GNP per capita, and other related variables. The above picture shows the browser view of the data in Stata. To load this dataset, use the following command in Stata:
sysuse lifeexp
To describe the data and variables overall, we use the describe command in Stata:
describe
Note: When you simply use describe, it describes the whole dataset. If you write a specific variable name after the command, it shows the information (storage type, display format, and labels) for that variable only.
The above table shows basic information about the data and variables. The first highlighted part shows the number of observations (obs), which is 68 in our case. Next, vars is the number of variables, which is 6 in the above data. Then comes size, which gives the size of the data in memory (2,652 bytes in our case). In the highlighted right corner, the data label appears, which is Life expectancy 1998. This shows that the data is from 1998.
The columns below give further information about each individual variable. The first column shows the variable name. The second column shows the storage type. All variables are float or byte except country, which is a string variable. String variables hold text, so Stata cannot compute numerical statistics on them directly. The third column shows the display format. The fourth column shows the value label attached to the variable, if any. The fifth column holds the variable label, which explains the variable. This explanation is for humans only; Stata does not need it to process the data.
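The note about describing specific variables can be sketched as follows. This is only a minimal illustration; it assumes the lifeexp example data is available through sysuse, and the choice of lexp and gnppc is arbitrary.

```stata
* Load the example data shipped with Stata (clear drops any data in memory)
sysuse lifeexp, clear

* Describe only two variables instead of the whole dataset
describe lexp gnppc

* Show only the general information (observations, variables, sort order)
describe, short
```

The short option is handy when you only want to confirm the number of observations and variables without the per-variable listing.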
Another command used for descriptive statistics is summarize. It gives overall statistics such as the mean, standard deviation, minimum, and maximum of each variable. It works with any numeric variable; however, it is most meaningful for continuous data. To get the summary statistics, use the following command:
summarize
This command can also be written as sum, which is a short version of summarize.
Note: As explained earlier, when we use the bare command name, such as summarize, it takes the whole dataset into account and shows statistics for every variable. We can also write variable names after the command to get the statistics for those variables only.
The above table gives a comprehensive look at the variables in the data. The country variable has 0 observations, as shown in the above picture. This is because country is a string variable, and summarize computes statistics for numeric variables only. The string type can also be seen in the describe section above.
The above table shows 5 columns for each variable. The first column is Observations (Obs). It tells us how many non-missing observations each variable has in the data. The second column shows the mean, that is, the average value of each variable. The third column shows the standard deviation of each variable. This is a valuable measure of spread: it tells us how far individual observations typically lie from the mean. It can also be used to build ranges around the mean, such as the mean plus or minus one standard deviation. The fourth column shows the minimum value. Finally, the fifth column shows the maximum value of each variable.
How to describe a variable in Stata?
The above table in the summarize section helps us describe a variable in Stata. The region variable has 68 observations. Its mean is 1.5 and its standard deviation is 0.7431277. The minimum value is 1 and the maximum value is 3.
Note: region is a discrete variable. As I said earlier, the sum command is best suited to continuous variables. We can use sum for discrete variables too, but the result is not meaningful, as with the region variable above: a mean region code of 1.5 has no interpretation. For discrete variables, the tab command is best, which I will explain below.
Population growth (popgrowth) has 68 observations. Its mean is 0.9720588% with a standard deviation of 0.9311918. The minimum population growth in the data is -0.5% and the maximum is 3%.
Life expectancy (lexp) also has 68 observations in the sample. Its mean is 72.27941 with a standard deviation of 4.715315. The minimum value is 54 and the maximum value is 79.
GNP per capita (gnppc) has 63 observations, while the other variables above have 68. This means that 5 values of GNP per capita are missing. The mean is 8674.857 and the standard deviation is 10634.68. The minimum and maximum values are 370 and 39980, respectively.
Finally, the safewater variable has only 40 observations. This indicates that there are also missing values for this variable. The mean value is 76.1 with a standard deviation of 17.89112. The minimum and maximum values are 28 and 100, respectively.
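The points above about summarizing selected variables and spotting missing values can be checked with a short sketch. It assumes the same lifeexp example data; misstable summarize is used here only as one convenient way to list variables with missing values.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Summary statistics for two variables only
summarize lexp gnppc

* Overview of missing values across the numeric variables
misstable summarize

* Count the observations where gnppc is missing
count if missing(gnppc)
```

The count should match the gap between the 68 observations in the data and the 63 reported for gnppc.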
Summarize Data in more detail
The summarize (or sum) command has an option that describes each variable in more depth. We just need to add a comma and d (short for detail) at the end, like this:
sum lexp, d
This table shows an in-depth descriptive analysis of a single variable in Stata. I have used the life expectancy variable in the above case.
The table shows the values of the variable at different percentiles, from 1%, 5%, and 10% up to 95% and 99%. The next column shows the four smallest and the four largest values in the data. The right part shows the number of observations (Obs) and the mean, followed by the standard deviation. Finally, the variance, skewness, and kurtosis are also reported for this variable.
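The detailed statistics are also kept in memory after the command runs, which is useful when you want to reuse a number instead of copying it from the table. The sketch below assumes the lifeexp example data; the display labels are my own wording.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Detailed summary of life expectancy
summarize lexp, detail

* List every result that summarize stored in r()
return list

* Reuse individual results, for example the median and the skewness
display "Median life expectancy: " r(p50)
display "Skewness: " r(skewness)
```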
Frequency Distribution for Discrete variables
If your variable is discrete in nature, like gender, country, or region, then you should do a frequency distribution analysis in Stata. This tells you the frequency and percentage of each category in the data. To get the frequency distribution table, the tab command (short for tabulate) is used:
tab region
The above table shows 3 things about the region variable. The first column shows the frequency of each category of the variable. For example, the Europe and Asia category has 44 observations. The next column shows the percentage of each category. For example, the Europe and Asia category holds 64.71% of the observations. The last column gives the cumulative percentage.
N.A. means North America. It has 14 observations, which is 20.59% of the sample. The cumulative percentage up to the second category is 85.29%. S.A. means South America. It has 10 observations, or 14.71% of the sample. The cumulative percentage up to this point is 100%.
This frequency distribution table shows that Europe and Asia has the most observations. North America comes second. Finally, South America has the fewest observations in the sample. This gives a holistic view of a discrete variable in Stata.
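A few options make the frequency table easier to read. The following is a small sketch on the same lifeexp example data, not an exhaustive list of what tabulate can do.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Order the rows by descending frequency
tabulate region, sort

* Show the underlying numeric codes instead of the value labels
tabulate region, nolabel

* Include missing values as their own row, if there are any
tabulate region, missing
```

The nolabel version is the quickest way to find out which number stands behind a label such as N.A.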
Get Customized Descriptive Statistics through tabstat command
Up to this point, all the Stata commands produce a fixed set of descriptive statistics. This means you do not have the flexibility to tailor the results to your needs. The tabstat command is a useful command that lets you show only the statistics you want. The following are different options that we can use with tabstat:
tabstat popgrowth lexp gnppc, stat(mean sd min max)
The stat option after the comma lets you choose which descriptive statistics to include in the summary table. For example, in the above table, I have used 4 measures: mean, standard deviation (sd), minimum, and maximum.
The following table lists the statistics that we can get with the tabstat command. The structure stays the same as shown above; only the name of the measure changes.
| Statistic | Description |
|---|---|
| mean | Average value (mean) |
| n | Number of non-missing observations (count is a synonym) |
| sum | Total of each variable |
| max | Maximum value |
| min | Minimum value |
| range | Range (Formula = max - min) |
| sd | Standard deviation |
| variance | Variance |
| cv | Coefficient of variation (Formula = sd/mean) |
| semean | Standard error of the mean (Formula = sd/sqrt(n)) |
| skewness | Skewness |
| kurtosis | Kurtosis |
| p1 | 1st percentile value |
| p5 | 5th percentile value |
| p10 | 10th percentile value |
| p25 | 25th percentile value |
| median | 50th percentile value [also called median value] |
| p50 | Same value as median |
| p75 | 75th percentile value |
| p90 | 90th percentile value |
| p95 | 95th percentile value |
| p99 | 99th percentile value |
| iqr | Interquartile range (Formula = p75 - p25) |
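The two formulas in the table can be verified by hand. The sketch below, which assumes the lifeexp example data, first asks tabstat for cv and iqr and then rebuilds both from the results that summarize stores in r().

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Let tabstat report the coefficient of variation and interquartile range
tabstat lexp, stat(cv iqr)

* Recompute both from the stored results of summarize
summarize lexp, detail

* Coefficient of variation = sd / mean
display "cv  = " r(sd) / r(mean)

* Interquartile range = p75 - p25
display "iqr = " r(p75) - r(p25)
```

Both displayed numbers should agree with the tabstat output, up to display rounding.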
For example, to get the mean, 25th percentile, median, interquartile range, range, and number of observations, we can use the following command:
tabstat popgrowth lexp gnppc, stat(mean p25 median iqr range n)
Convert Rows to Columns
We can also switch the rows and columns of the above table so that each statistic gets its own column. We just need to add the col(stat) option at the end, like this:
tabstat popgrowth lexp gnppc, stat(mean p25 median iqr range n) col(stat)
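When variables are on very different scales, as popgrowth and gnppc are here, a common display format can make the table easier to read. This is a small sketch with the format() option of tabstat; the %9.2f format is just one reasonable choice.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* One column per statistic, all numbers shown with two decimals
tabstat popgrowth lexp gnppc, stat(mean sd min max) col(stat) format(%9.2f)
```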
How to get descriptive statistics by group in Stata?
We can also split the data by group and get the descriptive statistics for each group. For example, in our dataset, there are three regions, as shown above. To get the descriptive statistics by group, use the following command in Stata:
tabstat popgrowth lexp gnppc, by(region) stat(mean p25 median iqr range n) col(stat)
This table shows the descriptive statistics by group. There are three regions, the values are reported separately for each of them, and the overall total is given at the end. Therefore, for each variable, we have the statistics for each group in our data. Beautiful?
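There is more than one way to get group-wise statistics. As a sketch on the same lifeexp example data, the first command below drops the overall Total row, and the second uses the bysort prefix to run summarize once per region.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Group-wise table without the overall Total row
tabstat lexp gnppc, by(region) stat(mean median n) nototal

* Alternative: run summarize separately within each region
bysort region: summarize lexp gnppc
```

The bysort version sorts the data by region as a side effect, so the row order of the dataset changes.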
Show original observations using list command
The list command is useful for inspecting the original observations of the data. For example, to check the first 10 rows of two variables, use the following command:
list popgrowth lexp in 1/10
The in 1/10 qualifier at the end shows only the first 10 rows. You can edit it to get whichever rows you need.
Another useful command for arranging the data is sort:
sort region
This sorts the data by the region variable.
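Sorting and listing work well together when you want to see the extremes of a variable. The following sketch assumes the lifeexp example data; sort orders ascending, and gsort with a minus sign orders descending.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Ascending sort, then show the five lowest life expectancies
sort lexp
list country lexp in 1/5

* Descending sort with gsort, then show the five highest
gsort -lexp
list country lexp in 1/5
```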
Summarize Data using conditions
We can also summarize the data using if conditions, so that the summary statistics cover only the observations that meet a particular condition. In our case, we can summarize the data for the North America region only:
sum popgrowth lexp gnppc if region ==2
Note: North America is coded as 2 in our dataset. Therefore, we used region == 2 in above code.
Now, in the above table, we see results for the North America region only. This is a useful technique when we want summary statistics for a particular part of our data.
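Conditions can also be combined. The sketch below assumes the lifeexp example data, and the 10000 threshold is an arbitrary value chosen for illustration. One thing to keep in mind is that Stata treats missing numeric values as larger than any number, so a greater-than condition needs an explicit check for missing values.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Combine conditions with & (and) or | (or)
summarize lexp if region == 2 & !missing(gnppc)

* gnppc > 10000 alone would also keep rows where gnppc is missing,
* so exclude them explicitly
summarize lexp if gnppc > 10000 & !missing(gnppc)

* Count how many observations belong to North America
count if region == 2
```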
Convert Stata Tables into MS Word Format
Now that we have produced the descriptive statistics above, we can also export these tables to MS Word format. The commands used so far do not do this on their own, so we need an additional package for the task. One such package is asdoc. It is an easy-to-use package that converts descriptive statistics tables from Stata into high-quality MS Word tables.
To install this package, use the following code:
ssc install asdoc
If you already have this package installed in Stata, skip this step.
After installing the package, you just need to put asdoc before the command whose output you want to export. It will automatically convert your table into MS Word format. A link to the file will also be shown in Stata, like below:
asdoc sum popgrowth lexp gnppc if region ==2
The table in MS Word will look like the following:
Isn’t it beautiful? It is 🙂
Visualization of Data
After the descriptive tables, we now come to the visualization of variables in Stata. It is good practice to include charts in the descriptive analysis part.
Pie Charts in Stata
Pie charts are useful for discrete variables. They give a clear presentation of variables such as gender, region, and country.
graph pie, over(region)
The pie chart visualizes the region variable. Each category and its share of the observations are shown in a clean format. The chart shows that Europe and Asia has the largest share, followed by North America and then South America.
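The pie chart can be labelled and saved for use in a report. This is a minimal sketch on the lifeexp example data; the title text and the file name region_pie.png are hypothetical choices of mine.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Pie chart of region with the percentage printed on every slice
graph pie, over(region) plabel(_all percent, format(%4.1f)) title("Countries by region")

* Save the chart to an image file (hypothetical file name)
graph export region_pie.png, replace
```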
Bar charts in Stata
Bar charts are another way to visualize variables. To get a bar chart, use the following:
graph bar, over(lexp)
With no statistic specified, each bar shows the share of observations that have a given life expectancy value. This bar chart shows that the values 67, 69, 73, and 78 occur most often in the data. This gives an overall picture of the distribution of the life expectancy variable.
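Two related charts are often worth a look as well. The sketch below assumes the lifeexp example data: the first command draws the distribution of life expectancy as a histogram of counts, and the second compares the mean life expectancy across regions.

```stata
* Load the example data shipped with Stata
sysuse lifeexp, clear

* Number of countries at each life expectancy value
histogram lexp, discrete frequency

* Mean life expectancy per region as a bar chart
graph bar (mean) lexp, over(region)
```

The second chart is usually easier to read than a bar for every single value, because it has only one bar per region.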
To conclude, descriptive statistics are a great way to inspect the data in general. They give a holistic view of your dataset. Stata offers flexible commands for this summary analysis. User-written packages are also available that add extra functionality, such as converting Stata tables into MS Word format. Finally, visualizations such as pie charts, bar charts, and scatter plots give a polished look to your data analysis.