In this article, I interpret regression results in Stata. I have analyzed a multiple regression model using automobile data from 1978. I used the price of an automobile as my dependent variable and 7 independent variables (also called predictors) in the model. The objective of the regression model is to check which of these 7 independent variables affect the price of an automobile.
Data Description
The dataset ships with Stata and can be loaded using the following code:
sysuse auto.dta
To get the regression output, use the following code in Stata:
reg price mpg trunk weight length turn displacement foreign
Interpreting Multiple Regression Statistics
There are three primary parts of the regression output in Stata: the ANOVA statistics, the overall model fitness measures, and the parameter estimates. I have explained all of these in the following paragraphs.
ANOVA Table
The ANOVA table presents the primary statistics about the regression model. This section is often skipped when explaining regression output because its numbers are used to generate other statistics, such as the F-statistic and R-squared, which Stata calculates automatically in the regression output.
The SS means Sum of Squares. It tells us how much of the variation in the dependent variable (the price of the automobile) the model captures; in other words, how much the 7 independent variables jointly predict the price. The model SS shows the variation explained by the model, while the residual SS shows how much variation in the dependent variable the model leaves unexplained.
The df means degrees of freedom. The model df is 7 because 7 independent variables are used to explain the variation in the price of the automobile. The residual df is the number of observations minus the number of estimated parameters (74 - 8 = 66, counting the constant). There may, of course, be other variables that also explain the variation in price.
The MS means Mean Square, i.e., each sum of squares divided by its degrees of freedom. It is used to calculate the F-statistic for the model. However, Stata automatically calculates the F-statistic for us in the model fitness statistics.
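If you want to verify these ANOVA numbers yourself, regress stores them in e() after it runs. The following lines are a minimal sketch (run them immediately after the regression command above):
* sums of squares and mean squares stored by regress
display "Model SS    = " e(mss)
display "Residual SS = " e(rss)
display "Model MS    = " e(mss)/e(df_m)
display "Residual MS = " e(rss)/e(df_r)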
Model Fitness
The model fitness measures estimate the overall goodness of fit of the multiple regression model. They tell us in clear terms how much of the change in the dependent variable our model explains, i.e., how much these 7 independent variables jointly affect the dependent variable.
The Number of obs (observations) shows how many observations or data records were used in the model. In our case, 74 observations were used.
F-statistic value
The F(7, 66) value is 12.73. It tells us whether the model jointly explains the variation in the dependent variable. We have included 7 independent variables in our model: do they jointly predict or affect the price? The F-statistic tests the joint significance of these variables. We can calculate the F-statistic as the model mean square (52124262.4) divided by the residual mean square (4093872.11), which gives 12.73.
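As a quick check, you can reproduce this division in Stata with display, either typing the numbers from the ANOVA table or using the results stored by regress:
* F-statistic = model mean square / residual mean square
display 52124262.4/4093872.11
display (e(mss)/e(df_m)) / (e(rss)/e(df_r))
* the value Stata stores directly
display e(F)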
Note: The F-statistic only tells us about the joint significance of the independent variables. To check the individual significance of each independent variable, we must go down to the parameter estimates of the model, which I have explained below.
Now that we have calculated the F-statistic, how do we know whether it is significant? To answer this, we formulate two hypotheses: a null hypothesis and an alternative hypothesis. The null hypothesis is that all coefficients of the independent variables are equal to 0. The alternative hypothesis is that at least one of these coefficients is not equal to 0. The probability value of the F-statistic answers this question.
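The same joint hypothesis can be tested directly in Stata with the test command after the regression; it reproduces the F-statistic and its p-value:
* test that all 7 coefficients are jointly zero
test mpg trunk weight length turn displacement foreign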
Probability value of the F-statistic
The Prob > F value is 0.000. This is less than 0.05, which means the F-statistic of 12.73 explained earlier is significant at the 5% level. Since the probability value is significant, we reject the null hypothesis in favour of the alternative. This means that the 7 variables' coefficients are not all equal to 0: jointly, they truly affect the dependent variable.
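If you want to recover this probability from the F-statistic yourself, Stata's Ftail() function gives the upper-tail probability of the F distribution:
* Prob > F for an F(7, 66) statistic of 12.73
display Ftail(7, 66, 12.73)
* or using the results stored by regress
display Ftail(e(df_m), e(df_r), e(F))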
Note: Three significance levels are commonly used in statistics: 1%, 5%, and 10%. Most researchers use the 5% level. The significance level expresses how much risk of a wrong conclusion you are willing to accept; a 5% significance level corresponds to 95% confidence. Remember, nothing is 100% certain in statistics.
R-squared
The next measure is R-squared. This is an important number when interpreting regression results in Stata because it tells us how well the regression model fits. The value is a proportion, usually read as a percentage. In our case, the R-squared value is 0.5754, or 57.54%. It tells us that our 7 independent variables jointly explain 57.54% of the variation in the dependent variable (price). The remaining 42.46% is due to variables we did not include in the model, which may also be relevant in explaining car prices. As a rough guide, values between roughly 20% and 70-80% are usually considered reasonable. If your model's R-squared is 100%, it almost certainly signals a mistake or a data problem, because nothing in statistics is explained perfectly.
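R-squared is simply the model sum of squares divided by the total sum of squares, so you can confirm the 0.5754 from the stored results:
* R-squared = Model SS / Total SS
display e(mss) / (e(mss) + e(rss))
* the value Stata stores directly
display e(r2)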
Adjusted R-squared
The next item under model fitness is Adjusted R-squared, a modified version of R-squared. When you add independent variables to the model, each one raises (or at least never lowers) the R-squared, even if it adds little real explanatory power. Adjusted R-squared corrects for this by penalising the R-squared for the number of independent variables relative to the sample size, giving a more robust measure of model fitness. A much lower adjusted R-squared means the additional variables are not adding value to the model; a value close to the R-squared means they are.
Tip: Adjusted R-squared will always be lower than the R-squared itself.
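Behind the scenes, adjusted R-squared applies a penalty based on the sample size and the number of predictors. A small sketch to reproduce Stata's value from the stored results:
* Adjusted R-squared = 1 - (1 - R^2)*(n - 1)/(n - k - 1)
display 1 - (1 - e(r2)) * (e(N) - 1) / e(df_r)
* the value Stata stores directly
display e(r2_a)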
The Root MSE is the standard deviation of the error term of the regression model. It is the square root of the residual MS (mean square) explained earlier in the ANOVA table.
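You can confirm this relationship directly from the stored results:
* Root MSE = square root of the residual mean square
display sqrt(e(rss)/e(df_r))
* the value Stata stores directly
display e(rmse)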
Parameter Estimates
Now we come to the main part of the multiple regression output. This is the crux of interpreting regression results in Stata. There are seven columns in the primary regression output, starting from price and ending with the 95% confidence interval. The first column is headed price, the name of the dependent variable; below it, all the independent variables are listed. The next column contains the coefficient values, which tell us the direction and strength of the relationship between each independent variable and the dependent variable. Then comes Std. Err., the standard error of each coefficient, followed by t, the t-statistic, and then the p-value column, which lists the probability value for each independent variable. The last two columns give the 95% confidence interval. I will explain all of these and their meanings in the following paragraphs.
Regression Coefficients
Coefficients Signs
The highlighted part of the output above shows the coefficient value for each independent variable. The positive or negative sign of the value indicates the nature of the relationship between the two variables. For example, if the coefficient has a positive sign, we say it has a positive relationship with the dependent variable (price in our case): if this variable's value increases, the dependent variable's value will also increase. If the coefficient has a negative sign, there is a negative relationship between that independent variable and the dependent variable.
Coefficient Values
Apart from the sign, the value of the coefficient tells us how much change (increase or decrease) in the dependent variable is associated with a one-unit change in that independent variable, holding the other variables constant. For example, in the model above, the mpg coefficient is -23.8799. First, we check the sign, which is negative: there is a negative relationship between mpg and price, so when mpg increases, price decreases, and vice versa. Second, the magnitude of the coefficient is 23.8799. It means that if mpg increases by one unit, the price decreases by 23.8799 units; if price is in dollars, the price decreases by 23.8799 dollars. If mpg decreases by one unit, the price increases by the same amount.
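To see this one-unit effect concretely, you can pull the stored coefficient and (in Stata 11 or newer, where margins is available) compare predicted prices at two illustrative mpg values; 20 and 21 are arbitrary choices, and because the model is linear the difference between the two predictions equals the mpg coefficient:
* the stored mpg coefficient (run after the regression)
display _b[mpg]
* predicted price at mpg = 20 versus mpg = 21, other variables as observed
margins, at(mpg=(20 21))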
Likewise, we can read off the relationship and the amount of change each independent variable produces in the dependent variable from the statistics above. However, one thing we need to keep in mind when interpreting coefficient values is the p-value and the t-statistic, which are explained below.
Standard Error
The column after the coefficients is Std. Err. (standard error). The standard error serves two purposes. First, it is used to construct the t-statistic, which Stata automatically calculates for us in the next column, t. Second, it is used to construct the confidence intervals, which Stata also calculates automatically; these appear in the last two columns of the regression output.
The standard error also tells us how much the estimated coefficient may deviate: the true value may be somewhat above or below the estimate. In brief, the lower the standard error, the more precise the estimate, which is better for the regression model.
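For example, the t-statistic in the next column is just the coefficient divided by its standard error, which you can reproduce for the weight variable:
* t-statistic = coefficient / standard error
display _b[weight] / _se[weight]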
P-values and t-statistics
The t-statistic and p-value (probability value) are the key statistics to note when interpreting regression results in Stata, because they tell us whether a particular variable affects the dependent variable in clear yes-or-no terms. If these two values are significant, we can say the variable affects the dependent variable, and its coefficient is meaningful. However, if the t-statistic and p-value are insignificant, the coefficient value has no meaning in the model: that variable is not affecting the dependent variable.
Rule for t-statistics and p-values
The rules for t-statistics and p-values are quite simple. As I explained earlier, there are three significance levels commonly used for accepting or rejecting a hypothesis. At the 5% level, if the absolute t-statistic is greater than about 1.96, the variable is significant; if it is lower than 1.96, it is insignificant.
Tip: We ignore the positive or negative sign when deciding whether a t-statistic is significant. Therefore, drop the negative sign and then judge whether the value is lower or greater than 1.96.
The rule for the p-value (probability value) at the 5% level is that it should be lower than 0.05 (5%). If it is greater than 0.05, the variable is insignificant. The p-value ranges from 0 to 1 (0% to 100%).
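The 1.96 cutoff is a rule of thumb taken from the normal distribution; the exact critical value for this model's 66 residual degrees of freedom can be obtained with invttail():
* two-sided 5% critical value of the t distribution with 66 degrees of freedom
display invttail(66, 0.025)
* or using the stored residual degrees of freedom
display invttail(e(df_r), 0.025)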
Model’s t-statistics
When we look at our model above, we can see that only two variables are significant: weight and foreign. The weight variable's t-statistic is 4.03, which is greater than 1.96, so it is significant. The foreign variable's t-statistic is 4.99, which is also greater than 1.96, so it is also significant. All other variables' t-statistics are lower than 1.96 (ignoring the negative sign), therefore they are insignificant.
Model’s p-values
The significance or insignificance of the variables is further verified through the p-values. The p-value of the weight variable is 0.000, which is less than 0.05, indicating that weight is significant. The foreign variable's p-value is also 0.000, which is lower than 0.05, so it too is significant.
To sum up the whole model, we find that two of our variables significantly affect the price variable, while the other five variables are insignificant and show no detectable effect on price.
Tip: When a variable becomes insignificant, its coefficient value has no meaning.
How much do the independent variables affect the dependent variable?
Now comes the last question when interpreting regression output in Stata: how much do the weight and foreign variables affect the price variable? The coefficient values of these two variables tell us this. The coefficient of the weight variable is 4.9116. The sign is positive, so there is a positive relationship between weight and price. When weight increases by 1 unit, the price increases by 4.9116 units (or dollars, if price is in dollars). If weight decreases by 1 unit, the price decreases by 4.9116 units. This is the crux of the regression model.
Similarly, the foreign variable's coefficient is 3518.809, also with a positive sign, so there is a positive relationship between foreign and price. Since foreign is a dummy variable (1 = foreign, 0 = domestic), this means a foreign vehicle is priced 3518.809 units (or dollars, if price is in dollars) higher than a comparable domestic vehicle, holding the other variables constant.
95% Confidence Interval
As I explained earlier, the confidence interval is constructed from the coefficient value and its standard error. It gives a lower and an upper bound within which the true value of the coefficient is likely to lie. Remember, we cannot be 100% sure in statistics, so there is always room for deviation in the final value; this interval quantifies that, telling us how far above or below the estimate the true value may plausibly be.
The weight variable's lower bound is 2.477 and its upper bound is 7.345. This tells us that, although the estimated coefficient is 4.911, the true effect of weight is likely to lie somewhere between these two values. Likewise, the foreign variable's lower bound is 2111.311 and its upper bound is 4926.306, so the true effect of foreign likely lies between these values. The other variables' intervals are not relevant because those variables are insignificant; their coefficients have no meaning, and so their 95% confidence intervals do not matter in the regression model either.
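These bounds come directly from the coefficient, its standard error, and the critical t value; a short sketch reproducing the interval for weight:
* 95% confidence interval = coefficient +/- critical t * standard error
display _b[weight] - invttail(e(df_r), 0.025) * _se[weight]
display _b[weight] + invttail(e(df_r), 0.025) * _se[weight]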
Conclusion
This article sums up the interpretation of a regression model in Stata. I have presented a stepwise example of a multiple regression model and interpreted its statistics. There are numerous things to look at in regression output: the R-squared, adjusted R-squared, t-statistics and p-values, and lastly the coefficient values are the main ones when interpreting regression results. Together, these values tell the whole story of the relationship between the dependent and independent variables.