Multinomial Logistic Regression in Stata is a statistical method used to analyze the relationship between a categorical dependent variable with three or more categories and one or more independent variables. In binary logistic regression, by contrast, the dependent variable has only two outcomes. This method is particularly useful when you want to examine the effect of several variables on a single outcome that can take more than two possible values.
For example, you might use Multinomial Logistic Regression to examine the factors that influence a person’s decision to vote for one of three political parties, or to predict the likelihood that a patient with a certain set of symptoms will be diagnosed with one of several possible diseases.
Multinomial Logistic Regression is a powerful tool for analyzing complex relationships between variables. By following the steps outlined in this blog post, you will be able to use Stata to perform Multinomial Logistic Regression and draw meaningful conclusions from your data.
Dataset used for the Analysis
For this blog post, I will be using the “auto” dataset that comes with Stata as an example for Multinomial Logistic Regression analysis. This dataset contains information on the make, model, and performance of various cars.
To load the “auto” dataset in Stata, you can type the following command in the command window:
sysuse auto
This will load the dataset into Stata’s memory.
The “auto” dataset contains 74 observations of 12 variables, including the car’s make and model, the car’s price, the car’s weight, and the car’s miles per gallon (MPG) rating.
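Before modelling, it can help to look at the variables directly. The following is a minimal sketch that uses only built-in Stata commands; the choice of which variables to summarize is an assumption based on the ones used later in this post.

```stata
* Load the example dataset, replacing anything currently in memory
sysuse auto, clear

* List variable names, storage types, and labels
describe

* Summary statistics for the variables used in this post
summarize price weight mpg

* Report which variables, if any, contain missing values
misstable summarize
```

Running these commands first is an easy way to confirm the number of observations and to see whether any of the model variables have missing values.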
For the purposes of this example, I will be using the car’s MPG rating as the dependent variable and the car’s weight and price as the independent variables.
To prepare the dataset for analysis, I will first remove any missing data using the following command:
drop if missing(price, weight, mpg)
Next, I will create a categorical variable for the MPG rating using the following command:
gen mpgcat = .
replace mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28
This will create a new variable called “mpgcat” with three categories: cars with an MPG rating below 18 (category 1), cars with an MPG rating of at least 18 but below 28 (category 2), and cars with an MPG rating of 28 or more (category 3).
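It is worth checking that the new variable looks as expected. The sketch below re-creates mpgcat so that it runs on its own, attaches value labels, and tabulates the result. The label name mpgcat_lbl and the label text are illustrative assumptions, not part of the original example.

```stata
* Re-create the three-level outcome so this block is self-contained
sysuse auto, clear
gen mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28 & !missing(mpg)

* Attach readable labels (the label name mpgcat_lbl is hypothetical)
label define mpgcat_lbl 1 "Under 18 mpg" 2 "18 to under 28 mpg" 3 "28 mpg or more"
label values mpgcat mpgcat_lbl

* Count the cars in each category, showing missing values if any remain
tabulate mpgcat, missing
```

The extra condition !missing(mpg) is a precaution: Stata treats missing values as larger than any number, so without it a missing mpg would fall into category 3.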
With the dataset cleaned and prepared, we can now move on to running the Multinomial Logistic Regression model.
What is Multinomial Logistic Regression
Multinomial Logistic Regression is a type of regression analysis used to model relationships between a categorical dependent variable with three or more categories and one or more independent variables.
The theory behind Multinomial Logistic Regression is based on the concept of maximum likelihood estimation. Maximum likelihood estimation is a statistical method that seeks to find the parameters of a model that maximize the likelihood of the observed data.
In the case of Multinomial Logistic Regression, the goal is to find the set of coefficients for the independent variables that maximizes the likelihood of observing the observed values of the dependent variable.
The Multinomial Logistic Regression model estimates the probability of each category of the dependent variable given the values of the independent variables. These probabilities are modeled using a generalization of the logistic function: one set of coefficients is estimated for each category other than the base category, and the predicted probabilities across all categories sum to one.
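Written out, the model takes the following form. The notation is an assumption introduced here for illustration: x is the vector of independent variables (including a constant), β_j is the coefficient vector for category j, J is the number of categories, and b is the base category, whose coefficients are fixed at zero.

```latex
\Pr(y = j \mid x) = \frac{\exp(x\beta_j)}{\sum_{k=1}^{J} \exp(x\beta_k)},
\qquad \beta_b = 0,
\qquad\text{so that}\qquad
\ln\frac{\Pr(y = j \mid x)}{\Pr(y = b \mid x)} = x\beta_j .
```

The last expression shows why each coefficient is read as a change in the log-odds of category j relative to the base category.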
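Stata provides separate commands for categorical dependent variables with more than two categories, including “mlogit” and “ologit”. The “mlogit” command is used for unordered categorical dependent variables, while the “ologit” command (ordered logistic regression) is used for ordered categorical dependent variables. The “clogit” command, despite the similar name, fits conditional logistic regression for matched or grouped data and is not an ordered model.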
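The unordered and ordered commands share the same basic syntax, and Stata’s command for an ordered outcome is ologit. The sketch below fits both to the three-level mpgcat variable from the dataset section, re-created here so the block runs on its own. Using weight as the only predictor is an assumption made to keep the comparison short.

```stata
* Re-create the three-level outcome so this block is self-contained
sysuse auto, clear
gen mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28 & !missing(mpg)

* Unordered model: a separate set of coefficients for each non-base category
mlogit mpgcat weight

* Ordered model: a single set of coefficients plus estimated cutpoints
ologit mpgcat weight
```

Because mpgcat is built from cut-offs on a numeric variable, it has a natural order, so the ordered model is a reasonable alternative to consider alongside the multinomial one.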
Interpreting the coefficients of a Multinomial Logistic Regression model can be complex, because each coefficient represents the change in the log-odds of being in a given category of the dependent variable, relative to the base category, associated with a one-unit change in the independent variable, holding the other variables constant.
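Because log-odds are hard to read directly, it is common to exponentiate the coefficients. The sketch below uses the rrr option of mlogit, which reports relative-risk ratios. It assumes the mpgcat and lnprice variables used elsewhere in this post and re-creates them so that the block runs on its own.

```stata
* Re-create the example variables so this block is self-contained
sysuse auto, clear
gen mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28 & !missing(mpg)
gen lnprice = log(price)

* rrr displays exponentiated coefficients (relative-risk ratios)
mlogit mpgcat weight lnprice, baseoutcome(3) rrr
```

A ratio above 1 means that a one-unit increase in the variable raises the odds of that category relative to the base category, and a ratio below 1 means that it lowers them.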
Running the Multinomial Logistic Regression Model in Stata
To specify and run a Multinomial Logistic Regression model in Stata, you use the “mlogit” command, which is designed for unordered categorical dependent variables. For ordered categorical dependent variables, the corresponding command is “ologit”.
Both commands require you to specify the dependent variable and the independent variables in the model. In addition, the “mlogit” command lets you choose the base category for the dependent variable with the “baseoutcome()” option, which can be abbreviated to “base()”. If you omit this option, Stata uses the most frequent category as the base.
Here is example code for running a Multinomial Logistic Regression model on the “auto” dataset prepared in the dataset section above, using the “mlogit” command:
gen lnprice = log(price)
mlogit mpgcat weight lnprice, base(3)
This code specifies a Multinomial Logistic Regression model with “mpgcat” as the dependent variable and “weight” and “lnprice” as the independent variables. The “base(3)” option specifies that the third category of the dependent variable (cars with an MPG rating greater than or equal to 28) is the base category.
Interpret the Multinomial Logistic Regression results
The above code will generate the following output:
When interpreting the results of a Multinomial Logistic Regression analysis, we typically look at the coefficients and, once they are exponentiated, the relative-risk ratios for each independent variable. These describe how each independent variable is related to the odds of being in a given category of the dependent variable rather than in the base category.
In our example with the “auto” dataset, I ran a Multinomial Logistic Regression analysis with “mpgcat” as the dependent variable and “weight” and “lnprice” as the independent variables. The model was run using the “mlogit” command and the third category of the dependent variable (cars with an MPG rating greater than or equal to 28) was set as the base category.
The output reports the results of the multinomial logistic regression model, which treats mpgcat as an unordered categorical dependent variable, with weight and lnprice as the independent variables. It shows the coefficients, standard errors, z-statistics, p-values, and 95% confidence intervals for each independent variable in each non-base category of the dependent variable.
The first section of the output shows the iteration log of the maximum likelihood estimation algorithm. The log likelihood increases at each iteration, and the algorithm converges when it stops changing by more than a small tolerance.
The second section of the output shows the summary statistics of the model, including the number of observations, the likelihood ratio chi-squared statistic, the p-value for the chi-squared test, the log-likelihood value of the model, and the pseudo R-squared value.
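The summary statistics in the header are also stored by Stata after estimation, so they can be retrieved and checked by hand. The sketch below assumes the same model as above and re-creates the variables so that it runs on its own.

```stata
* Re-create the example variables so this block is self-contained
sysuse auto, clear
gen mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28 & !missing(mpg)
gen lnprice = log(price)

* Fit the model without printing the output table
quietly mlogit mpgcat weight lnprice, baseoutcome(3)

* Header statistics as stored in e()
display "Observations:     " e(N)
display "LR chi-squared:   " e(chi2)
display "Prob > chi2:      " e(p)
display "Log likelihood:   " e(ll)
display "Pseudo R-squared: " e(r2_p)

* The same two statistics computed from the fitted and constant-only log likelihoods
display "LR chi-squared by hand:   " 2*(e(ll) - e(ll_0))
display "Pseudo R-squared by hand: " 1 - e(ll)/e(ll_0)
```

The hand calculations make the definitions explicit: the likelihood ratio statistic compares the fitted model with a constant-only model, and the pseudo R-squared is one minus the ratio of their log likelihoods.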
The third section of the output shows the coefficients for each independent variable in each category of the dependent variable. For example, the coefficient for weight in category “1” is 0.006, indicating that a one-pound increase in weight is associated with a 0.006 increase in the log-odds of being in category “1” versus the baseline category (category “3”), holding lnprice constant. The z-statistic for this coefficient is 3.89, which exceeds the critical value of 1.96, so the coefficient is statistically significant at the 0.05 level. The confidence interval for this coefficient ranges from 0.003 to 0.009, indicating that we can be 95% confident that the true coefficient lies within this range.
The coefficients for lnprice and the intercept (_cons) can be interpreted in a similar way. The baseline category is “3”, which means that the coefficients for category “3” are not shown in the output.
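Since log-odds coefficients are relative to the base category, predicted probabilities are often easier to communicate. The sketch below assumes the same model as above; the variable names p_low, p_mid, and p_high and the weight values 2000, 3000, and 4000 are illustrative choices, not part of the original example.

```stata
* Re-create the example variables so this block is self-contained
sysuse auto, clear
gen mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28 & !missing(mpg)
gen lnprice = log(price)

quietly mlogit mpgcat weight lnprice, baseoutcome(3)

* Predicted probability of each category for every car (hypothetical names)
predict p_low p_mid p_high, pr
summarize p_low p_mid p_high

* Predicted probability of category 1 at selected weights,
* with lnprice held at its mean
margins, at(weight=(2000 3000 4000)) atmeans predict(outcome(1))
```

For each car the three predicted probabilities sum to one, including the probability for the base category, even though that category has no coefficients in the output.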
Common diagnostic measures (e.g., residuals, deviance, and likelihood ratio tests)
There are several diagnostic measures that can be used to assess the goodness of fit of a Multinomial Logistic Regression model, including residuals, deviance, and likelihood ratio tests.
Residuals are the differences between the predicted values and the observed values for each observation in the dataset. They can be used to identify any patterns or trends in the model’s predictions that may indicate problems with the model.
Deviance measures the discrepancy between the fitted model and the observed data and is based on -2 times the model’s log likelihood. A smaller deviance value indicates a better fit of the model to the data.
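One way to obtain this quantity in Stata is from the stored log likelihood. The sketch below assumes the same model as above and treats -2 times the log likelihood as the deviance, which is the usual convention when each observation is a single case.

```stata
* Re-create the example variables so this block is self-contained
sysuse auto, clear
gen mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28 & !missing(mpg)
gen lnprice = log(price)

quietly mlogit mpgcat weight lnprice, baseoutcome(3)

* Deviance computed as -2 times the model log likelihood
display "Deviance: " -2*e(ll)

* Information criteria (AIC and BIC) for comparing competing models
estat ic
```

The deviance is mainly useful for comparing models fitted to the same data, and AIC and BIC serve the same purpose while penalizing additional parameters.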
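Likelihood ratio tests compare the fit of the model to a nested alternative, such as a null model that has no independent variables. A significant likelihood ratio test indicates that the fuller model fits the data significantly better than the restricted one. A non-significant test suggests that the additional variables contribute little and that the model may need to be revised.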
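A likelihood ratio test between two nested models can be run with lrtest. The sketch below tests whether lnprice adds anything beyond weight. The stored-estimate names full and reduced are hypothetical, and the choice of which variable to drop is an assumption made for illustration.

```stata
* Re-create the example variables so this block is self-contained
sysuse auto, clear
gen mpgcat = 1 if mpg < 18
replace mpgcat = 2 if mpg >= 18 & mpg < 28
replace mpgcat = 3 if mpg >= 28 & !missing(mpg)
gen lnprice = log(price)

* Fit and store the full model (the name "full" is hypothetical)
quietly mlogit mpgcat weight lnprice, baseoutcome(3)
estimates store full

* Fit and store the restricted model without lnprice
quietly mlogit mpgcat weight, baseoutcome(3)
estimates store reduced

* Likelihood ratio test of the restricted model against the full model
lrtest full reduced
```

A small p-value here would suggest that lnprice improves the fit, and a large one would suggest that the simpler model is adequate.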
Conclusion
In conclusion, Multinomial Logistic Regression in Stata is a powerful statistical method for analyzing categorical data with more than two categories. In this blog post, I discussed the importance of using Stata for Multinomial Logistic Regression analysis and provided a step-by-step guide for preparing the data, specifying the model, interpreting the results, and assessing the goodness of fit.
By using Stata to conduct Multinomial Logistic Regression analysis, researchers and analysts can gain valuable insights into the relationship between categorical variables and make informed decisions based on these insights. I encourage readers to use Stata for Multinomial Logistic Regression analysis and to continue exploring the many features and capabilities of this powerful statistical software package.