This post walks through a step-by-step approach to converting string variables to numeric variables in Stata. Most of Stata's statistical commands require numeric variables. Sometimes, however, the data contain non-numeric characters. This guide explains how to deal with such situations.
Your data may fall into one of three broad scenarios:
- The values are numeric, but Stata stores the variable as a string
- The values are numeric, but some cells contain non-numeric characters
- The values consist entirely of text
Example Data Description
I generated the example data above to keep things simple. You may run into the same string-variable issues in your own data. In Stata's Data Browser, red text means the variable is stored as a string rather than as a numeric type (such as integer or float). In the example data, the id variable contains only numeric characters but is still red, which means its storage type is string.
The gender and country variables contain text, with two distinct values each. In the income variable, every value is numeric except for one cell containing the character “X”. In math, every value is numeric, yet the variable still shows as a string. In physics, one value is missing and one is “.”. So how do we make these data usable for analysis?
We can also check the variable types with the describe command. Look at the storage type column next to each variable: all of the variables are stored as strings.
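To follow along without the original file, a small stand-in dataset can be typed in directly. This is only a sketch: every value below, including the country names, is invented for illustration and is not the author's data. It mirrors the situation described above, where all six variables are stored as strings.

```stata
* Hypothetical toy data: all values are invented for illustration
clear
input str3 id str6 gender str8 country str6 income str3 math str3 physics
"1" "Male"   "Pakistan" "5000" "80" "75"
"2" "Female" "India"    "X"    "65" ""
"3" "Male"   "Pakistan" "7200" "90" "."
end

* The storage type column should show a str# type for every variable
describe
```

Because each variable is declared with a str# type in input, describe reports all of them as strings even though id, math and most of income look numeric.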
There are two ways to convert string variables to numeric variables in Stata: the destring command and the real() function. We will cover destring first and real() at the end.
Solution: Data is numeric but coded as string type
This applies when your data are numeric but still show as red when you browse them. You can also inspect the original data in your Stata window using the list command, as follows:
list in 1/10
This shows the first 10 rows in your Stata window. We can see that the id and math variables contain numeric values but are still stored as strings. To change them to numeric form, use the destring command with the replace option.
Stata then prints a message for each variable, showing the progress of the conversion. For three variables, the status reads “all characters numeric; replaced as byte”, which means those variables were successfully converted to numeric form.
Note: destring must be used with one of two options, replace or generate(). With generate(), destring creates new variables and leaves the originals alone; with replace, it overwrites the existing variables with their numeric versions.
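The difference between the two options can be sketched on a tiny dataset. The values and the new variable name math_num are hypothetical, chosen only for this illustration.

```stata
* Hypothetical toy data: numeric values stored as strings
clear
input str3 id str3 math
"1" "80"
"2" "65"
"3" "90"
end

* generate() keeps the string original and adds a numeric copy
destring math, generate(math_num)

* replace overwrites the string variable in place
destring id, replace

* id and math_num are now numeric; math is still a string
describe
```

Using generate() is the more cautious choice while you are still checking the conversion, since the original string variable stays available for comparison.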
We can also confirm in the Data Browser that the three variables are now shown in black.
Solution: contains nonnumeric characters; no replace
We often run into the message “contains nonnumeric characters; no replace”, which means destring left the variable unchanged. When the variable holds text categories, we use the encode command instead. In our dataset (shown above), gender and country are still string variables. (income is also still a string, but it needs a different method.)
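A short sketch of how this message comes about, using invented values: when destring is run without a variable list, it tries every string variable, converts the ones that are purely numeric, and skips the rest.

```stata
* Hypothetical toy data: one text variable, one numeric-looking variable
clear
input str6 gender str3 math
"Male"   "80"
"Female" "65"
end

* With no varlist, destring attempts every string variable
destring, replace

* math is now numeric; gender is left untouched as a string
describe
```

The message for gender is a note rather than a fatal error, so the rest of the do-file keeps running.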
encode in Stata
To convert the gender and country string variables to numeric variables, run the following commands one at a time:
encode gender, generate(gender2)
encode country, generate(country2)
This generates two new variables. In the Data Browser, the values of these variables appear in blue. The encode command turns each distinct string value into a value label and automatically assigns a numeric code to each label.
We can confirm this by using the nolabel option with list, as follows:
list in 1/10, nolabel
The nolabel option hides the value labels and shows only the underlying numbers. Note also that encode always generates new variables, so the original string variables are still present in the listing above.
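The mapping between labels and numbers can be inspected directly. The sketch below uses invented values; it relies on encode assigning codes in alphabetical order of the strings and, when no label() option is given, naming the value label after the new variable.

```stata
* Hypothetical toy data
clear
input str6 gender
"Male"
"Female"
"Male"
end

encode gender, generate(gender2)

* Show the value label that encode created (named gender2 by default)
label list gender2

* Compare the original strings with the underlying numeric codes
list gender gender2, nolabel
```

Because the codes depend on alphabetical order, adding a new category later can shift the numbers, so it is worth checking label list before relying on a specific code.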
Remove characters from string
This is the last scenario we may encounter in Stata: the data are numeric, but some cells contain non-numeric characters. At this point only income is still a string, because of the “X” in it; all the other variables have been converted to numeric form. So how do we remove it?
We again use the destring command, this time with the ignore() option, to remove this character:
destring income, replace ignore("X")
Now the “X” cell shows a missing value, and the whole variable has been converted to numeric form. Strictly speaking, ignore() removes the listed characters from each value before converting it, so a cell that contained only “X” is left empty and therefore becomes missing. In this way, we can strip unwanted characters from a string variable and convert it to numeric form.
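The behaviour of ignore() is easiest to see next to destring's force option, which is not used in this post but is a common alternative. In the sketch below the values are invented, and the third row ("72X00") is a hypothetical case added to show where the two options differ.

```stata
* Hypothetical toy data: one clean value, one lone "X", one value with an X inside
clear
input str6 income
"5000"
"X"
"72X00"
end

* ignore("X") strips every X, then converts whatever is left
destring income, generate(income_ignore) ignore("X")

* force converts any value containing nonnumeric characters to missing
destring income, generate(income_force) force

list
```

With ignore("X") the third row becomes 7200, while force sets it to missing. Which one is appropriate depends on whether the stray character is noise inside a valid number or a marker for an invalid entry.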
real() function to change string variable to numeric:
We can also use the real() function to convert a string variable to a numeric variable. However, this approach is less popular among researchers and data analysts, because it must be applied to each variable separately (whereas destring can convert all string variables in a single command). Further, it only converts values that are valid numbers stored as strings; any other value becomes missing. To convert a string variable to a numeric variable with real(), use the following command:
generate id_new = real(id)
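A small sketch of what real() does with values it cannot read as numbers. The data are invented, and the "abc" row is a hypothetical bad entry added for illustration.

```stata
* Hypothetical toy data: two numeric strings and one nonnumeric string
clear
input str4 id
"1"
"2"
"abc"
end

* real() returns missing for any string it cannot interpret as a number
generate id_new = real(id)

list
```

Unlike destring, real() does not print a per-variable message about nonnumeric characters, so it is worth counting the missing values in the new variable after the conversion.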
There are three kinds of string-variable problems we encounter in Stata. First, the data are numeric but the variable is stored as a string. Second, the data are genuinely text rather than numbers. Third, the data are a mix of numbers and other characters. The destring command handles the first and third cases, and encode handles the second. We can also use the real() function; however, it is less popular because destring is easier to use.