What are the limitations of using a dataset when data are NOT missing at random (MNAR)? Can you still publish a paper using a dataset in this condition?
When the data in a dataset are Missing Not At Random (MNAR), there are two possible reasons for the missing values.
1. The missingness depends on the unobserved value itself. For example, people with high salaries generally do not reveal their incomes in surveys.
2. The missingness depends on the value of some other variable. For example, underage smokers generally do not reveal their smoking habits. Here, the missing values in the smoking variable are affected by the age variable.
The limitation in these scenarios is that removing the observations with missing values can bias any model fitted to the dataset, which is why it is very uncommon to publish a paper using this type of dataset.
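To make the bias concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the lognormal income distribution, the function name simulate_mnar_bias, and the rule that above-median earners withhold their income 70% of the time while everyone else withholds it 10% of the time.

```python
import random
import statistics

def simulate_mnar_bias(n=50_000, seed=0):
    """Compare the true mean income with the mean of the reported incomes only."""
    rng = random.Random(seed)
    # Hypothetical population of incomes (lognormal is an illustrative choice).
    incomes = [rng.lognormvariate(10.5, 0.5) for _ in range(n)]
    cutoff = statistics.median(incomes)
    observed = []
    for income in incomes:
        # Assumed MNAR mechanism: the chance of a value going missing
        # depends on the value itself (high earners answer less often).
        p_missing = 0.7 if income > cutoff else 0.1
        if rng.random() >= p_missing:
            observed.append(income)
    frac_missing = 1 - len(observed) / n
    return statistics.fmean(incomes), statistics.fmean(observed), frac_missing

true_mean, complete_case_mean, frac_missing = simulate_mnar_bias()
print(f"true mean:          {true_mean:,.0f}")
print(f"complete-case mean: {complete_case_mean:,.0f}")
print(f"fraction missing:   {frac_missing:.2f}")
```

Because the high incomes are the ones most likely to be missing, the mean computed after dropping incomplete rows falls below the true mean, and collecting more data with the same mechanism would not remove that gap.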
If you still wish to proceed, you can opt for one of the following methods.
1. Try to obtain the missing data - This is possible during the data collection phase in survey-like situations, where you can reach out to the source again to obtain the data. In a real-world scenario, however, this is rarely feasible.
2. Dropping the variable - If too many values are missing for a variable (as in the case of underage smokers), you can drop the variable altogether or restrict the analysis (for example, by considering only the legal-age group). This should be the last option, and you need to check whether the performance of the model has improved after deleting the variable.
3. Using multiple imputation - This is a sophisticated and widely used approach to the missing value problem, in which software generates several plausible values for each missing entry. It is feasible if you can find a correlation between the variable with missing values and a related variable.
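The multiple imputation idea from the list above can be sketched as follows. This is a simplified illustration, not a replacement for a statistics package: the helper names fit_line and multiply_impute_mean, the simulated data, and the missingness rule are all hypothetical. The sketch assumes a single fully observed related variable x that explains why y is missing; if the missingness also depends on the hidden y values themselves, the imputed values can remain biased and a sensitivity analysis would be needed.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def multiply_impute_mean(x, y, m=20, seed=0):
    """Estimate mean(y) when y contains None, imputing from the observed x."""
    rng = random.Random(seed)
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Bootstrap the complete cases so each imputed dataset also
        # reflects uncertainty in the regression parameters.
        boot = [rng.choice(complete) for _ in complete]
        a, b, sd = fit_line([p[0] for p in boot], [p[1] for p in boot])
        # Fill each gap with a prediction plus random noise: a plausible
        # value rather than a single best guess.
        filled = [yi if yi is not None else a + b * xi + rng.gauss(0, sd)
                  for xi, yi in zip(x, y)]
        estimates.append(statistics.fmean(filled))
        variances.append(statistics.variance(filled) / len(filled))
    # Rubin's rules: pool the m estimates and combine the within- and
    # between-imputation variances.
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    total_var = within + (1 + 1 / m) * between
    return pooled, total_var ** 0.5

# Simulated example: y depends linearly on x (illustrative assumption).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(2000)]
y_full = [2 + 3 * xi + rng.gauss(0, 1) for xi in x]
# Hypothetical mechanism: y is hidden for 60% of the records with x > 0.
y = [None if (xi > 0 and rng.random() < 0.6) else yi
     for xi, yi in zip(x, y_full)]

true_mean = statistics.fmean(y_full)
complete_case_mean = statistics.fmean([v for v in y if v is not None])
pooled_mean, pooled_se = multiply_impute_mean(x, y)
print(f"true mean:          {true_mean:.2f}")
print(f"complete-case mean: {complete_case_mean:.2f}")
print(f"pooled MI mean:     {pooled_mean:.2f} (SE {pooled_se:.2f})")
```

In this setup the pooled estimate lands much closer to the true mean than the complete-case mean does, because the related variable carries the information about which values went missing. The pooled standard error is also wider than a single filled-in dataset would suggest, which is the point of imputing several times.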
Publishing a paper using a dataset with data missing not at random is very difficult in the real world. For example, if the missing data relate to a medical concern, ignoring them does not make the concern go away. The best approach is to obtain the missing values by repeating the data collection phase with emphasis on the missing values.