Question

In: Statistics and Probability

What are the limitations of using a dataset when data are NOT missing at random (MNAR)? Can you still publish a paper using a dataset in this condition?

Solutions

Expert Solution

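When the data in a dataset are Missing Not At Random (MNAR), there are two possible reasons for the missing values.

1. The missingness depends on the unobserved (hypothetical) value itself. For example, people with high salaries generally do not reveal their incomes in surveys.

2. The missingness depends on the value of some other variable. For example, underage smokers generally do not reveal their smoking habits. Here, the missing values in the smoking variable are affected by the age variable. (If that other variable were fully observed and explained the missingness completely, the data would strictly be missing at random, MAR. The case is MNAR when the other variable is itself unobserved, or when the missingness also depends on the hidden value, as it does here with smoking status.)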

The limitation in these scenarios is that removing the observations with missing values can bias any model fitted to the dataset, and therefore it is very uncommon to publish a paper using this type of dataset.
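The bias from deleting incomplete observations can be illustrated with a small simulation of the income example above. This is only a sketch on made-up data: the log-normal income distribution, the missingness probabilities (0.6 above the median, 0.1 below), and the function name `simulate_mnar_income` are all assumptions chosen for illustration.

```python
import random
import statistics


def simulate_mnar_income(n=20000, seed=0):
    """Simulate survey incomes where high earners tend to withhold their answer."""
    rng = random.Random(seed)
    # Hypothetical incomes drawn from a log-normal distribution (an assumption).
    incomes = [rng.lognormvariate(10.5, 0.5) for _ in range(n)]
    median = statistics.median(incomes)
    observed = []
    for value in incomes:
        # MNAR: the chance that a value is missing depends on the value itself.
        p_missing = 0.6 if value > median else 0.1
        if rng.random() >= p_missing:
            observed.append(value)
    return incomes, observed


incomes, observed = simulate_mnar_income()
true_mean = statistics.fmean(incomes)
complete_case_mean = statistics.fmean(observed)

print("true mean income:          ", round(true_mean))
print("mean after dropping missing:", round(complete_case_mean))
```

Because high incomes are the ones most often missing, the mean computed from the remaining rows underestimates the true mean, and collecting more data of the same kind would not fix it.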

If you still wish to proceed, you can opt for one of the following methods.

1. Try to obtain the missing data - This is possible during the data collection phase, for example in a survey, where you can reach out to the source again. In most real-world scenarios, however, recovering the data this way is highly unlikely.

2. Dropping the variable - If too many values are missing for a variable (as in the case of underage smokers), you can drop the variable altogether (or, for example, consider only the legal age group). This should be the last option, and one needs to check whether the performance of the model has improved after deleting the variable.

3. Using multiple imputation - This is the most sophisticated and widely used approach to the missing value problem: software generates several plausible values for each missing entry. It is feasible if you can find a correlation between the variable with missing values and a related variable. Note that standard multiple imputation assumes the missingness is explained by the observed variables, so under MNAR it can reduce the bias but may not remove it.
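As a rough illustration of the third method, the sketch below imputes a variable `y` several times from a related variable `x` and pools the results. It is a simplified version, not what dedicated imputation software does: it uses a single fitted regression line plus random noise for every imputation (it does not redraw the regression parameters), it combines the estimates with the standard pooling formulas (Rubin's rules), and the data, the helper names `fit_line` and `multiply_impute_mean`, and the missingness rule are all hypothetical. The missingness here depends only on the observed `x`, which is the situation in which imputation from a correlated variable works well.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; also returns the residual std. deviation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in residuals) / (len(xs) - 2)) ** 0.5
    return a, b, sigma


def multiply_impute_mean(xs, ys, m=20, seed=0):
    """Estimate mean(y) when some entries of ys are None, using m imputed datasets."""
    rng = random.Random(seed)
    pairs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sigma = fit_line([x for x, _ in pairs], [y for _, y in pairs])
    estimates, variances = [], []
    for _ in range(m):
        # Fill each gap with a plausible value: regression prediction plus noise.
        completed = [y if y is not None else a + b * x + rng.gauss(0, sigma)
                     for x, y in zip(xs, ys)]
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    pooled = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)       # average within-imputation variance
    between = statistics.variance(estimates)   # between-imputation variance
    total_var = within + (1 + 1 / m) * between
    return pooled, total_var


# Hypothetical data: y is correlated with x, and y is often missing when x > 0.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys_full = [2.0 + 1.5 * x + rng.gauss(0, 1) for x in xs]
ys = [None if (x > 0 and rng.random() < 0.7) else y for x, y in zip(xs, ys_full)]

true_mean = statistics.fmean(ys_full)
complete_case_mean = statistics.fmean([y for y in ys if y is not None])
pooled_mean, total_var = multiply_impute_mean(xs, ys)

print("true mean:         ", round(true_mean, 3))
print("complete-case mean:", round(complete_case_mean, 3))
print("pooled MI mean:    ", round(pooled_mean, 3), "+/-", round(total_var ** 0.5, 3))
```

In this setup the pooled estimate lands much closer to the true mean than the complete-case mean does. If the missingness also depended on the hidden value of `y` itself, some bias would be expected to remain.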

Publishing a paper using a dataset with data missing not at random is very difficult in the real world. For example, if the missing data relate to a medical concern, ignoring them does not make the concern go away. The best approach is to obtain the missing values by repeating the data collection phase with emphasis on the missing values.

