When working with large data sets, handling missing values is an important step. However, some data sets contain variables with a large proportion of missing values; in other words, many rows of the data set are missing a value for those variables. In such cases, dropping the variable leads to a significant loss of data. Imputing the missing values might also be of little use, as the imputations would be based on a small number of records. What alternatives can you suggest when modeling from such data?
I have considered missing data in the context of clinical research:
The best way to handle missing data is to prevent the problem by planning the study well and collecting the data carefully. The following steps are suggested to minimize the amount of missing data in clinical research:
First, the study design should limit the collection of data to those who are participating in the study. This can be achieved by minimizing the number of follow-up visits, collecting only the essential information at each visit, and developing user-friendly case-report forms.
Second, before the clinical research begins, detailed documentation of the study should be developed in the form of a manual of operations. This manual should cover the methods for screening participants, the protocol for training investigators and participants, the methods of communication among investigators and between investigators and participants, the implementation of the treatment, and the procedures for collecting, entering, and editing data.
Third, before participant enrollment starts, training should be conducted to instruct all study personnel on all aspects of the study.
Fourth, a small pilot study performed before the main trial may help identify unexpected problems that are likely to occur during the study, thus reducing the amount of missing data.
Fifth, the study management team should set a priori thresholds for the level of missing data that would be unacceptable. With these targets in mind, data collection at each site should be monitored and reported as close to real time as possible during the course of the study.
Finally, if a patient decides to withdraw from follow-up, the reasons for the withdrawal should be recorded so that they can be used in the subsequent analysis and in the interpretation of the results.
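The per-site monitoring against an a priori target, described in the fifth point, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 10% threshold, the field names, the record layout (one dictionary per visit, with None marking a missing value), and the sample records are all made up for the example and do not come from any real study.

```python
from collections import defaultdict

# Hypothetical a priori target: a site is flagged once more than
# 10% of its expected field values are missing.
MAX_MISSING_FRACTION = 0.10


def missing_fraction_by_site(records, fields):
    """Return {site: fraction of expected field values that are None or absent}."""
    missing = defaultdict(int)
    expected = defaultdict(int)
    for record in records:
        site = record["site"]
        for field in fields:
            expected[site] += 1
            if record.get(field) is None:
                missing[site] += 1
    return {site: missing[site] / expected[site] for site in expected}


def sites_over_target(records, fields, target=MAX_MISSING_FRACTION):
    """Return the sorted list of sites whose missing fraction exceeds the target."""
    fractions = missing_fraction_by_site(records, fields)
    return sorted(site for site, frac in fractions.items() if frac > target)


# Made-up records, for illustration only.
records = [
    {"site": "A", "weight_kg": 70.2, "systolic_bp": 121},
    {"site": "A", "weight_kg": 65.0, "systolic_bp": 118},
    {"site": "B", "weight_kg": None, "systolic_bp": 130},
    {"site": "B", "weight_kg": 80.1, "systolic_bp": None},
]
fields = ["weight_kg", "systolic_bp"]

print(missing_fraction_by_site(records, fields))  # {'A': 0.0, 'B': 0.5}
print(sites_over_target(records, fields))         # ['B']
```

Running such a check each time new data are entered gives the management team an early signal about which sites need attention, rather than discovering the gaps only at the analysis stage.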
It is not uncommon for a study to have a considerable amount of missing data. One way of handling missing data is to use analysis methods that are robust to the problems it causes. An analysis method is considered robust to missing data when there is confidence that mild to moderate violations of its assumptions will produce little or no bias or distortion in the conclusions drawn about the population. However, it is not always possible to use such methods. Therefore, a number of alternative ways of handling missing data have been developed.