In: Statistics and Probability
5. Suppose that a data scientist has 200 observations, 300
input variables, and a categorical output
variable. Here is his/her analysis procedure:
Step 1. Find a small subset of good predictors that show fairly
strong (univariate) association with
the output variable.
Step 2. Using the input variables selected in Step 1, build LDA
(linear discriminant analysis) and
QDA (quadratic discriminant analysis) models.
Step 3. Perform 5-fold cross-validation for both LDA and QDA models
with input variables selected
in Step 1.
Step 4. Select the model with the smaller CV error estimate.
Step 5. To predict new observations using the model selected in
Step 4, obtain 5 predicted values
from 5 models estimated in 5-fold CV of Step 3. And then, classify
new observations by
majority vote rule.
(1) In fact, the data scientist made two mistakes in this
prediction problem. Indicate the
mistakes and explain why those mistakes are serious problems.
(2) Give a right procedure by correcting the mistakes found in part
(1).
(just a conceptual question)
(1) The first mistake is:
Selecting the predictors solely because they show fairly strong univariate association with the output variable. With 300 candidate variables and only 200 observations, some variables will appear strongly associated with the output purely by chance, so a high marginal correlation can be nothing more than random noise. One should also first check whether the input variables are related to each other (correlation among predictors, interactions) before any are kept or discarded. Screening variables only on their correlation with the dependent variable is therefore a poor selection rule.
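To see why marginal association alone is unreliable, here is a minimal numpy sketch (sizes taken from the problem: 200 observations, 300 predictors, binary labels; the data here are pure noise, so every "strong" correlation it finds is spurious):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 300
X = rng.standard_normal((n, p))   # 300 pure-noise predictors
y = rng.integers(0, 2, size=n)    # random binary labels, unrelated to X

# Pearson correlation of each noise column with the labels
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corrs = Xc.T @ yc / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())

# Even though X and y are independent, the best-looking noise
# variable shows a sizable correlation just by chance.
print(float(np.abs(corrs).max()))
```

With 300 candidates and only 200 observations, the maximum spurious correlation is typically around 0.2, which a naive univariate screen would happily call a "good predictor".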
Correction:
Other methods can be used to drop variables. Initial data-quality checks (missing values, constant columns, etc.) remove clearly redundant variables. The next step is to check for multicollinearity: variables with a high VIF (variance inflation factor) should not be carried forward into model building. Beyond that, methods such as PCA, stepwise selection, or decision trees can be used to select variables.
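The VIF check mentioned above can be sketched directly from its definition, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the remaining predictors (a minimal numpy version; the toy data with a near-duplicate column is made up for illustration):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each (centered) column of X."""
    X = X - X.mean(axis=0)
    p = X.shape[1]
    out = []
    for j in range(p):
        xj = X[:, j]
        others = np.delete(X, j, axis=1)
        # R^2 of regressing column j on all other columns
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1.0 - (resid @ resid) / (xj @ xj)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(1)
a = rng.standard_normal(100)
b = rng.standard_normal(100)
c = a + 0.05 * rng.standard_normal(100)   # nearly a duplicate of a
X = np.column_stack([a, b, c])
print(np.round(vif(X), 1))
```

Columns `a` and `c` get very large VIFs because each is almost perfectly predicted by the other, while the independent column `b` stays near 1; a common rule of thumb is to drop or combine variables with VIF above 5 or 10.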
The second mistake:
Step 5, i.e., predicting new observations by obtaining 5 predicted values from the 5 models estimated in the 5-fold CV of Step 3 and classifying by majority vote.
Cross-validation is a technique for estimating how well a model will perform on unseen data, not a model-building procedure in itself. The 5 models fitted inside CV exist only to produce the error estimate; they should not be used for real prediction. The whole training data should be used to fit the model that makes predictions.
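The distinction can be sketched in plain numpy. As a stand-in for LDA, this uses a nearest-centroid classifier (LDA with a shared spherical covariance reduces to nearest centroid) on made-up two-class data; the fold models only produce the accuracy estimate, and a single model refitted on all the data is what would be deployed:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, 2)) + 1.5 * y[:, None]  # two shifted classes

def fit_centroids(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(cents, X):
    classes = sorted(cents)
    d = np.stack([np.linalg.norm(X - cents[c], axis=1) for c in classes])
    return np.array(classes)[d.argmin(axis=0)]

# 5-fold CV: each fold's model scores ONLY its held-out fold
idx = rng.permutation(n)
folds = np.array_split(idx, 5)
accs = []
for k in range(5):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    cents = fit_centroids(X[train], y[train])
    accs.append((predict(cents, X[test]) == y[test]).mean())
cv_acc = float(np.mean(accs))   # a performance ESTIMATE, nothing more

# The 5 fold models are now discarded; the model that actually
# predicts new observations is refitted once on ALL the data.
final_model = fit_centroids(X, y)
```

The five fold models are temporary by-products of the estimate; keeping them around for a majority vote conflates evaluation with prediction.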
Correction:
Choose the algorithm (LDA or QDA) with the smaller CV error estimate, i.e., the higher accuracy averaged over all 5 folds. Having chosen the algorithm, retrain it on the entire training data and use that single refitted model to make predictions on new observations.
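The corrected Steps 4 and 5 can be sketched end to end. Since the point is the procedure rather than the classifiers, this uses simple numpy stand-ins: nearest centroid in place of LDA, and a per-class diagonal-variance Gaussian score in place of QDA (both toy substitutes, not the real estimators):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
y = rng.integers(0, 2, size=n)
# class 1 has a shifted mean AND a larger variance
X = rng.standard_normal((n, 2)) * (1 + y[:, None]) + 2.0 * y[:, None]

def fit_lda(X, y):
    # shared spherical covariance -> nearest centroid (LDA-like stand-in)
    return {c: (X[y == c].mean(0), None) for c in (0, 1)}

def fit_qda(X, y):
    # per-class diagonal variance (QDA-like stand-in)
    return {c: (X[y == c].mean(0), X[y == c].var(0) + 1e-9) for c in (0, 1)}

def predict(model, X):
    scores = []
    for c in (0, 1):
        mu, var = model[c]
        if var is None:
            scores.append(-((X - mu) ** 2).sum(1))
        else:
            scores.append(-(((X - mu) ** 2) / var + np.log(var)).sum(1))
    return (scores[1] > scores[0]).astype(int)

def cv_accuracy(fit, X, y, k=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        accs.append((predict(fit(X[tr], y[tr]), X[te]) == y[te]).mean())
    return float(np.mean(accs))

# Step 4 (corrected): compare the two candidates by mean CV accuracy
acc_lda = cv_accuracy(fit_lda, X, y)
acc_qda = cv_accuracy(fit_qda, X, y)
best_fit = fit_qda if acc_qda >= acc_lda else fit_lda

# Step 5 (corrected): refit the winner once on ALL data and use it alone
final_model = best_fit(X, y)
```

CV picks the algorithm; the deployed model is a single fit of that algorithm on the full training data, not a vote among fold models.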