Question

5. Suppose that a data scientist has 200 observations, 300 input variables, and a categorical output
variable. Here is his/her analysis procedure:
Step 1. Find a small subset of good predictors that show fairly strong (univariate) association with
the output variable.
Step 2. Using the input variables selected in Step 1, build LDA (linear discriminant analysis) and
QDA (quadratic discriminant analysis) models.
Step 3. Perform 5-fold cross-validation for both LDA and QDA models with input variables selected
in Step 1.
Step 4. Select a model with a smaller CV estimate.
Step 5. To predict new observations using the model selected in Step 4, obtain 5 predicted values
from 5 models estimated in 5-fold CV of Step 3. And then, classify new observations by
majority vote rule.
(1) In fact, the data scientist made two mistakes in this prediction problem. Indicate the
mistakes and explain why those mistakes are serious problems.
(2) Give a correct procedure by fixing the mistakes found in part (1).

(just conceptual question)

Solutions

Expert Solution

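First Mistake:

In Step 1, the predictors are selected by their univariate association with the output variable using all 200 observations, before any cross-validation is run. With 300 input variables and only 200 observations, some variables will show a fairly strong association with the output purely because of random noise. Because the screening has already seen the output values of every observation, including those later held out in Step 3, the 5-fold CV estimates for LDA and QDA are optimistically biased and can badly understate the true error rate. Univariate screening also ignores correlations and interactions among the input variables, so it can keep redundant predictors and discard ones that matter only jointly.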

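Correction:

Treat variable selection as part of model building and repeat it inside every cross-validation fold, using only that fold's training observations, so that the held-out observations play no role in choosing the predictors. Checks that do not use the output variable can be run once on the full data: initial data quality checks (missing values, constant columns, etc.) remove uninformative variables, and a multicollinearity check flags variables with a high VIF, which should not be carried forward to model building. (VIF is only computable once there are fewer variables than observations.) Methods such as PCA, stepwise selection, or decision trees can also be used to reduce the set of variables; stepwise selection and decision trees use the output variable, so they too belong inside the folds.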
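
To see how large the bias from screening outside the folds can be, here is a minimal Python sketch using only NumPy. It is an illustration under stated assumptions, not the data scientist's actual analysis: the output is assumed to have two classes coded 0/1, the predictors are pure noise (so the true error rate is 0.5), the screening keeps k = 10 variables, and LDA is written by hand. All function names (top_k_by_correlation, lda_fit_predict, cv_error, compare) are hypothetical.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Indices of the k columns of X with the largest |correlation| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(corr))[:k]

def lda_fit_predict(X_train, y_train, X_test):
    """LDA with class means, a pooled covariance matrix and empirical priors."""
    classes = np.unique(y_train)
    n, p = X_train.shape
    pooled = np.zeros((p, p))
    means, log_priors = [], []
    for c in classes:
        Xc = X_train[y_train == c]
        mu = Xc.mean(axis=0)
        means.append(mu)
        log_priors.append(np.log(len(Xc) / n))
        pooled += (Xc - mu).T @ (Xc - mu)
    pooled /= n - len(classes)
    scores = []
    for mu, log_prior in zip(means, log_priors):
        w = np.linalg.solve(pooled, mu)          # Sigma^{-1} mu_k
        scores.append(X_test @ w - 0.5 * mu @ w + log_prior)
    return classes[np.argmax(np.column_stack(scores), axis=1)]

def cv_error(X, y, k, folds, select_inside):
    """Misclassification rate over the folds, screening inside or outside them."""
    n = len(y)
    if not select_inside:
        cols = top_k_by_correlation(X, y, k)     # flawed Step 1: sees every label
    wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        if select_inside:
            cols = top_k_by_correlation(X[train], y[train], k)  # training part only
        pred = lda_fit_predict(X[train][:, cols], y[train], X[test][:, cols])
        wrong += np.sum(pred != y[test])
    return wrong / n

def compare(seed, n=200, p=300, k=10, n_folds=5):
    """Return (CV error with outside screening, CV error with inside screening)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))          # predictors carry no information
    y = rng.integers(0, 2, size=n)       # labels are independent of X
    folds = np.array_split(rng.permutation(n), n_folds)
    return (cv_error(X, y, k, folds, select_inside=False),
            cv_error(X, y, k, folds, select_inside=True))

outside, inside = compare(seed=0)
print("screening once on all data:", outside)
print("screening inside each fold:", inside)
```

On noise data like this, the estimate with screening done once on all data typically comes out well below 0.5, while screening inside each fold gives an estimate close to the true error rate of 0.5.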

Second Mistake:

In Step 5, new observations are classified by a majority vote over the 5 models fitted during the 5-fold CV of Step 3.

Cross-validation is a technique for estimating how well a model will perform on unseen data (the validation set); it is not a model-building step. Each of the 5 fold models was trained on only 80% of the data (160 of the 200 observations), so none of them uses all the available information. These 5 fitted models should not be used for real prediction; the final model has to be fitted on the whole data set.

Correction:

Choose the algorithm (LDA or QDA) with the best accuracy averaged over all 5 folds, that is, the smallest average CV error, rather than the one that happens to do best on a single fold. Once the algorithm is chosen, refit it on the whole training data and use that single fitted model to predict new observations.
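
The corrected procedure can be sketched in Python with NumPy as follows. This is only a rough illustration under stated assumptions: a two-class output coded 0/1, a correlation screen that keeps k = 10 variables and is repeated inside every fold, hand-written Gaussian discriminant rules (a shared covariance gives LDA, separate covariances give QDA), and synthetic data with a hypothetical signal in the first three columns. The names screen, fit_discriminant, predict, cv_error and select_and_refit are made up for this example.

```python
import numpy as np

def screen(X, y, k):
    """Indices of the k columns most correlated (in absolute value) with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
    return np.argsort(-np.abs(corr))[:k]

def fit_discriminant(X, y, pooled):
    """pooled=True: LDA (shared covariance); pooled=False: QDA (one per class)."""
    classes = np.unique(y)
    n = len(y)
    params = []
    for c in classes:
        Xc = X[y == c]
        cov = np.atleast_2d(np.cov(Xc, rowvar=False))
        params.append([Xc.mean(axis=0), cov, np.log(len(Xc) / n), len(Xc)])
    if pooled:
        shared = sum((m - 1) * S for _, S, _, m in params) / (n - len(classes))
        for prm in params:
            prm[1] = shared
    return classes, params

def predict(model, X):
    """Assign each row of X to the class with the largest Gaussian score."""
    classes, params = model
    scores = []
    for mu, S, log_prior, _ in params:
        d = X - mu
        maha = np.sum(d * np.linalg.solve(S, d.T).T, axis=1)
        _, logdet = np.linalg.slogdet(S)
        scores.append(-0.5 * maha - 0.5 * logdet + log_prior)
    return classes[np.argmax(np.column_stack(scores), axis=1)]

def cv_error(X, y, k, pooled, folds):
    """CV misclassification rate with the screening redone in every fold."""
    wrong = 0
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        cols = screen(X[train], y[train], k)     # training observations only
        model = fit_discriminant(X[train][:, cols], y[train], pooled)
        wrong += np.sum(predict(model, X[test][:, cols]) != y[test])
    return wrong / len(y)

def select_and_refit(X, y, k=10, n_folds=5, seed=0):
    """Pick LDA or QDA by CV error, then refit the winner on all the data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = {"LDA": cv_error(X, y, k, True, folds),
              "QDA": cv_error(X, y, k, False, folds)}
    best = min(errors, key=errors.get)
    cols = screen(X, y, k)                       # final screening on all data
    model = fit_discriminant(X[:, cols], y, pooled=(best == "LDA"))
    return best, errors, cols, model

# Synthetic data: 200 observations, 300 inputs, signal only in columns 0 to 2.
rng = np.random.default_rng(1)
n, p = 200, 300
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :3] += 1.5 * y[:, None]

best, errors, cols, model = select_and_refit(X, y)
X_new = rng.normal(size=(5, p))                  # five hypothetical new observations
new_pred = predict(model, X_new[:, cols])        # one model, no majority vote
print(best, errors, new_pred)
```

The CV errors are used only to choose between LDA and QDA and to report expected performance; the predictions for new observations come from the single model refitted on all 200 observations.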

