Consider the prostate dataset, which contains data from a study of 97 men with prostate cancer. You will need to install the 'faraway' package and use its 'prostate' dataset for Questions 1-5.

Description of dataset:
lcavol - log(cancer volume)
lweight - log(prostate weight)
age - age
lbph - log(benign prostatic hyperplasia amount)
svi - seminal vesicle invasion
lcp - log(capsular penetration)
gleason - Gleason score
pgg45 - percentage Gleason scores 4 or 5
lpsa - log(prostate specific antigen)

Build a KNN regression model with lpsa as the response variable and all other variables as predictors, using the knn.reg function from the "FNN" package. To estimate accurately how well the model performs, divide the dataset into training and test sets: set the seed to 42 at the beginning, draw a training set of 70 observations with the sample() function, and put the remaining 27 observations in the test set.
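The split described above can be sketched as follows. This is one possible reading of the instructions, not a confirmed reference solution: the variable names (train_idx, train_set, test_set) are illustrative, and it assumes the 'faraway' package is already installed.

```r
# Load the prostate data (97 rows, 9 columns) from the faraway package.
library(faraway)
data(prostate)

# Fix the random seed so the split is reproducible.
set.seed(42)

# Draw 70 row indices without replacement for the training set.
train_idx <- sample(nrow(prostate), 70)

# The training set gets the sampled rows; the test set gets the rest.
train_set <- prostate[train_idx, ]
test_set  <- prostate[-train_idx, ]

dim(train_set)
dim(test_set)
```

Note that the exact rows drawn by sample() can differ between R versions (the default sampling algorithm changed in R 3.6.0), so results computed from this split may not match those obtained on an older installation.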
1)
Estimate the mean squared error (MSE) on the test set for a KNN regression model (lpsa as the response and all other variables as predictors) based on the three nearest neighbours for the prostate data. Consider this Model 1. Set the seed to 42 at the beginning, and use the knn.reg function from the FNN package to calculate this.
2)
You are unhappy with the high MSE and want to test how a k = 5 model performs. What is the mean squared error on the test set for the 5-nearest-neighbour regression model? Consider this Model 2. Set the seed to 42 at the beginning.
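A minimal sketch of how the test MSE for Models 1 and 2 might be computed is given below. The helper knn_test_mse and the other variable names are hypothetical, introduced only for illustration. The sketch assumes the predictors are used unscaled, since the question does not mention standardisation; scaling them first would change the distances and therefore the results.

```r
library(faraway)  # provides the prostate data
library(FNN)      # provides knn.reg
data(prostate)

# Reproducible 70/27 train/test split.
set.seed(42)
train_idx <- sample(nrow(prostate), 70)

# Predictors are every column except the response lpsa (unscaled: an assumption).
predictors <- setdiff(names(prostate), "lpsa")
x_train <- prostate[train_idx, predictors]
x_test  <- prostate[-train_idx, predictors]
y_train <- prostate$lpsa[train_idx]
y_test  <- prostate$lpsa[-train_idx]

# Hypothetical helper: fit KNN regression with k neighbours and
# return the mean squared error of its predictions on the test set.
knn_test_mse <- function(k) {
  fit <- knn.reg(train = x_train, test = x_test, y = y_train, k = k)
  mean((y_test - fit$pred)^2)
}

mse_k3 <- knn_test_mse(3)  # Model 1
mse_k5 <- knn_test_mse(5)  # Model 2
print(c(model1_k3 = mse_k3, model2_k5 = mse_k5))
```

Because knn.reg averages the responses of the k closest training points, the seed matters only for the sample() call that defines the split; the same split is reused for both models so that their MSE values are directly comparable.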