Question

Consider the prostate dataset, which contains data from a study of 97 men with prostate cancer. You will have to install the 'faraway' package and use the 'prostate' dataset for Questions 1-5.

Description of dataset:

lcavol - log(cancer volume)
lweight - log(prostate weight)
age - age
lbph - log(benign prostatic hyperplasia amount)
svi - seminal vesicle invasion
lcp - log(capsular penetration)
gleason - Gleason score
pgg45 - percentage Gleason scores 4 or 5
lpsa - log(prostate specific antigen)

Build a KNN regression model with lpsa as the response variable and all other variables as predictors. For this, use the knn.reg function from the "FNN" package.

We want to estimate accurately how well our KNN regression model performs. To do this, divide the dataset into training and test sets. Set the seed to 42 at the beginning. Create a training set of size 70 using the sample() function; the remaining 27 observations go into the test set.
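
The split described above could be sketched in R as follows. This is only one reading of the instructions: it assumes the 'faraway' package is already installed, and the object names train_idx, train and test are illustrative, not part of the original question. Note that sample() can return different indices for the same seed in R versions before 3.6.0, so the exact split (and any MSE computed from it) may depend on the R version.

```r
# Assumes install.packages("faraway") has been run beforehand.
data(prostate, package = "faraway")   # data frame with 97 rows and 9 columns

set.seed(42)                                    # fix the RNG before sampling
train_idx <- sample(nrow(prostate), size = 70)  # 70 distinct row indices

train <- prostate[train_idx, ]    # 70 training observations
test  <- prostate[-train_idx, ]   # the remaining 27 observations

dim(train)
dim(test)
```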

1)

Estimate the mean squared error (MSE) on the test set for the model that runs regression based on the three nearest neighbours (lpsa as the response and all other variables as predictors) on the prostate data. Consider this Model 1. Set the seed to 42 at the beginning. Please use the knn.reg function from the FNN package to calculate this.
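
A minimal sketch of Model 1, assuming the 'faraway' and 'FNN' packages are installed. The names x_train, x_test, y_train, y_test, fit_k3 and mse_k3 are illustrative. The predictors are used unscaled because the question does not ask for standardisation; since KNN is distance-based, variables on larger scales (such as age and pgg45) will dominate the neighbour search under this choice.

```r
library(FNN)                          # provides knn.reg()
data(prostate, package = "faraway")

set.seed(42)
train_idx <- sample(nrow(prostate), 70)

# All columns except the response lpsa are predictors.
predictors <- setdiff(names(prostate), "lpsa")
x_train <- as.matrix(prostate[train_idx, predictors])
x_test  <- as.matrix(prostate[-train_idx, predictors])
y_train <- prostate$lpsa[train_idx]
y_test  <- prostate$lpsa[-train_idx]

# Each test prediction is the mean lpsa of its 3 nearest training points.
fit_k3 <- knn.reg(train = x_train, test = x_test, y = y_train, k = 3)

# Test-set mean squared error for Model 1.
mse_k3 <- mean((y_test - fit_k3$pred)^2)
mse_k3
```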

2)

You are unhappy with the high MSE and want to test how a k=5 model performs. What is the mean squared error on the test set for the 5-nearest-neighbour regression model? Consider this Model 2. Set the seed to 42 at the beginning.
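
One way to compare the two models on the same split is to wrap the test-set MSE in a small helper and evaluate it for k = 3 and k = 5. This is a sketch under the same assumptions as the question (seed 42, 70 training rows, unscaled predictors); test_mse and mse_by_k are hypothetical names introduced here.

```r
library(FNN)
data(prostate, package = "faraway")

set.seed(42)
train_idx <- sample(nrow(prostate), 70)
is_pred   <- names(prostate) != "lpsa"   # TRUE for the 8 predictor columns

x_train <- as.matrix(prostate[train_idx, is_pred])
x_test  <- as.matrix(prostate[-train_idx, is_pred])
y_train <- prostate$lpsa[train_idx]
y_test  <- prostate$lpsa[-train_idx]

# Test-set MSE of a k-nearest-neighbour regression for a given k.
test_mse <- function(k) {
  pred <- knn.reg(train = x_train, test = x_test, y = y_train, k = k)$pred
  mean((y_test - pred)^2)
}

# Model 1 (k = 3) and Model 2 (k = 5), evaluated on the same test set.
mse_by_k <- sapply(c(k3 = 3, k5 = 5), test_mse)
mse_by_k
```

Because both models share one split, any difference in MSE reflects the choice of k rather than a different sample.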

Solutions

Expert Solution

code

