The data set in the table contains information on the spread of prostate cancer to the lymph nodes for 53 patients.
For a sample of prostate cancer patients, a set of possible predictor variables was measured before surgery to determine whether the lymph nodes were compromised. Subsequently, each patient underwent surgery and the status of his lymph nodes was determined.
The data set contains 53 observations of 7 variables:
id: identifier for each subject in the study.
ssln: takes the value 1 if the cancer has spread to the lymph nodes and 0 if not.
age: a numeric vector containing the age of the patient at the time of diagnosis.
acid: a numeric vector containing the level of acid phosphatase in the blood (serum acid phosphatase, or prostatic acid phosphatase, PAP). High PAP levels may be associated with the presence of prostate cancer.
xray: a measure of the seriousness of the cancer obtained from a radiological examination. A value of 1 represents a more serious case.
size: size of the tumor determined by palpation. A value of 1 identifies a large tumor that can be palpated without problems.
grade: another measure of tumor seriousness, obtained from a pathologist's reading of a needle biopsy taken before surgery. A value of 1 corresponds to a more serious case.
Use RStudio to determine which of the variables measured before surgery are associated with the spread of cancer to the lymph nodes.
Please provide the code you used to solve this problem.
row | id | ssln | age | acid | xray | size | grade |
1 | 1 | 0 | 66 | 0.48 | 0 | 0 | 0 |
2 | 2 | 0 | 68 | 0.56 | 0 | 0 | 0 |
3 | 3 | 0 | 66 | 0.5 | 0 | 0 | 0 |
4 | 4 | 0 | 56 | 0.52 | 0 | 0 | 0 |
5 | 5 | 0 | 58 | 0.5 | 0 | 0 | 0 |
6 | 6 | 0 | 60 | 0.49 | 0 | 0 | 0 |
7 | 7 | 0 | 65 | 0.46 | 1 | 0 | 0 |
8 | 8 | 0 | 60 | 0.62 | 1 | 0 | 0 |
9 | 9 | 1 | 50 | 0.56 | 0 | 0 | 1 |
10 | 10 | 0 | 49 | 0.55 | 1 | 0 | 0 |
11 | 11 | 0 | 61 | 0.62 | 0 | 0 | 0 |
12 | 12 | 0 | 58 | 0.71 | 0 | 0 | 0 |
13 | 13 | 0 | 51 | 0.65 | 0 | 0 | 0 |
14 | 14 | 1 | 67 | 0.67 | 1 | 0 | 1 |
15 | 15 | 0 | 67 | 0.47 | 0 | 0 | 1 |
16 | 16 | 0 | 51 | 0.49 | 0 | 0 | 0 |
17 | 17 | 0 | 56 | 0.5 | 0 | 0 | 1 |
18 | 18 | 0 | 60 | 0.78 | 0 | 0 | 0 |
19 | 19 | 0 | 52 | 0.83 | 0 | 0 | 0 |
20 | 20 | 0 | 56 | 0.98 | 0 | 0 | 0 |
21 | 21 | 0 | 67 | 0.52 | 0 | 0 | 0 |
22 | 22 | 0 | 63 | 0.75 | 0 | 0 | 0 |
23 | 23 | 1 | 59 | 0.99 | 0 | 0 | 1 |
24 | 24 | 0 | 64 | 1.87 | 0 | 0 | 0 |
25 | 25 | 1 | 61 | 1.36 | 1 | 0 | 0 |
26 | 26 | 1 | 56 | 0.82 | 0 | 0 | 0 |
27 | 27 | 0 | 64 | 0.4 | 0 | 1 | 1 |
28 | 28 | 0 | 61 | 0.5 | 0 | 1 | 0 |
29 | 29 | 0 | 64 | 0.5 | 0 | 1 | 1 |
30 | 30 | 0 | 63 | 0.4 | 0 | 1 | 0 |
31 | 31 | 0 | 52 | 0.55 | 0 | 1 | 1 |
32 | 32 | 0 | 66 | 0.59 | 0 | 1 | 1 |
33 | 33 | 1 | 58 | 0.48 | 1 | 1 | 0 |
34 | 34 | 1 | 57 | 0.51 | 1 | 1 | 1 |
35 | 35 | 1 | 65 | 0.49 | 0 | 1 | 0 |
36 | 36 | 0 | 65 | 0.48 | 0 | 1 | 1 |
37 | 37 | 0 | 59 | 0.63 | 1 | 1 | 1 |
38 | 38 | 0 | 61 | 1.02 | 0 | 1 | 0 |
39 | 39 | 0 | 53 | 0.76 | 0 | 1 | 0 |
40 | 40 | 0 | 67 | 0.95 | 0 | 1 | 0 |
41 | 41 | 0 | 53 | 0.66 | 0 | 1 | 1 |
42 | 42 | 1 | 65 | 0.84 | 1 | 1 | 1 |
43 | 43 | 1 | 50 | 0.81 | 1 | 1 | 1 |
44 | 44 | 1 | 60 | 0.76 | 1 | 1 | 1 |
45 | 45 | 1 | 45 | 0.7 | 0 | 1 | 1 |
46 | 46 | 1 | 56 | 0.78 | 1 | 1 | 1 |
47 | 47 | 1 | 46 | 0.7 | 0 | 1 | 0 |
48 | 48 | 1 | 67 | 0.67 | 0 | 1 | 0 |
49 | 49 | 1 | 63 | 0.82 | 0 | 1 | 0 |
50 | 50 | 1 | 57 | 0.67 | 0 | 1 | 1 |
51 | 51 | 1 | 51 | 0.72 | 1 | 1 | 0 |
52 | 52 | 1 | 64 | 0.89 | 1 | 1 | 0 |
53 | 53 | 1 | 68 | 1.26 | 1 | 1 | 1 |
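Using RStudio, we are asked to determine which of the predictors (age, acid, xray, size, grade) are associated with the binary response "ssln". Note that this is a question of association, not causation. Because the response is binary, the appropriate model is a logistic regression, which models the log-odds of lymph-node spread as a linear function of the predictors.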
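The original code is not shown in the answer, so the following is only a minimal sketch of how such a fit might be run in R. The values are transcribed from the table above; the object names `prostate` and `fit` are assumptions chosen for illustration, and a main-effects model with the default logit link is assumed.

```r
# Data transcribed from the table above (53 patients, in row order).
prostate <- data.frame(
  id   = 1:53,
  ssln = c(0,0,0,0,0,0,0,0,1,0, 0,0,0,1,0,0,0,0,0,0,
           0,0,1,0,1,1,0,0,0,0, 0,0,1,1,1,0,0,0,0,0,
           0,1,1,1,1,1,1,1,1,1, 1,1,1),
  age  = c(66,68,66,56,58,60,65,60,50,49, 61,58,51,67,67,51,56,60,52,56,
           67,63,59,64,61,56,64,61,64,63, 52,66,58,57,65,65,59,61,53,67,
           53,65,50,60,45,56,46,67,63,57, 51,64,68),
  acid = c(0.48,0.56,0.50,0.52,0.50,0.49,0.46,0.62,0.56,0.55,
           0.62,0.71,0.65,0.67,0.47,0.49,0.50,0.78,0.83,0.98,
           0.52,0.75,0.99,1.87,1.36,0.82,0.40,0.50,0.50,0.40,
           0.55,0.59,0.48,0.51,0.49,0.48,0.63,1.02,0.76,0.95,
           0.66,0.84,0.81,0.76,0.70,0.78,0.70,0.67,0.82,0.67,
           0.72,0.89,1.26),
  xray = c(0,0,0,0,0,0,1,1,0,1, 0,0,0,1,0,0,0,0,0,0,
           0,0,0,0,1,0,0,0,0,0, 0,0,1,1,0,0,1,0,0,0,
           0,1,1,1,0,1,0,0,0,0, 1,1,1),
  size = rep(c(0, 1), times = c(26, 27)),  # rows 1-26 are 0, rows 27-53 are 1
  grade = c(0,0,0,0,0,0,0,0,1,0, 0,0,0,1,1,0,1,0,0,0,
            0,0,1,0,0,0,1,0,1,0, 1,1,0,1,0,1,1,0,0,0,
            1,1,1,1,1,1,0,0,0,1, 0,0,1)
)

# Logistic regression of lymph-node spread on the five pre-surgery predictors.
# id is only a label and is deliberately left out of the model.
fit <- glm(ssln ~ age + acid + xray + size + grade,
           family = binomial(link = "logit"), data = prostate)

# Coefficient table with standard errors, Wald z statistics and p-values.
summary(fit)

# Five-number summary of the deviance residuals, requested explicitly
# because summary(fit) may not print it in recent R versions.
summary(residuals(fit, type = "deviance"))
```

The coefficient table printed by `summary(fit)` is where the per-predictor p-values are read from; the explicit residual summary gives the quartiles used to judge symmetry.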
Looking at the deviance residuals, we find that the quartiles are approximately equidistant from the median and the values are centered near zero, which suggests that the residuals are approximately symmetric.
From the p-values of the predictors, we find that "xray" (p-value = 0.01177 < 0.05) and "size" (p-value = 0.01380 < 0.05) are the only predictors that are significant at the 5% level. These are therefore the variables for which the model shows evidence of an association with the spread of cancer to the lymph nodes.
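As a quick arithmetic check on these figures, the sketch below converts the quoted p-values back into the size of the test statistic. It assumes they are the two-sided Wald z-test p-values that `summary()` reports for a `glm` fit, which the answer does not state explicitly. The numbers are copied from the text, not recomputed from the data.

```r
# p-values quoted in the text (copied, not recomputed here)
p <- c(xray = 0.01177, size = 0.01380)

# Two-sided Wald test: p = 2 * (1 - pnorm(|z|)), so |z| = qnorm(1 - p / 2)
z <- qnorm(1 - p / 2)
round(z, 2)

# Round trip: recover the p-values from |z|
p_back <- 2 * pnorm(-abs(z))
all.equal(p_back, p)
```

Both implied |z| values lie above 1.96, the two-sided 5% critical value of the standard normal, which is consistent with the two predictors being declared significant at that level.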