In: Statistics and Probability
Write R code
HW09 The National Institute of Diabetes and Digestive and Kidney Diseases
conducted a study on 768 adult female Pima Indians living near Phoenix.
The pima dataset resulting from the study is available in R and contains the following variables:
test - results of a test to determine if the female patient shows signs of diabetes
(coded 0 if negative, 1 if positive)
age - Age (years)
bmi - Body mass index (weight in kg/(height in metres squared))
diastolic - Diastolic blood pressure (mm Hg)
diabetes - Diabetes pedigree function
glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test
insulin - 2-Hour serum insulin (mu U/ml)
pregnant - Number of times pregnant
triceps - Triceps skin fold thickness (mm)
The following logistic model with test as the response was fitted, giving the confusion matrix below:
logistic <- glm(test ~ age + bmi + diastolic + diabetes + glucose + insulin + pregnant +
triceps,family=binomial(logit),data=pima)
| | Reference = 0 | Reference = 1 |
|---|---|---|
| Prediction = 0 | 446 | 117 |
| Prediction = 1 | 54 | 151 |
A positive test result (test = 1) was modeled as an event. Compute the following measures (a worked computation from the table above is sketched after this list):
Accuracy
Sensitivity
Specificity
Positive Predictive Value
Negative Predictive Value
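As a quick worked computation, all five measures can be obtained directly from the confusion matrix given above (a minimal sketch; the counts 446, 117, 54 and 151 come from that table, with test = 1 treated as the positive class):
#Cell counts from the confusion matrix in the question (positive class: test = 1)
TP <- 151; FP <- 54; FN <- 117; TN <- 446
accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # (151 + 446) / 768 ~ 0.777
sensitivity <- TP / (TP + FN)                    # 151 / 268 ~ 0.563
specificity <- TN / (TN + FP)                    # 446 / 500 ~ 0.892
ppv         <- TP / (TP + FP)                    # 151 / 205 ~ 0.737
npv         <- TN / (TN + FN)                    # 446 / 563 ~ 0.792
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 3)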
R code
#Importing the dataset
library(readr)
library(caret)   # needed later for confusionMatrix()
db <- read_csv("C:/Users/admin/Downloads/diabetes.csv",
               col_types = cols(Age = col_integer(),
                                BMI = col_number(), BloodPressure = col_integer(),
                                DiabetesPedigreeFunction = col_number(),
                                Glucose = col_integer(), Insulin = col_integer(),
                                Outcome = col_integer(), Pregnancies = col_integer(),
                                SkinThickness = col_integer()))
# Understand the dataset
View(db)
head(db)
str(db)
#Summary statistics of the dataset
summary(db)
#splitting the dataset into train and test sets with a 70/30 ratio
set.seed(5)
dt = sort(sample(nrow(db), nrow(db)*.7))
train<-db[dt,]
test<-db[-dt,]
#checking the sizes of the split datasets
nrow(db)
nrow(train)
nrow(test)
#Fitting the logistic regression model on the training set
logistic <- glm(Outcome ~ Age + BMI + BloodPressure +
                  DiabetesPedigreeFunction + Glucose + Insulin + Pregnancies +
                  SkinThickness, family = binomial(logit), data = train)
summary(logistic)
#Confusion matrix: predicted probabilities above 0.5 are classified as positive (1), otherwise negative (0)
prdVal <- predict(logistic, type = 'response')
summary(prdVal)
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd = prdBln, act = train$Outcome)
confusionMatrix(cnfmtrx, positive = "1")   # caret also reports sensitivity, specificity, PPV and NPV
# Build a confusion matrix (actual outcomes in rows, predictions in columns) at a threshold of 0.5
threshold_0.5 <- table(train$Outcome, prdVal > 0.5)
threshold_0.5
# Accuracy
accuracy_0.5 <- round(sum(diag(threshold_0.5)) / sum(threshold_0.5), 2)
sprintf("Accuracy is %s",accuracy_0.5)
#Sensitivity = TP / (TP + FN); cell counts taken from the threshold_0.5 table for this train split
sensitivity_0.5 <- round(117/(117+77),2)
sprintf("Sensitivity at 0.5 threshold: %s", sensitivity_0.5)
#Specificity = TN / (TN + FP)
specificity_0.5 <- round(304/(304+39),2)
sprintf("Specificity at 0.5 threshold: %s", specificity_0.5)
Formulae:
Confusion Matrix: Compares the actual outcomes with the predicted ones
| | Predicted = 0 | Predicted = 1 |
|---|---|---|
| Actual = 0 | True Negatives (TN) | False Positives (FP) |
| Actual = 1 | False Negatives (FN) | True Positives (TP) |
Sensitivity = TP / (TP + FN) (True Positive Rate)
Specificity = TN / (TN + FP) (True Negative Rate)
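For completeness, the remaining measures requested in the question follow from the same table:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Positive Predictive Value = TP / (TP + FP)
Negative Predictive Value = TN / (TN + FN)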
A model with a higher threshold has lower sensitivity but higher specificity.
A model with a lower threshold has higher sensitivity but lower specificity.
Thresholding (here taken as 0.5):
The outcome of a logistic regression model is a probability, so we convert it into a class prediction (0 or 1) by comparing it against a threshold value t.
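A minimal sketch of how the threshold trades sensitivity against specificity, reusing prdVal and train from the code above (the thresholds 0.3 and 0.7 are only for illustration, and the loop assumes both FALSE and TRUE predictions occur at every threshold):
#Compare sensitivity and specificity at a lower, the chosen, and a higher threshold
for (t in c(0.3, 0.5, 0.7)) {
  tab  <- table(act = train$Outcome, prd = prdVal > t)
  sens <- tab["1", "TRUE"]  / sum(tab["1", ])   # TP / (TP + FN)
  spec <- tab["0", "FALSE"] / sum(tab["0", ])   # TN / (TN + FP)
  print(sprintf("threshold %.1f: sensitivity %.2f, specificity %.2f", t, sens, spec))
}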