Question

Write R code.

HW09: The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.

The pima dataset resulting from the study is available in R and contains the following variables:

test - results of a test to determine if the female patient shows signs of diabetes

         (coded 0 if negative, 1 if positive)

age - Age (years)

bmi - Body mass index (weight in kg/(height in metres squared))

diastolic - Diastolic blood pressure (mm Hg)

diabetes - Diabetes pedigree function

glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test

insulin - 2-Hour serum insulin (mu U/ml)

pregnant - Number of times pregnant

triceps - Triceps skin fold thickness (mm)

The following logistic model, with test as the response, was fit and produced the confusion table below:

logistic <- glm(test ~ age + bmi + diastolic + diabetes + glucose + insulin + pregnant + triceps,
                family = binomial(logit), data = pima)

            Reference
Prediction    0     1
         0  446   117
         1   54   151

A positive test result (test=1) was modeled as an event. Compute the following measures:

Accuracy

Sensitivity

Specificity

Positive Predictive Value

Negative Predictive Value

Solutions

Expert Solution

R-code

# Import the dataset
# Note: this CSV uses different column names than the pima data frame in the
# question (for example, Outcome instead of test).
library(readr)
db <- read_csv("C:/Users/admin/Downloads/diabetes.csv",
               col_types = cols(Age = col_integer(),
                                BMI = col_number(), BloodPressure = col_integer(),
                                DiabetesPedigreeFunction = col_number(),
                                Glucose = col_integer(), Insulin = col_integer(),
                                Outcome = col_integer(), Pregnancies = col_integer(),
                                SkinThickness = col_integer()))

# Understand the dataset
View(db)
head(db)
str(db)

# Summary statistics for each column
summary(db)

# Split the dataset into training (70%) and test (30%) sets
set.seed(5)
dt <- sort(sample(nrow(db), nrow(db) * 0.7))
train <- db[dt, ]
test <- db[-dt, ]

# Check the sizes of the split datasets
nrow(db)
nrow(train)
nrow(test)

# Fit the logistic regression model on the training set
logistic <- glm(Outcome ~ Age + BMI + BloodPressure + DiabetesPedigreeFunction +
                  Glucose + Insulin + Pregnancies + SkinThickness,
                family = binomial(logit), data = train)
summary(logistic)

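# Confusion matrix: a predicted probability above 0.5 is classified as positive (1)
library(caret)  # provides confusionMatrix()
prdVal <- predict(logistic, type = 'response')  # fitted probabilities for the training rows
summary(prdVal)
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd = prdBln, act = train$Outcome)
confusionMatrix(cnfmtrx, positive = "1")  # treat Outcome = 1 as the event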

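# Build the confusion matrix with a threshold value of 0.5
# (rows = actual Outcome, columns = whether the predicted probability exceeds 0.5)
threshold_0.5 <- table(train$Outcome, prdVal > 0.5)
threshold_0.5

# Accuracy: correct predictions / all predictions
accuracy_0.5 <- round(sum(diag(threshold_0.5)) / sum(threshold_0.5), 2)
sprintf("Accuracy is %s", accuracy_0.5)

# Sensitivity: TP / (TP + FN), which was 117 / (117 + 77) in the original run
sensitivity_0.5 <- round(threshold_0.5["1", "TRUE"] / sum(threshold_0.5["1", ]), 2)
sprintf("Sensitivity at 0.5 threshold: %s", sensitivity_0.5)

# Specificity: TN / (TN + FP), which was 304 / (304 + 39) in the original run
specificity_0.5 <- round(threshold_0.5["0", "FALSE"] / sum(threshold_0.5["0", ]), 2)
sprintf("Specificity at 0.5 threshold: %s", specificity_0.5)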

Formulae:

Confusion Matrix: compares the actual outcomes with the predicted ones

             Predicted = 0          Predicted = 1
Actual = 0   True Negatives (TN)    False Positives (FP)
Actual = 1   False Negatives (FN)   True Positives (TP)

Sensitivity = TP / (TP + FN) (True Positive rate)
Specificity = TN / (TN + FP) (True Negative rate)
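
The same definitions can be applied directly to the confusion table given in the question, without refitting anything. The sketch below assumes the table is read with rows as predictions and columns as reference values, and that test = 1 is the event, as the question states. The two predictive values the question asks for are not in the formulae above; the sketch uses the standard definitions PPV = TP / (TP + FP) and NPV = TN / (TN + FN). The variable names are illustrative only.

```r
# Counts taken from the confusion table in the question
# (rows = prediction, columns = reference; test = 1 is the event).
tn <- 446  # predicted 0, reference 0
fn <- 117  # predicted 0, reference 1
fp <- 54   # predicted 1, reference 0
tp <- 151  # predicted 1, reference 1

measures <- c(
  accuracy    = (tp + tn) / (tp + tn + fp + fn),  # correct / all
  sensitivity = tp / (tp + fn),                   # true positive rate
  specificity = tn / (tn + fp),                   # true negative rate
  ppv         = tp / (tp + fp),                   # positive predictive value
  npv         = tn / (tn + fn)                    # negative predictive value
)
print(round(measures, 4))
```

With these counts the measures come out to roughly 0.777 (accuracy), 0.563 (sensitivity), 0.892 (specificity), 0.737 (PPV) and 0.792 (NPV).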

The model with a higher threshold has lower Sensitivity but higher Specificity.
The model with a lower threshold has higher Sensitivity but lower Specificity.

Thresholding (a threshold of 0.5 is used here):

The outcome of a logistic regression model is a probability.
To turn that probability into a 0/1 prediction, we compare it with a threshold value t:

  • If P(y=1) >= t, predict 1
  • If P(y=1) < t, predict 0
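
The effect of the threshold on sensitivity and specificity can be seen in a small sketch. The probabilities and labels below are made-up values for illustration only, not output from the pima model, and the helper names classify and rates are hypothetical.

```r
# Hypothetical predicted probabilities and true labels (illustration only)
p <- c(0.10, 0.35, 0.55, 0.40, 0.62, 0.80)
y <- c(0, 0, 0, 1, 1, 1)

# Apply the rule: predict 1 when P(y = 1) >= cutoff, otherwise 0
classify <- function(p, cutoff) as.integer(p >= cutoff)

# Sensitivity and specificity at a given cutoff
rates <- function(p, y, cutoff) {
  pred <- classify(p, cutoff)
  c(cutoff      = cutoff,
    sensitivity = sum(pred == 1 & y == 1) / sum(y == 1),
    specificity = sum(pred == 0 & y == 0) / sum(y == 0))
}

# One row per cutoff
res <- t(sapply(c(0.3, 0.5, 0.7), function(cutoff) rates(p, y, cutoff)))
print(round(res, 2))
```

On this toy data, raising the cutoff from 0.3 to 0.7 lowers sensitivity and raises specificity, which matches the trade-off described above.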
