Logistic Regression
In logistic regression we are interested in predicting the
outcome of a categorical variable. In most cases we deal with
binomial logistic regression, where the response variable is binary:
for example yes/no, passed/failed, or true/false. Recall that
logistic regression can be applied to classification problems, where
we want to determine the class of an event based on the values of its
features.
In this assignment we will use the heart data located at
http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29
Here is the description of the data:
This database contains 13 attributes (which have been extracted
from a larger set of 75)
Attribute Information:
------------------------
-- 1. age
-- 2. sex
-- 3. chest pain type (4 values)
-- 4. resting blood pressure
-- 5. serum cholesterol in mg/dl
-- 6. fasting blood sugar > 120 mg/dl
-- 7. resting electrocardiographic results (values 0, 1, 2)
-- 8. maximum heart rate achieved
-- 9. exercise induced angina
-- 10. oldpeak = ST depression induced by exercise relative to rest
-- 11. the slope of the peak exercise ST segment
-- 12. number of major vessels (0-3) colored by fluoroscopy
-- 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
-- 14. output: variable to be predicted, absence (1) or presence (2) of heart disease
Part 1: Load Data
Load the data file heart.csv into a data frame named heartData.
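A minimal sketch of the loading step, assuming heart.csv is a comma-separated file with a header row. So that the example runs anywhere, it first writes a tiny made-up file to a temporary folder; that path, the three columns, and the two rows are illustrative only and are not the real data.

```r
# Hypothetical stand-in for heart.csv: two made-up rows, three of the 14 columns.
csvPath <- file.path(tempdir(), "heart.csv")
writeLines(c("AGE,SEX,OUTPUT", "63,1,2", "45,0,1"), csvPath)

# read.csv() returns a data frame; header = TRUE takes column names from line 1.
heartData <- read.csv(csvPath, header = TRUE)

dim(heartData)   # number of rows, number of columns
str(heartData)   # column names and types
```

With the real file, only the path passed to read.csv() would change.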
Part 2: Preprocess and Clean the Data
In this data set some features are stored as numbers but are
actually categorical variables. CHESTPAIN, THAL, and ECG are
categorical features. EXERCISE is an ordered categorical variable,
so it is treated as categorical as well. These variables need to be
encoded as factors.
A) Convert the CHESTPAIN, THAL, ECG, and EXERCISE features to factors.
Recall that the function factor() can be used to accomplish that.
Here is the code that converts CHESTPAIN to a factor:
heartData$CHESTPAIN = factor(heartData$CHESTPAIN)
B) Logistic regression requires the target variable's values to be
between 0 and 1 (both included). In this data set the target
variable OUTPUT has values of 1 and 2. If you don't change them,
the glm() function stops with an error.
In the OUTPUT column, replace 1 by 0 and 2 by 1.
Hint: ifelse
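The recoding in B) can be sketched as follows. The five-row data frame is a made-up stand-in for heartData, used only so the snippet runs on its own.

```r
# Hypothetical stand-in for the OUTPUT column (1 = absence, 2 = presence).
heartData <- data.frame(OUTPUT = c(1, 2, 2, 1, 2))

# ifelse(test, yes, no) is vectorised: every 1 becomes 0, every other value becomes 1.
heartData$OUTPUT <- ifelse(heartData$OUTPUT == 1, 0, 1)

table(heartData$OUTPUT)   # counts of 0s and 1s after recoding
```

Because the only values are 1 and 2, subtracting 1 from the column would give the same result; ifelse() just makes the mapping explicit.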
Part 3: Graphs
A) Plot the scatter plot of AGE by CHOL.
Question: Is there any relationship between age and cholesterol?
B) Plot the scatter plot of AGE by resting blood pressure (RESTBP).
Question: Is there any relationship between age and resting blood pressure?
C) Plot the scatter plot of AGE by maximum heart rate achieved (MAXHR).
Question: Is there any relationship between age and maximum heart rate achieved?
D) Plot the histogram of AGE.
Question: Is the data left-skewed or right-skewed?
Part 4: Create Test and Training Data Frames
Split the data: use 80% for training and the remaining 20% for testing.
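One common way to make such a split is to sample row indices at random. This is a sketch rather than the required method; the names trainIdx, trainData and testData, the seed, and the simulated values are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

# Hypothetical stand-in with 270 rows of made-up values.
heartData <- data.frame(AGE = sample(29:77, 270, replace = TRUE),
                        OUTPUT = rbinom(270, 1, 0.45))

# Draw 80% of the row indices at random for training; the rest go to testing.
trainIdx  <- sample(seq_len(nrow(heartData)), size = floor(0.8 * nrow(heartData)))
trainData <- heartData[trainIdx, ]
testData  <- heartData[-trainIdx, ]

c(train = nrow(trainData), test = nrow(testData))
```

Negative indexing with -trainIdx guarantees that no row ends up in both sets.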
Part 5: Build Logistic Regression Model
Build a model using the glm() function with the parameter
family=binomial(link="logit"). The formula for glm() should
have OUTPUT as the dependent variable and all other
variables as the independent variables. In other words, we would
like to regress OUTPUT on the remaining predictors.
Use the summary() function to print the model result.
Question: Based on the output of summary(),
which variables seem to be the most important?
Hint: Refer to textbook pages 159-161.
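A minimal sketch of the model-building step. The training data here is simulated with only two predictors (AGE and MAXHR) so the snippet is self-contained; the coefficients used to simulate it are arbitrary and say nothing about the real heart data.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical stand-in for the training data: two predictors and a 0/1 OUTPUT.
n <- 200
trainData <- data.frame(AGE   = round(runif(n, 30, 75)),
                        MAXHR = round(runif(n, 90, 190)))
# Simulate OUTPUT so that the risk rises with AGE and falls with MAXHR.
p <- plogis(-1 + 0.06 * (trainData$AGE - 55) - 0.03 * (trainData$MAXHR - 150))
trainData$OUTPUT <- rbinom(n, 1, p)

# "OUTPUT ~ ." regresses OUTPUT on every other column of the data frame.
model <- glm(OUTPUT ~ ., data = trainData, family = binomial(link = "logit"))

# The coefficient table lists estimate, standard error, z value and p-value;
# rows with small p-values are the predictors with the strongest evidence.
summary(model)
```

When reading the summary, factor predictors appear as one row per non-reference level, so a single variable such as CHESTPAIN can occupy several rows.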
Part 6: Make Predictions
Use the predict() function to find the predictions of the model on the test data created in Part 4 above. Make sure that you pass the parameter type="response" to predict().
Hint: Check listing 7.10 in the textbook. Also, lecture notes have samples.
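The prediction step might look like the sketch below. The data is simulated and the split is a plain head/tail split, both assumptions made only to keep the example short and runnable.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Hypothetical stand-in data with one predictor and a 0/1 OUTPUT.
n <- 250
heartData <- data.frame(AGE = round(runif(n, 30, 75)))
heartData$OUTPUT <- rbinom(n, 1, plogis(0.08 * (heartData$AGE - 55)))
trainData <- heartData[1:200, ]
testData  <- heartData[201:250, ]

model <- glm(OUTPUT ~ AGE, data = trainData, family = binomial(link = "logit"))

# type = "response" returns probabilities of class 1.
# The default, type = "link", would return log-odds instead.
testData$pred <- predict(model, newdata = testData, type = "response")

head(testData$pred)
range(testData$pred)
```

Checking that range() stays inside 0 and 1 is a quick way to confirm that type="response" was actually used.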
Part 7: Performance of the Model
A) Confusion Matrix: The output of the predict()
function is the probability that the input belongs to class 1. We
can perform binary classification by applying a threshold.
Use a threshold of 0.5 on the output returned by the predict()
function. In other words, test whether each value returned by
predict() is greater than 0.5. Then apply the function
as.numeric() to convert the results (FALSE, TRUE) to 0 or
1, so that the table() function can be used to print predicted
values against real values. (This is required because the OUTPUT
column has 0 or 1 as its values.)
Check textbook page 164, listing 7.13.
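The thresholding steps above can be sketched on a handful of made-up values. The vectors predProb and actual are hypothetical; in the assignment they would be the predict() output and the OUTPUT column of the test data.

```r
# Hypothetical predicted probabilities and true labels for eight test rows.
predProb <- c(0.91, 0.12, 0.65, 0.40, 0.77, 0.08, 0.55, 0.30)
actual   <- c(1,    0,    1,    1,    1,    0,    0,    0)

# Comparing with 0.5 gives FALSE/TRUE; as.numeric() turns that into 0/1.
predClass <- as.numeric(predProb > 0.5)

# Rows are predictions, columns are the real values.
confMatrix <- table(pred = predClass, actual = actual)
confMatrix

# Share of rows on the diagonal, that is, classified correctly.
accuracy <- sum(diag(confMatrix)) / sum(confMatrix)
accuracy
```

In this made-up example the cell pred = 1, actual = 1 holds 3 rows (correctly classified as having heart disease), one true case is missed, one healthy row is flagged incorrectly, and the accuracy is 0.75.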
Question: Based on the confusion matrix you found,
how many individuals were correctly classified as having heart
disease? How many of them were incorrectly classified?
B) ROC Curve
For this part you need to install the package ROCR.
i) Plot the ROC curve
Note: Your graph might be similar but not exactly the same as the one shown above.
ii) Print the AUC (area under the curve).
Hint: Textbook page 102, listing 5.7.
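The assignment asks for ROCR, whose prediction() and performance() functions do this job. As a cross-check that needs no extra package, the sketch below computes the ROC points by sweeping the threshold and the AUC through the rank-sum identity, both in base R. This is a substitute technique, not the ROCR route, and the two vectors are made-up stand-ins for the predictions and the true labels.

```r
# Hypothetical predicted probabilities and true 0/1 labels.
predProb <- c(0.91, 0.12, 0.65, 0.40, 0.77, 0.08, 0.55, 0.30)
actual   <- c(1,    0,    1,    1,    1,    0,    0,    0)

# Sweep the threshold from high to low and record one ROC point per step.
thresholds <- c(Inf, sort(unique(predProb), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(predProb[actual == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(predProb[actual == 0] >= t))

plot(fpr, tpr, type = "l", xlab = "False positive rate",
     ylab = "True positive rate", main = "ROC curve")
abline(0, 1, lty = 2)  # diagonal = random guessing

# AUC via the rank-sum identity: the probability that a randomly chosen
# positive gets a higher score than a randomly chosen negative.
nPos <- sum(actual == 1)
nNeg <- sum(actual == 0)
auc  <- (sum(rank(predProb)[actual == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
auc
```

For these eight made-up rows, 15 of the 16 positive/negative pairs are ordered correctly, so the AUC is 0.9375. An AUC of 0.5 corresponds to random guessing and 1 to perfect ranking.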
Part 8: Visualize Heart Disease By Age Range Using a Dodged Bar Chart
A) Use the AGE predictor to create the following ranges with the cut() function.
[10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]
Add the result of the cut() function as a new column named
AgeRange to the data frame heartData you created in Part 1.
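A sketch of the binning step in A). The seven ages are made up and chosen so that some sit exactly on a boundary, which shows which side of each interval is closed.

```r
# Hypothetical ages, including values that sit exactly on a boundary.
heartData <- data.frame(AGE = c(29, 40, 41, 55, 60, 61, 77))

# Breaks 10, 20, ..., 80 give seven intervals. right = TRUE (the default) closes
# each interval on the right; include.lowest = TRUE makes the first one [10,20].
heartData$AgeRange <- cut(heartData$AGE, breaks = seq(10, 80, by = 10),
                          include.lowest = TRUE)

levels(heartData$AgeRange)
table(heartData$AgeRange)
```

Here an age of 40 falls in (30,40] and an age of 41 in (40,50]. Without include.lowest = TRUE the first interval would be labelled (10,20] and an age of exactly 10 would become NA.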
B) Use a dodged bar chart to display the distribution of heart disease data using AgeRange.
Hint: Textbook pages 58-60
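A dodged bar chart places the bars of each group side by side instead of stacking them. The sketch below uses base R's barplot() with beside = TRUE to avoid extra packages; ggplot2's geom_bar(position = "dodge") is an equivalent route. The data is simulated and the legend labels assume OUTPUT has already been recoded to 0/1.

```r
set.seed(3)  # arbitrary seed for reproducibility

# Hypothetical stand-in data: made-up ages and 0/1 OUTPUT values.
heartData <- data.frame(AGE = sample(29:77, 270, replace = TRUE),
                        OUTPUT = rbinom(270, 1, 0.45))
heartData$AgeRange <- cut(heartData$AGE, breaks = seq(10, 80, by = 10),
                          include.lowest = TRUE)

# Cross-tabulate: one row per OUTPUT value, one column per age range.
counts <- table(heartData$OUTPUT, heartData$AgeRange)

# beside = TRUE draws the bars of each column side by side (dodged), not stacked.
barplot(counts, beside = TRUE,
        legend.text = c("No heart disease", "Heart disease"),
        xlab = "Age range", ylab = "Count",
        main = "Heart disease by age range")
```

Each age range then shows two adjacent bars, which makes it easy to compare the two outcome counts within a range.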
Part 9: Apply Your Model
Question: Which of the following individuals will be classified as a patient with heart disease?
AGE | SEX | CHESTPAIN | RESTBP | CHOL | SUGAR | ECG | MAXHR | ANGINA | DEP | EXERCISE | FLOUR | THAL
65 | 1 | 3 | 150 | 122 | 1 | 1 | 809 | 1 | 4.5 | 1 | 2 | 3
56 | 0 | 4 | 95 | 200 | 0 | 2 | 360 | 0 | 3.6 | 2 | 3 | 7
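To score new individuals, the new rows need the same column names as the training data, and factor columns must be built with the same levels the model was trained on. The sketch below illustrates this with a simulated model that uses only AGE and THAL; the AGE and THAL values of the two new rows are taken from the table, and everything else (the training data, the coefficients, the name newPatients) is an assumption for illustration.

```r
set.seed(4)  # arbitrary seed for reproducibility

# Hypothetical training data with one numeric and one factor predictor.
n <- 200
trainData <- data.frame(AGE  = round(runif(n, 30, 75)),
                        THAL = factor(sample(c(3, 6, 7), n, replace = TRUE)))
p <- plogis(-0.5 + 0.05 * (trainData$AGE - 55) + 1.2 * (trainData$THAL == "7"))
trainData$OUTPUT <- rbinom(n, 1, p)

model <- glm(OUTPUT ~ ., data = trainData, family = binomial(link = "logit"))

# New individuals: same column names, and factors built with the training levels.
newPatients <- data.frame(AGE  = c(65, 56),
                          THAL = factor(c(3, 7), levels = levels(trainData$THAL)))

prob <- predict(model, newdata = newPatients, type = "response")
data.frame(newPatients, prob = prob, predictedClass = as.numeric(prob > 0.5))
```

With the real model, the new data frame would carry all 13 predictors, and CHESTPAIN, ECG and EXERCISE would need the same factor treatment as THAL.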
Part 1: R code
library(readxl)
# Read the Excel copy of the data into a data frame
heartData <- read_excel("C:/Users/M1045151/Downloads/heardisease.xlsx")
View(heartData)
dim(heartData)
# There are 270 observations and 14 columns
Part 2A:
R code:
heartData$CHESTPAIN = factor(heartData$CHESTPAIN)
heartData$THAL = factor(heartData$THAL)
heartData$ECG = factor(heartData$ECG)
heartData$EXERCISE = factor(heartData$EXERCISE)
str(heartData)
# str() shows whether each column has been converted to a factor
Part 3: Graphs
A) Plot the scatter plot of AGE by CHOL.
plot(heartData$AGE, heartData$CHOL, main = "Scatter plot of Age vs Cholesterol")
cor(heartData$AGE, heartData$CHOL)
# r = 0.2200563
Question: Is there any relationship between age and cholesterol?
There is a weak positive relationship between age and cholesterol.
B) Plot the scatter plot of AGE by resting blood pressure (RESTBP).
plot(heartData$AGE, heartData$RESTBP, main = "Scatter plot of Age vs RestBP")
cor(heartData$AGE, heartData$RESTBP)
# r = 0.2730528
Question: Is there any relationship between age and resting blood pressure?
There is a weak positive relationship between age and resting blood pressure.
C) Plot the scatter plot of AGE by maximum heart rate achieved (MAXHR).
plot(heartData$AGE, heartData$MAXHR, main = "Scatter plot of Age vs MaxHR")
cor(heartData$AGE, heartData$MAXHR)
# r = -0.4022154
Question: Is there any relationship between age and maximum heart rate achieved?
There is a moderate negative relationship between age and maximum heart rate achieved.
D) Plot the histogram of AGE.
hist(heartData$AGE, main = "Histogram of Age")
Question: Is the data left-skewed or right-skewed?
Mean of AGE = 54.43333
Median of AGE = 55
The mean is less than the median, which suggests that the data is slightly left-skewed (negatively skewed).