Question


Logistic Regression

In logistic regression we are interested in determining the outcome of a categorical variable. In most cases, we deal with binomial logistic regression with a binary response variable, for example yes/no, passed/failed, or true/false. Recall that logistic regression can be applied to classification problems when we want to determine the class of an event based on the values of its features.
  
In this assignment we will use the heart data located at  

http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29

Here is the description of the data:

This database contains 13 attributes (which have been extracted from a larger set of 75)     
  
Attribute Information:
------------------------
      -- 1. age     
      -- 2. sex     
      -- 3. chest pain type (4 values)     
      -- 4. resting blood pressure
      -- 5. serum cholesterol in mg/dl
      -- 6. fasting blood sugar > 120 mg/dl     
      -- 7. resting electrocardiographic results (values 0,1,2)
      -- 8. maximum heart rate achieved
      -- 9. exercise induced angina  
      -- 10. oldpeak = ST depression induced by exercise relative to rest
      -- 11. the slope of the peak exercise ST segment   
      -- 12. number of major vessels (0-3) colored by fluoroscopy
      -- 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
      -- 14. output: variable to be predicted, absence (1) or presence (2) of heart disease


Part 1: Load Data

Load data, heart.csv into a data frame named heartData.

Part 2: Preprocess, Clean the Data
In this data set some numerical features are actually categorical variables. The CHESTPAIN, THAL, and ECG features are all categorical. The EXERCISE variable is an ordered categorical variable, so it is treated as categorical too. These variables need to be encoded as factors.

A) Convert the CHESTPAIN, THAL, ECG, and EXERCISE features to factors.
   Recall that the function factor() can be used to accomplish that.
   Here is the code that converts CHESTPAIN to a factor:

   heartData$CHESTPAIN = factor(heartData$CHESTPAIN)

B) Logistic regression requires the target variable's values to lie between 0 and 1 (both included). In this data set the target variable OUTPUT has values of 1 and 2. If you don't change them, the glm() function stops with an error.

In the OUTPUT column, replace 1 by 0 and 2 by 1.

Hint: ifelse
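
The recoding hinted at above can be sketched as follows. This is a minimal example on a small made-up data frame named toy (a hypothetical stand-in for heartData), not the assignment's reference solution.

```r
# Toy stand-in for heartData; only the OUTPUT column matters here.
toy <- data.frame(OUTPUT = c(1, 2, 2, 1, 2))

# Recode: 1 (absence) -> 0, 2 (presence) -> 1.
toy$OUTPUT <- ifelse(toy$OUTPUT == 2, 1, 0)

# Count how many rows fall in each class after recoding.
table(toy$OUTPUT)
```

The same ifelse() call should work on heartData$OUTPUT, assuming the column holds only the values 1 and 2.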

Part 3: Graphs

A) Plot the scatter plot of AGE by CHOL.

Question: Is there any relationship between age and cholesterol?

B) Plot the scatter plot of AGE by resting blood pressure (RESTBP).

Question: Is there any relationship between age and resting blood pressure?

C) Plot the scatter plot of AGE by maximum heart rate achieved (MAXHR).

Question: Is there any relationship between age and maximum heart rate achieved?

D) Plot the histogram of AGE.

Question: Is the data left-skewed or right-skewed?

Part 4: Create Test and Training Data Frames

Split the data: 80% for training and 20% for testing.
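
One common way to make such a split is to sample row indices at random. The sketch below assumes a made-up data frame toy with 270 rows (the row count reported in the solution for the heart data); the names trainIdx, trainData, and testData are illustrative choices, and the seed is arbitrary.

```r
set.seed(42)  # arbitrary seed, only so the split is reproducible

# Toy stand-in with the same number of rows as the heart data (270).
toy <- data.frame(id = 1:270, AGE = sample(29:77, 270, replace = TRUE))

# Draw 80% of the row indices at random for training; the rest is test.
trainIdx  <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
trainData <- toy[trainIdx, ]
testData  <- toy[-trainIdx, ]

dim(trainData)
dim(testData)
```

Negative indexing with -trainIdx keeps every row that was not sampled, so the two data frames do not overlap.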

Part 5: Build Logistic Regression Model

Build a model using the glm() function with the parameter family=binomial(link="logit"). The formula for glm() should have OUTPUT as the dependent variable and all other variables as the independent variables. In other words, we would like to regress OUTPUT on the remaining predictors.

Use the summary() function to print the model result.
Question: Based on the output of summary(), which variables seem to be the most important?

Hint: Refer to textbook pages 159-161.

Part 6: Make Predictions

Use the predict() function to find the predictions of the model on the test data created in Part 4 above. Make sure that you are using the parameter type="response" in the predict() function.

Hint: Check listing 7.10 in the textbook. Also, lecture notes have samples.
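
Parts 5 and 6 together can be sketched as below. The data here is synthetic (simulated with an arbitrary seed and made-up coefficients), so the fitted numbers mean nothing; only the glm() and predict() calls mirror what the assignment asks for. The names toy, trainData, testData, model, and probs are assumptions for illustration.

```r
set.seed(1)  # arbitrary; synthetic data only

# Synthetic stand-in for the heart data: one numeric and one factor predictor.
n   <- 200
toy <- data.frame(
  AGE  = round(runif(n, 30, 75)),
  THAL = factor(sample(c(3, 6, 7), n, replace = TRUE))
)
# Simulate a 0/1 outcome whose log-odds rise with AGE (made-up coefficients).
toy$OUTPUT <- rbinom(n, 1, plogis(-5 + 0.1 * toy$AGE))

trainData <- toy[1:160, ]
testData  <- toy[161:200, ]

# OUTPUT ~ . regresses OUTPUT on every other column in the data frame.
model <- glm(OUTPUT ~ ., data = trainData, family = binomial(link = "logit"))
summary(model)

# type = "response" returns probabilities of class 1 rather than log-odds.
probs <- predict(model, newdata = testData, type = "response")
head(probs)
```

In the summary() output, predictors with small p-values (the Pr(>|z|) column) are the ones usually read as most important.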

Part 7: Performance of the Model

A) Confusion Matrix: The output of the predict() function is the probability of the input belonging to class 1. We can perform binary classification by applying a threshold.

Use a threshold of 0.5 for the output returned from the predict() function. In other words, flag all values returned by predict() that are greater than 0.5. Then apply as.numeric() to convert the resulting logical values (FALSE, TRUE) to 0 or 1 so that the table() function can be used to print predicted values vs. real values. (This is required because the OUTPUT column has 0 or 1 as its values.)


Check textbook page 164, listing 7.13.

Question: Based on the confusion matrix you found, how many individuals were correctly classified as having heart disease? How many of them were incorrectly classified?
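
The thresholding step described above can be sketched on a handful of made-up numbers. The vectors probs and actual below are hypothetical values, not results from the heart data.

```r
# Hypothetical predicted probabilities and true labels for six test rows.
probs  <- c(0.91, 0.12, 0.67, 0.45, 0.08, 0.73)
actual <- c(1,    0,    1,    1,    0,    0)

# TRUE/FALSE from the threshold, then converted to 1/0.
predicted <- as.numeric(probs > 0.5)

# Rows are predictions, columns are the real values.
confusion <- table(Predicted = predicted, Actual = actual)
confusion
```

In this layout the cell confusion["1", "1"] counts individuals correctly classified as having heart disease, while the off-diagonal cells count the misclassified ones.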

B) ROC Curve
For this part you need to install the package ROCR.

i) Plot the ROC curve

Note: Your graph might be similar but not exactly the same as the one shown above.


ii) Print the AUC (area under the curve).

Hint: Textbook page 102, listing 5.7.
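
The assignment asks for ROCR here. As a package-free cross-check, the sketch below computes the AUC in base R with the rank (Mann-Whitney) formula instead; this is a swapped-in technique, not the ROCR workflow, and the probs and actual vectors are made-up values.

```r
# Hypothetical predicted probabilities and true 0/1 labels.
probs  <- c(0.91, 0.12, 0.67, 0.45, 0.08, 0.73)
actual <- c(1,    0,    1,    1,    0,    0)

nPos <- sum(actual == 1)
nNeg <- sum(actual == 0)

# AUC equals the share of (positive, negative) pairs in which the
# positive case receives the higher score; ranks give this directly.
r   <- rank(probs)
auc <- (sum(r[actual == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
auc
```

With ROCR installed, the equivalent calls would be prediction(probs, actual) followed by performance(pred, "tpr", "fpr") for the curve and performance(pred, "auc") for the area, where pred is the object returned by prediction().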

Part 8: Visualize Heart Disease By Age Range Using a Dodged Bar Chart

A) Use the AGE predictor to create the following ranges with the cut() function.

[10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80]

Add the result of the cut() function as a new column named AgeRange to the data frame heartData you created in Part 1.

B) Use a dodged bar chart to display the distribution of heart disease data using AgeRange.


Hint: Textbook pages 58-60
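
A sketch of both steps of Part 8 on made-up data follows. It uses base R's barplot() with beside = TRUE for the dodged bars, which is an assumption on my part; the textbook pages referenced may use a different plotting package. The data frame toy and its values are hypothetical.

```r
# Toy stand-in for heartData with hypothetical values.
toy <- data.frame(
  AGE    = c(29, 35, 41, 47, 52, 58, 63, 67, 71, 77),
  OUTPUT = c(0,  0,  0,  1,  0,  1,  1,  1,  0,  1)
)

# Breaks at 10, 20, ..., 80; include.lowest = TRUE closes the first
# interval on the left, giving [10,20] (20,30] ... (70,80].
toy$AgeRange <- cut(toy$AGE, breaks = seq(10, 80, by = 10),
                    include.lowest = TRUE)
levels(toy$AgeRange)

# Counts of OUTPUT within each age range; beside = TRUE draws the
# bars side by side (dodged) instead of stacked.
counts <- table(toy$OUTPUT, toy$AgeRange)
barplot(counts, beside = TRUE,
        legend.text = c("No disease", "Disease"),
        xlab = "Age range", ylab = "Count")
```

Each age range gets one bar per OUTPUT value, so the two classes can be compared within a range at a glance.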

Part 9: Apply Your Model

Question: Which of the following individuals will be classified as a patient with heart disease?

AGE  SEX  CHESTPAIN  RESTBP  CHOL  SUGAR  ECG  MAXHR  ANGINA  DEP  EXERCISE  FLOUR  THAL
65   1    3          150     122   1      1    809    1       4.5  1         2      3
56   0    4          95      200   0      2    360    0       3.6  2         3      7

Solutions

Expert Solution

Part 1: R code

library(readxl)
heartData <- read_excel("C:/Users/M1045151/Downloads/heardisease.xlsx")
View(heartData)

dim(heartData)

#there were 270 observations with 14 columns

Part 2A:

A) Convert CHESTPAIN,THAL, ECG, and EXERCISE features to factors.
   Recall that the function factor() can be used to accomplish that.
   Here is the code that converts CHESTPAIN to a factor:

R code:

heartData$CHESTPAIN = factor(heartData$CHESTPAIN)
heartData$THAL = factor(heartData$THAL)
heartData$ECG = factor(heartData$ECG)
heartData$EXERCISE = factor(heartData$EXERCISE)

str(heartData)

Use str() to check whether each column has been converted to a factor.

Part 3: Graphs

A) Plot the scatter plot of AGE by CHOL.

plot(heartData$AGE,heartData$CHOL,main="Scatter plot of Age vs Cholesterol")
cor(heartData$AGE,heartData$CHOL)

Question: Is there any relationship between age and cholesterol?

r = 0.2200563. There exists a weak positive relationship between age and cholesterol.

B) Plot the scatter plot of AGE by resting blood pressure (RESTBP).

plot(heartData$AGE,heartData$RESTBP,main="Scatter plot of Age vs RestBP")
cor(heartData$AGE,heartData$RESTBP)

Question: Is there any relationship between age and resting blood pressure?

r = 0.2730528. There exists a weak positive relationship between age and resting blood pressure.

C) Plot the scatter plot of AGE by maximum heart rate achieved (MAXHR).

plot(heartData$AGE,heartData$MAXHR,main="Scatter plot of Age vs MaxHR")
cor(heartData$AGE,heartData$MAXHR)

Question: Is there any relationship between age and maximum heart rate achieved?

r = -0.4022154. There exists a moderate negative relationship between age and maximum heart rate achieved.

D) Plot the histogram of AGE.

hist(heartData$AGE,main="Histogram of Age")

Question: Is data left-skewed or right-skewed?

Mean of AGE = 54.43333
Median of AGE = 55

Since mean < median, the distribution of AGE is slightly left-skewed (negatively skewed).

