Using the data, fit an appropriate regression model to determine whether
time spent studying (hours) is a useful predictor of the chance of
passing the exam (result, 0=fail 1=pass). Formally assess
the overall fit of the model.
DATA three;
INPUT result hours;
/* result=0 is fail; result=1 is pass */
cards;
0 0.8
0 1.6
0 1.4
1 2.3
1 1.4
1 3.2
0 0.3
1 1.7
0 1.8
1 2.7
0 0.6
0 1.1
1 2.1
1 2.8
1 3.4
1 3.6
0 1.7
1 0.9
1 2.2
1 3.1
0 1.4
1 1.9
0 0.4
0 1.6
1 2.5
1 3.2
1 1.7
1 1.9
0 2.2
0 1.3
1 1.5
;
run;
Solution:
Because the outcome is binary (pass or fail), logistic regression is the appropriate model. The following Python code fits it with statsmodels.
import statsmodels.api as sm

# Exam outcome for each student: 0 = fail, 1 = pass
Pass_or_Fail=[0,0,0,1,1,1,0,1,0,1,0,0,1,1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,0,0,1]
# Hours spent studying, in the same order as the outcomes
Hours=[0.8,1.6,1.4,2.3,1.4,3.2,0.3,1.7,1.8,2.7,0.6,1.1,2.1,2.8,3.4,3.6,1.7,0.9,2.2,3.1,1.4,1.9,0.4,1.6,2.5,3.2,1.7,1.9,2.2,1.3,1.5]

# Note: sm.Logit does not add an intercept automatically, so this fits
# a model with Hours as the only term (no constant).
logit_model=sm.Logit(Pass_or_Fail,Hours)
result=logit_model.fit()
print(result.summary2())
This code gave the following result:
Optimization terminated successfully.
         Current function value: 0.607547
         Iterations 5
                        Results: Logit
==============================================================
Model:              Logit            Pseudo R-squared: 0.107
Dependent Variable: y                AIC:              39.6679
Date:               2020-04-27 17:29 BIC:              41.1019
No. Observations:   31               Log-Likelihood:   -18.834
Df Model:           0                LL-Null:          -21.083
Df Residuals:       30               LLR p-value:      nan
Converged:          1.0000           Scale:            1.0000
No. Iterations:     5.0000
--------------------------------------------------------------
       Coef.   Std.Err.     z      P>|z|    [0.025   0.975]
--------------------------------------------------------------
x1     0.4313    0.2017   2.1390   0.0324   0.0361   0.8266
==============================================================
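The x1 coefficient in this output is on the log-odds scale. A minimal sketch of converting it and its 95% limits to odds ratios, using only the values printed above (the variable names are illustrative):

```python
import math

# Values copied from the x1 row of the summary output
coef = 0.4313                      # change in log-odds per extra hour
ci_low, ci_high = 0.0361, 0.8266   # 95% limits on the log-odds scale

# Exponentiating moves from the log-odds scale to the odds-ratio scale
odds_ratio = math.exp(coef)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(f"odds ratio per extra hour: {odds_ratio:.2f}")
print(f"95% CI for the odds ratio: ({or_low:.2f}, {or_high:.2f})")
```

An odds ratio above 1, with a confidence interval that excludes 1, corresponds to a positive coefficient whose interval excludes 0.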
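We should not read the pseudo R-squared (0.107) the way we read R-squared in linear regression; it is a likelihood-based measure, not a proportion of variance explained. What we can interpret is the coefficient on hours. The value 0.4313 is a change in log-odds, not in odds: each additional hour of study multiplies the odds of passing by exp(0.4313), about 1.54. The coefficient is significant at the 5% level (z = 2.139, p = 0.0324), so hours of study is a useful variable for explaining the pass or fail result of the exam. Note, however, that sm.Logit does not add an intercept on its own, so the model above was fit without a constant. That is why the output shows Df Model 0 and an LLR p-value of nan, and it therefore gives no valid test of overall fit. A formal assessment of overall fit requires refitting the model with an intercept and comparing its log-likelihood with that of the intercept-only model (a likelihood-ratio test).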