In: Statistics and Probability
Please use R to solve parts (e) and (f).
The data file data2.txt gives a data set with two variables, x and y. The first column in the data set is just row numbers and is not needed for this question.
(e) Use the Shapiro-Wilk test to test for Normality of the data. State your null and alternative hypotheses, p-value and conclusion. Use α = 0.05.
(f) Apply the transformation y0 = log(y) and run the regression of y0 on x. Now repeat parts (c), (d) and (e) with the residuals from this transformed model.
Data file:
Row x y
1 60.26 63.95
2 64.64 67.42
3 69.17 73.36
4 61.49 65.30
5 65.10 68.74
6 61.34 65.22
7 84.12 86.78
8 73.58 75.01
9 69.51 71.46
10 51.94 56.18
11 54.39 58.07
12 69.25 72.53
13 76.64 78.53
14 73.16 75.50
15 67.99 70.37
16 42.23 47.08
17 62.95 67.80
18 70.12 74.66
19 63.96 66.32
20 60.32 64.22
21 60.33 64.56
22 55.28 58.95
23 51.48 58.13
24 76.90 84.55
25 69.79 71.75
26 79.31 80.64
27 68.12 71.40
28 65.70 68.22
29 50.85 54.99
30 59.47 63.75
I am sharing the R Markdown output file. R commands are in blue, output is in bold black, and comments/conclusions are in italic bold brown.
library(xlsx)
Importing the data frame
xy_data = read.xlsx("C:\\Users\\ADMIN\\Desktop\\xydata.xlsx", 1)
Accessing the x and y variables from the data
x = xy_data$x
y = xy_data$y
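The question supplies the data as data2.txt, while the code above reads an Excel copy from a local path. A minimal sketch of loading the same values with base R only is shown here. To keep it self-contained, the rows from the question are pasted in as text. With the actual file, `read.table("data2.txt", header = TRUE)` should be the equivalent call, assuming the file has the `Row x y` header shown in the question and sits in the working directory.

```r
# Data copied from the question. With the real file this would be
# read.table("data2.txt", header = TRUE), assuming that file layout.
raw_text <- "Row x y
1 60.26 63.95
2 64.64 67.42
3 69.17 73.36
4 61.49 65.30
5 65.10 68.74
6 61.34 65.22
7 84.12 86.78
8 73.58 75.01
9 69.51 71.46
10 51.94 56.18
11 54.39 58.07
12 69.25 72.53
13 76.64 78.53
14 73.16 75.50
15 67.99 70.37
16 42.23 47.08
17 62.95 67.80
18 70.12 74.66
19 63.96 66.32
20 60.32 64.22
21 60.33 64.56
22 55.28 58.95
23 51.48 58.13
24 76.90 84.55
25 69.79 71.75
26 79.31 80.64
27 68.12 71.40
28 65.70 68.22
29 50.85 54.99
30 59.47 63.75"

# Parse the text into a data frame; the Row column is kept but not used.
xy_data <- read.table(text = raw_text, header = TRUE)

x <- xy_data$x
y <- xy_data$y

# Quick structural check of what was read.
str(xy_data)
```

This avoids the dependency on the xlsx package and on a machine-specific path.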
e) Checking normality of both variables.
Level of significance (alpha) = 0.05.
Hypotheses:
H0: The data come from a normally distributed population. VS
H1: The data do not come from a normally distributed population.
For variable x -
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.98819, p-value = 0.9786
p-value = 0.9786
p-value > alpha, i.e. 0.9786 > 0.05
Decision - Fail to reject H0 (the null hypothesis is rejected only if p-value < alpha).
Conclusion - At the 5% level there is no evidence that variable x departs from a normal distribution.
For variable y -
H0: The data come from a normally distributed population. VS
H1: The data do not come from a normally distributed population.
shapiro.test(y)
##
## Shapiro-Wilk normality test
##
## data: y
## W = 0.98766, p-value = 0.9735
p-value = 0.9735
p-value > alpha, i.e. 0.9735 > 0.05
Decision - Fail to reject H0
Conclusion - At the 5% level there is no evidence that variable y departs from a normal distribution.
Hence, the Shapiro-Wilk test gives no evidence against normality for either variable.
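Part (f) asks to repeat part (e) "with the residuals from this transformed model", which suggests part (e) may have been intended for the residuals of the original regression of y on x rather than for the raw variables. That reading is an assumption, since parts (c) and (d) are not shown. Under it, a minimal sketch of the check would look like this. The names `fit0`, `res0` and `sw0` are illustrative, and the data are typed in from the question.

```r
# x and y copied from the data listed in the question.
x <- c(60.26, 64.64, 69.17, 61.49, 65.10, 61.34, 84.12, 73.58, 69.51, 51.94,
       54.39, 69.25, 76.64, 73.16, 67.99, 42.23, 62.95, 70.12, 63.96, 60.32,
       60.33, 55.28, 51.48, 76.90, 69.79, 79.31, 68.12, 65.70, 50.85, 59.47)
y <- c(63.95, 67.42, 73.36, 65.30, 68.74, 65.22, 86.78, 75.01, 71.46, 56.18,
       58.07, 72.53, 78.53, 75.50, 70.37, 47.08, 67.80, 74.66, 66.32, 64.22,
       64.56, 58.95, 58.13, 84.55, 71.75, 80.64, 71.40, 68.22, 54.99, 63.75)

# Untransformed model: y as response, x as predictor.
fit0 <- lm(y ~ x)

# Residuals of the untransformed model.
res0 <- residuals(fit0)

# Shapiro-Wilk test on those residuals.
# H0: residuals are normally distributed; H1: they are not.
sw0 <- shapiro.test(res0)
print(sw0)

# Same decision rule as in the answer: reject H0 only if p-value < alpha.
alpha <- 0.05
if (sw0$p.value < alpha) {
  cat("Reject H0 at alpha =", alpha, "\n")
} else {
  cat("Fail to reject H0 at alpha =", alpha, "\n")
}
```

The result can then be compared directly with the residual test for the log-transformed model in part (f).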
f) Transformation on y -
y0 = log(y)
Using the natural log (base e) function to transform y:
y0 = log(y)
Fitting the regression of y0 on x, taking y0 as the response and x as the predictor.
fit = lm(y0 ~ x)
summary(fit)
##
## Call:
## lm(formula = y0 ~ x)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.046893 -0.012009 -0.004173  0.010305  0.051214
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.3050759  0.0265154  124.65   <2e-16 ***
## x           0.0140579  0.0004061   34.62   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02037 on 28 degrees of freedom
## Multiple R-squared: 0.9772, Adjusted R-squared: 0.9764
## F-statistic: 1198 on 1 and 28 DF, p-value: < 2.2e-16
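R-squared (0.9772) and adjusted R-squared (0.9764) are both close to 1, so the fitted line explains about 98% of the variation in y0. Hence the model fits the data well.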
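Because the response is log(y), the slope is easier to read after back-transforming. The following is a small sketch of that arithmetic, using the coefficient estimates copied from the summary output above rather than re-fitting the model. The value `x_new = 60` is a hypothetical input chosen only for illustration.

```r
# Coefficients copied from the summary output (not re-estimated here).
b0 <- 3.3050759   # intercept on the log scale
b1 <- 0.0140579   # slope on the log scale

# A one-unit increase in x multiplies the fitted y by exp(b1).
multiplier <- exp(b1)
pct_change <- 100 * (multiplier - 1)   # roughly 1.4 percent per unit of x
cat("Multiplier per unit of x:", round(multiplier, 4), "\n")
cat("Percent change per unit of x:", round(pct_change, 2), "\n")

# Back-transformed fitted value at a hypothetical x.
x_new <- 60
y_hat <- exp(b0 + b1 * x_new)
cat("Fitted y at x =", x_new, ":", round(y_hat, 2), "\n")
```

Note that exponentiating a fitted log value estimates the median of y at that x, not its mean.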
Accessing the residuals from the fitted model
residuals = fit$residuals
Checking normality of the residuals
H0: The residuals come from a normally distributed population. VS
H1: The residuals do not come from a normally distributed population.
shapiro.test(residuals)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.98349, p-value = 0.9088
p-value = 0.9088
p-value > alpha, i.e. 0.9088 > 0.05
Decision - Fail to reject H0
Conclusion - At the 5% level there is no evidence that the residuals from the regression model depart from a normal distribution.
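The same decision rule (reject H0 only if the p-value is below alpha) is applied three times in this answer. One way to avoid repeating it by hand is a small wrapper around `shapiro.test`. The function name `shapiro_decision` and its return format are hypothetical, not part of base R. It is demonstrated here on the x values from the question.

```r
# Hypothetical helper: runs the Shapiro-Wilk test and applies the
# "reject H0 if p-value < alpha" rule used throughout the answer.
shapiro_decision <- function(values, alpha = 0.05) {
  test <- shapiro.test(values)
  decision <- if (test$p.value < alpha) "reject H0" else "fail to reject H0"
  list(
    W = unname(test$statistic),   # Shapiro-Wilk W statistic
    p_value = test$p.value,       # p-value of the test
    alpha = alpha,                # significance level used
    decision = decision           # outcome of the decision rule
  )
}

# x values copied from the data listed in the question.
x <- c(60.26, 64.64, 69.17, 61.49, 65.10, 61.34, 84.12, 73.58, 69.51, 51.94,
       54.39, 69.25, 76.64, 73.16, 67.99, 42.23, 62.95, 70.12, 63.96, 60.32,
       60.33, 55.28, 51.48, 76.90, 69.79, 79.31, 68.12, 65.70, 50.85, 59.47)

result <- shapiro_decision(x)
print(result)
```

The same call can be made on y or on the residuals of a fitted model, for example `shapiro_decision(residuals(fit))`.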