Let's set up a regression problem by generating x and y variables. x is the explanatory variable and y is the response variable.
```{r}
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x*4.3 + 5 + rnorm(100, 0, 40)
```
What is the theoretical slope of the population?
What is the theoretical intercept of the population?
What is the slope that is inferred from creating a regression model for the randomly generated sample?
What is the intercept that is inferred from creating a regression model for the randomly generated sample?
What is the correlation between x and y?
What proportion of the total variance is explained by the regression model?
Modify a point in the data:
```{r}
y[3] <- 250
```
This point is now ...
(a) none of the above
(b) an influential outlier
(c) a high leverage outlier
(d) an outlier
Please show all the R commands used to obtain each result.

    set.seed(8)
    x <- rnorm(100, 5, 10)
    y <- x*4.3 + 5 + rnorm(100, 0, 40)

The theoretical equation used above is of the form
$$y = \beta_0 + \beta_1 x + \epsilon = 5 + 4.3x + \epsilon$$
What is the theoretical slope of the population?
Theoretical slope: $\beta_1 = 4.3$
What is the theoretical intercept of the population?
Theoretical intercept: $\beta_0 = 5$
Specifying and fitting the linear model in R.

    library(tidyverse)  # provides ggplot2, used for the plot further down
    dfin <- data.frame(y, x)
    fit <- lm(formula = y ~ x, data = dfin)
    res <- summary(fit)
    res

What is the slope that is inferred from creating a regression model for the randomly generated sample?

    res$coefficients
    ##             Estimate Std. Error   t value     Pr(>|t|)
    ## (Intercept) 4.759522  4.5943402  1.035953 3.027734e-01
    ## x           4.325100  0.4003176 10.804170 2.194447e-18

Estimated slope: $\hat\beta_1 = 4.3251$
What is the intercept that is inferred from creating a regression model for the randomly generated sample?
Estimated intercept: $\hat\beta_0 = 4.759522$
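One way to judge how close these sample estimates are to the population values is to look at confidence intervals for the coefficients. The following is a minimal sketch that regenerates the data from the question and calls base R's `confint()`; the names `ci`, `truth` and `covered` are introduced here for illustration only and are not part of the original answer.

```{r}
# Regenerate the data exactly as in the question
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)

fit <- lm(y ~ x)

# 95% confidence intervals for the intercept and the slope
ci <- confint(fit, level = 0.95)
ci

# Population values used to simulate the data (intercept 5, slope 4.3)
truth <- c("(Intercept)" = 5, x = 4.3)

# TRUE where the interval contains the corresponding population value
covered <- ci[, 1] <= truth & truth <= ci[, 2]
covered
```

If both entries of `covered` are `TRUE`, the sample estimates are statistically compatible with the theoretical slope and intercept at the 95% level.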
What is the correlation between x and y?
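In simple linear regression, the correlation has the same sign as the fitted slope, and its magnitude is the square root of R-squared. The slope here is positive, so the correlation is the positive square root:

    res$r.squared
    sqrt(res$r.squared)

Hence the correlation between x and y is approximately 0.7373.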
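The relationship between the correlation and R-squared can be checked directly with `cor()`. The sketch below assumes the same simulated data as in the question; `r_direct` and `r_from_r2` are illustrative names.

```{r}
# Regenerate the data exactly as in the question
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)

fit <- lm(y ~ x)

# Pearson correlation computed directly from the sample
r_direct <- cor(x, y)

# Same quantity recovered from R-squared, taking the sign of the fitted slope
r_from_r2 <- sign(coef(fit)[["x"]]) * sqrt(summary(fit)$r.squared)

# The two values should agree up to floating-point error
all.equal(r_direct, r_from_r2)
```

Calling `cor(x, y)` is the more direct route. The square-root shortcut only applies to a model with a single predictor, and the sign has to be taken from the slope.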
What proportion of the total variance is explained by the regression model?
The proportion of the total variance explained by the regression model is given by R-squared = 0.5436131, i.e. about 54%.

    res$r.squared
    ## [1] 0.5436131

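To make the "proportion of variance explained" interpretation concrete, R-squared can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a sketch under the same simulated data; `ss_res`, `ss_tot` and `r2_manual` are illustrative names.

```{r}
# Regenerate the data exactly as in the question
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)

fit <- lm(y ~ x)

# Variation left unexplained by the model
ss_res <- sum(residuals(fit)^2)

# Total variation of y around its mean
ss_tot <- sum((y - mean(y))^2)

# Proportion of the total variation explained by the model
r2_manual <- 1 - ss_res / ss_tot

# Should match the value reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```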
Modify a point in the data:

    y[3] <- 250
    dfin_new <- data.frame(y, x)

This point is now …
(a) none of the above
(b) an influential outlier - an outlier is influential if it substantially changes any aspect of the regression results. Refitting the model on the modified data changes the intercept from 4.7595 to 8.2571, hence the point is influential.

    fit2 <- lm(formula = y ~ x, data = dfin_new)
    summary(fit2)

(c) a high leverage outlier - a point has high leverage if its x value is exceptionally far from the other x values, which is not the case for this data point. The plot below shows this as well.
(d) an outlier - a point is an outlier if its y value differs markedly from the general trend. The plot below shows that this is the case, hence the point is an outlier.

    # plot the modified data so that the changed point is visible
    ggplot(dfin_new, aes(x, y, color = y)) +
      geom_point()

Options (b) and (d) are correct.
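The classification above rests on a visual check and on the shift in the intercept. As a complementary check, base R provides numerical diagnostics for leverage, outlyingness and influence. The sketch below is one possible way to inspect the modified point. The cutoffs `2 * p / n` for leverage and `4 / n` for Cook's distance are common rules of thumb assumed here, not thresholds taken from the original answer, and `fit_before`, `fit_after` and `diagnostics` are illustrative names.

```{r}
# Regenerate the data and fit the model before and after the modification
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)
fit_before <- lm(y ~ x)

y[3] <- 250
fit_after <- lm(y ~ x)

i <- 3                        # index of the modified point
n <- length(y)                # number of observations
p <- length(coef(fit_after))  # number of coefficients (intercept + slope)

diagnostics <- c(
  leverage       = hatvalues(fit_after)[[i]],       # how unusual x[3] is
  lev_cutoff     = 2 * p / n,                       # rule-of-thumb cutoff (assumption)
  stud_residual  = rstudent(fit_after)[[i]],        # how unusual y[3] is given x[3]
  cooks_distance = cooks.distance(fit_after)[[i]],  # overall influence on the fit
  cook_cutoff    = 4 / n                            # rule-of-thumb cutoff (assumption)
)
diagnostics

# Coefficients side by side, to see how much both intercept and slope moved
rbind(before = coef(fit_before), after = coef(fit_after))
```

A leverage value above its cutoff would point towards option (c), a large absolute studentized residual (often taken as above 2 or 3) towards option (d), and a Cook's distance above its cutoff towards option (b).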