Question

Let's set up a regression problem by generating x and y variables. x is the explanatory variable and y is the response variable.

```{r}
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x*4.3 + 5 + rnorm(100, 0, 40)
```

What is the theoretical slope of the population?

What is the theoretical intercept of the population?

What is the slope that is inferred from creating a regression model for the randomly generated sample?

What is the intercept that is inferred from creating a regression model for the randomly generated sample?

What is the correlation between x and y?

What proportion of the total variance is explained by the regression model?

Modify a point in the data:

```{r}
y[3] <- 250
```

This point is now ...

(a) none of the above

(b) an influential outlier

(c) a high leverage outlier

(d) an outlier

Please show all the R commands used to obtain the results.

Solutions

Expert Solution

set.seed(8)
x <- rnorm(100, 5, 10)
y <- x*4.3 + 5 + rnorm(100, 0, 40)

The code above generates the data from a model of the form

y = β0 + β1x + ϵ = 5 + 4.3x + ϵ, where ϵ is normal noise with mean 0 and standard deviation 40.

What is the theoretical slope of the population?

Theoretical slope: β1 = 4.3, the coefficient multiplying x in the generating code.

What is the theoretical intercept of the population?

Theoretical intercept: β0 = 5, the constant added in the generating code.
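One way to make the "population" values concrete is to draw a much larger sample from the same generating process and check that the fitted coefficients settle near 4.3 and 5. The sketch below is illustrative only; the names `n_big`, `x_big`, `y_big`, and `coef_big`, and the sample size of 100000, are assumptions that do not appear in the original question.

```r
# Illustrative check: same generating process as the question, but with a
# much larger (assumed) sample size so the estimates sit close to the
# theoretical values.
set.seed(8)
n_big <- 100000                                   # assumed, not from the question
x_big <- rnorm(n_big, 5, 10)                      # x ~ N(5, 10^2)
y_big <- x_big * 4.3 + 5 + rnorm(n_big, 0, 40)    # y = 5 + 4.3 x + noise

# Fit the simple linear regression and pull out the two coefficients.
coef_big <- coef(lm(y_big ~ x_big))
coef_big   # intercept should be near 5, slope near 4.3
```

With only 100 observations, as in the question, the estimates vary far more around these values.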

Specify and fit the linear model in R:

library(tidyverse)  # provides ggplot2, used for the plot further down
dfin <- data.frame(y, x)
fit <- lm(y ~ x, data = dfin)  # formula given unquoted, the idiomatic form
res <- summary(fit)
res

What is the slope that is inferred from creating a regression model for the randomly generated sample?

res$coefficients

##             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 4.759522 4.5943402 1.035953 3.027734e-01
## x           4.325100 0.4003176 10.804170 2.194447e-18

Estimated slope: β̂1 = 4.3251

What is the intercept that is inferred from creating a regression model for the randomly generated sample?

Estimated intercept: β̂0 = 4.759522
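The estimates can also be read directly with `coef()`, and `confint()` shows how much uncertainty surrounds them. This is a minimal, self-contained sketch that regenerates the data from the question; the names `b` and `ci` are introduced here for illustration.

```r
# Regenerate the sample from the question and fit the model.
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)
fit <- lm(y ~ x)

# coef() returns a named vector with entries "(Intercept)" and "x".
b <- coef(fit)
b[["x"]]             # estimated slope
b[["(Intercept)"]]   # estimated intercept

# 95% confidence intervals, one row per coefficient.
ci <- confint(fit, level = 0.95)
ci
```

Comparing the intervals in `ci` with the theoretical values 4.3 and 5 shows whether the sample estimates are consistent with the generating model.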

What is the correlation between x and y?

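In simple linear regression, the magnitude of the correlation equals the square root of R-squared, and its sign matches the sign of the slope (positive here).

res$r.squared
sqrt(res$r.squared)

Hence the correlation between x and y is 0.7372923.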
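The square-root shortcut drops the sign, so it is worth cross-checking against `cor()`, which computes the correlation directly. The sketch below regenerates the data from the question; `r` and `r_from_r2` are names introduced here for illustration.

```r
# Regenerate the sample from the question and fit the model.
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)
fit <- lm(y ~ x)

# Direct Pearson correlation between x and y.
r <- cor(x, y)

# Same quantity recovered from R-squared, with the sign taken from the slope.
r_from_r2 <- sign(coef(fit)[["x"]]) * sqrt(summary(fit)$r.squared)

# The two agree up to floating-point error in simple linear regression.
all.equal(r, r_from_r2)
```

If the slope were negative, `sqrt(res$r.squared)` alone would report the wrong sign, whereas `cor(x, y)` would not.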

What proportion of the total variance is explained by the regression model?

The proportion of the total variance explained by the regression model is R-squared = 0.5436131.

res$r.squared

## [1] 0.5436131
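R-squared can be reproduced by hand as 1 − SSE/SST, where SSE is the residual sum of squares and SST is the total sum of squares of y around its mean. The sketch below is a self-contained illustration; `sse`, `sst`, and `r2_manual` are names introduced here.

```r
# Regenerate the sample from the question and fit the model.
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)
fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)     # variation left unexplained by the model
sst <- sum((y - mean(y))^2)      # total variation of y around its mean
r2_manual <- 1 - sse / sst       # proportion of variance explained

# Matches the value reported by summary().
all.equal(r2_manual, summary(fit)$r.squared)
```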

Modify a point in the data:

y[3] <- 250
dfin_new <- data.frame(y,x)

This point is now …

(a) none of the above

(b) an influential outlier - an outlier is influential if changing or removing it noticeably alters the fitted regression. After the modification, the intercept of the fitted model changed from 4.7595 to 8.2571, so the point is treated as influential here.

fit2 <- lm(y ~ x, data = dfin_new)  # refit on the modified data
summary(fit2)

(c) a high leverage outlier - a point has high leverage if its x value lies far from the other x values, which is not the case for this data point; only its y value was changed. The plot below shows this as well.

(d) an outlier - a point is an outlier if its y value differs substantially from the general trend. The plot below shows that this is the case, so the point is an outlier.

ggplot(dfin_new, aes(x, y, color = y)) +  # plot the modified data, not the original
  geom_point()

Options (b) and (d) are correct.
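The classification above rests on a coefficient comparison and a plot. Standard regression diagnostics offer a more formal check: `hatvalues()` for leverage, `rstudent()` for outlyingness, and `cooks.distance()` for influence. The sketch below is one possible way to apply them, not the original author's method. The names `y_new`, `i`, `diag3`, and `cutoffs` are introduced here, and the cutoffs (2p/n for leverage, 2 for studentized residuals, 4/n for Cook's distance) are common rules of thumb assumed for illustration.

```r
# Regenerate the sample from the question and fit the original model.
set.seed(8)
x <- rnorm(100, 5, 10)
y <- x * 4.3 + 5 + rnorm(100, 0, 40)
fit <- lm(y ~ x)

# Apply the modification to a copy so both fits stay available.
y_new <- y
y_new[3] <- 250
fit2 <- lm(y_new ~ x)

# Side-by-side coefficients before and after the change.
rbind(original = coef(fit), modified = coef(fit2))

# Diagnostics for the modified observation.
i <- 3
n <- length(x)   # number of observations
p <- 2           # number of fitted coefficients (intercept and slope)
diag3 <- c(
  leverage = unname(hatvalues(fit2)[i]),       # depends only on x
  rstudent = unname(rstudent(fit2)[i]),        # externally studentized residual
  cooks    = unname(cooks.distance(fit2)[i])   # combines residual and leverage
)
diag3

# Rule-of-thumb cutoffs (assumed conventions, not hard thresholds).
cutoffs <- c(leverage = 2 * p / n, rstudent = 2, cooks = 4 / n)
cutoffs
```

Comparing each entry of `diag3` with the matching entry of `cutoffs` (using the absolute value for the studentized residual) indicates whether the point stands out on that measure. Because leverage depends only on x, changing y[3] leaves the leverage of the point unchanged.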

