Question

Question 1:

As explained in Lesson 5, data exploration through visualization is important because statistics alone might not tell the entire story. This was famously demonstrated by the English statistician Francis Anscombe, who presented four sets of data in 1973. The data are shown below.

     Data I          Data II         Data III        Data IV
    x      y        x      y        x      y        x      y
  10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
   8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
  13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
   9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
  11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
  14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
   6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
   4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
  12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
   7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
   5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89

Calculate the mean, variance, correlation, and linear regression for each data set (No data partition). Using base R or ggplot2, create a visual representation of this data. What does this visualization show?

Question 2:

  • How do you evaluate and compare the developed linear regression models in Question 1? Are there any issues with the models you built?
  • Is there any way to improve your linear regression models for Data I, II, and III?

Solutions

Expert Solution

For the data above, we have to fit a simple linear regression to each data set and visualize the data.

We will use R for this task.

1) Enter each data set into Excel and save it as a .csv file with two columns, x and y.

2) Import the data into R with read.csv().

3) Perform the analysis as given below.
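As an alternative to steps 1) and 2), the data frames can be built without going through Excel. The following is a minimal sketch, assuming the `anscombe` data frame that ships with R's `datasets` package (columns x1 to x4 and y1 to y4) holds the same values as the table in the question. The names `quartet` and `data1` to `data4` are chosen here for illustration only.

```r
# Build one two-column data frame (x, y) per data set from the
# built-in `anscombe` table, instead of reading .csv files.
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- paste0("data", 1:4)

# Individual data frames, named as in the analysis below
data1 <- quartet$data1
data2 <- quartet$data2
data3 <- quartet$data3
data4 <- quartet$data4

str(data1)  # 11 observations of x and y
```

With the data frames created this way, the read.csv() calls in the code below can be skipped.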

---Below is the R code for our problem

A) For data1

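# Read Data I; data1.csv is assumed to hold two columns named x and y
data1 <- read.csv(file = "data1.csv", header = TRUE)
data1

# Means and variances (columns must be referenced through the data frame)
mean_x <- mean(data1$x)
mean_x
mean_y <- mean(data1$y)
mean_y
var_x <- var(data1$x)
var_x
var_y <- var(data1$y)
var_y
correlation <- cor(data1$x, data1$y)
correlation

## Data visualization
plot(data1$x, data1$y)   # scatter plot
hist(data1$y)            # histogram
boxplot(data1$y)         # boxplot

## Proceeding to linear regression
# Stored as `fit` so that the name lm stays reserved for the function
fit <- lm(y ~ x, data = data1)
fit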
----Output----

> data1 <- read.csv(file = "data1.csv", header = TRUE)
> data1
    x     y
1  10  8.04
2   8  6.95
3  13  7.58
4   9  8.81
5  11  8.33
6  14  9.96
7   6  7.24
8   4  4.26
9  12 10.84
10  7  4.82
11  5  5.68
> mean_x <- mean(data1$x)
> mean_x
[1] 9
> mean_y <- mean(data1$y)
> mean_y
[1] 7.500909
> var_x <- var(data1$x)
> var_x
[1] 11
> var_y <- var(data1$y)
> var_y
[1] 4.127269
> correlation <- cor(data1$x, data1$y)
> correlation
[1] 0.8164205
> plot(data1$x, data1$y) ## scatter plot
> hist(data1$y) ## histogram
> boxplot(data1$y) ## boxplot
> fit <- lm(y ~ x, data = data1)
> fit

Call:
lm(formula = y ~ x, data = data1)

Coefficients:
(Intercept)            x
     3.0001       0.5001

Hence the regression model will be: ŷ = 3.0001 + 0.5001x

B) For data2

R code and output

> data2 <- read.csv(file = "data2.csv", header = TRUE)
> data2
    x    y
1  10 9.14
2   8 8.14
3  13 8.74
4   9 8.77
5  11 9.26
6  14 8.10
7   6 6.13
8   4 3.10
9  12 9.13
10  7 7.26
11  5 4.74
> mean_x <- mean(data2$x)
> mean_x
[1] 9
> mean_y <- mean(data2$y)
> mean_y
[1] 7.500909
> var_x <- var(data2$x)
> var_x
[1] 11
> var_y <- var(data2$y)
> var_y
[1] 4.127629
> correlation <- cor(data2$x, data2$y)
> correlation
[1] 0.8162365
> plot(data2$x, data2$y)
> hist(data2$y)
> boxplot(data2$y)
> fit <- lm(y ~ x, data = data2)
> fit

Call:
lm(formula = y ~ x, data = data2)

Coefficients:
(Intercept)            x
      3.001        0.500

Hence the regression model will be: ŷ = 3.001 + 0.500x

C) For data3

R code and output

> data3 <- read.csv(file = "data3.csv", header = TRUE)
> data3
    x     y
1  10  7.46
2   8  6.77
3  13 12.74
4   9  7.11
5  11  7.81
6  14  8.84
7   6  6.08
8   4  5.39
9  12  8.15
10  7  6.42
11  5  5.73
> mean_x <- mean(data3$x)
> mean_x
[1] 9
> mean_y <- mean(data3$y)
> mean_y
[1] 7.5
> var_x <- var(data3$x)
> var_x
[1] 11
> var_y <- var(data3$y)
> var_y
[1] 4.12262
> correlation <- cor(data3$x, data3$y)
> correlation
[1] 0.8162867
> plot(data3$x, data3$y)
> hist(data3$y)
> boxplot(data3$y)
> fit <- lm(y ~ x, data = data3)
> fit

Call:
lm(formula = y ~ x, data = data3)

Coefficients:
(Intercept)            x
     3.0025       0.4997

Hence the regression model will be: ŷ = 3.0025 + 0.4997x

D) For data4

R code and output (the x column follows the Data IV values given in the question: 8 for every observation except the eighth, which is 19)

> data4 <- read.csv(file = "data4.csv", header = TRUE)
> data4
    x     y
1   8  6.58
2   8  5.76
3   8  7.71
4   8  8.84
5   8  8.47
6   8  7.04
7   8  5.25
8  19 12.50
9   8  5.56
10  8  7.91
11  8  6.89
> mean_x <- mean(data4$x)
> mean_x
[1] 9
> mean_y <- mean(data4$y)
> mean_y
[1] 7.500909
> var_x <- var(data4$x)
> var_x
[1] 11
> var_y <- var(data4$y)
> var_y
[1] 4.123249
> correlation <- cor(data4$x, data4$y)
> correlation
[1] 0.8165214
> plot(data4$x, data4$y)
> hist(data4$y)
> boxplot(data4$y)
> fit <- lm(y ~ x, data = data4)
> fit

Call:
lm(formula = y ~ x, data = data4)

Coefficients:
(Intercept)            x
     3.0017       0.4999

Hence the regression model will be: ŷ = 3.0017 + 0.4999x
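
The per-data-set steps in sections A) to D) can also be run in one pass, which makes the four sets easier to compare side by side. The sketch below is one possible way to do this, not the only one. It assumes the built-in `anscombe` data frame matches the table in the question; `summarise_set`, `plot_quartet` and `stats` are illustrative names introduced here.

```r
# Rebuild the four data sets from the built-in `anscombe` table.
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
names(quartet) <- c("I", "II", "III", "IV")

# Mean, variance, correlation and least-squares coefficients for one set.
summarise_set <- function(d) {
  fit <- lm(y ~ x, data = d)
  c(mean_x    = mean(d$x),
    mean_y    = mean(d$y),
    var_x     = var(d$x),
    var_y     = var(d$y),
    cor_xy    = cor(d$x, d$y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
}

# One row per data set, one column per statistic.
stats <- t(sapply(quartet, summarise_set))
print(round(stats, 3))

# 2 x 2 panel of scatter plots on common axes, each with its fitted line.
plot_quartet <- function(sets) {
  old <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
  on.exit(par(old))
  for (nm in names(sets)) {
    d <- sets[[nm]]
    plot(d$x, d$y, xlim = c(3, 20), ylim = c(2, 14),
         xlab = "x", ylab = "y", main = paste("Data", nm), pch = 19)
    abline(lm(y ~ x, data = d), col = "red")
  }
}
plot_quartet(quartet)
```

Using the same axis limits in every panel is a deliberate choice: the fitted lines then look alike across panels, so any difference the eye picks up comes from the points themselves.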

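Conclusion:-

1) Issues with the models

a) The scatter plot for data2 shows that x and y do not have a linear relationship (the points follow a smooth curve), so simple linear regression cannot give correct predictions for data2.

b) The boxplots and histograms of y give only a rough picture of its distribution; with 11 observations they cannot confirm normality. Moreover, the regression assumptions concern the residuals rather than the response itself, so the residuals of each fitted model should be checked as well.

c) For model 2 we can use polynomial (quadratic) regression to get better predictions.

d) The regression model for data2 can be improved by adding a higher-order term of the current regressor. For data4 as given in the question, x equals 8 for every observation except one (x = 19), so the fitted slope is determined entirely by that single point; observations at other x values are needed before any regression on x can be trusted.

e) Data1 is well described by simple linear regression. Data3 is also linear, but the observation (13, 12.74) is an outlier that pulls the fitted line away from the other points; refitting without it, or using a robust regression, improves the model.

f) All the above comparisons are made using the scatter plot, box plot and histogram.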
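
The points above can be checked numerically. The following is a hedged sketch rather than a definitive procedure: it assumes the built-in `anscombe` data frame matches the table in the question, uses R-squared as a simple comparison measure, and treats the observation with the largest absolute residual in Data III as the outlier. All object names (`lin2`, `quad2`, `lin3_trim`, and so on) are illustrative.

```r
# Rebuild the data sets from the built-in `anscombe` table.
quartet <- lapply(1:4, function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
})
d2 <- quartet[[2]]
d3 <- quartet[[3]]
d4 <- quartet[[4]]

# Helper: coefficient of determination of a fitted model.
r2 <- function(fit) summary(fit)$r.squared

# Data II: straight line versus quadratic in x.
lin2  <- lm(y ~ x, data = d2)
quad2 <- lm(y ~ x + I(x^2), data = d2)
print(c(linear = r2(lin2), quadratic = r2(quad2)))
print(anova(lin2, quad2))   # F-test for the added quadratic term

# Data III: refit after dropping the point with the largest residual.
lin3      <- lm(y ~ x, data = d3)
worst     <- which.max(abs(resid(lin3)))
lin3_trim <- lm(y ~ x, data = d3[-worst, ])
print(d3[worst, ])          # the observation that was dropped
print(c(all_points = r2(lin3), trimmed = r2(lin3_trim)))

# Data IV: leverage (hat values) shows which point controls the fit.
lin4 <- lm(y ~ x, data = d4)
lev4 <- hatvalues(lin4)
print(round(lev4, 3))
```

Dropping a point is only defensible when there is a substantive reason to regard it as erroneous; a robust fit is an alternative that keeps all observations.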

