Question

Data exploration through visualization is important because statistics alone might not tell the entire story. This was famously demonstrated by the English statistician Francis Anscombe in 1973, when he presented four sets of data. The data are shown below; show the code for each step:

Using R (RStudio), calculate the mean, variance, correlation, and linear regression for each data set (no data partition). Using base R or ggplot2, create a visual representation of this data. What does this visualization show? (Show the code for each data set.)

Using R (RStudio), how do you evaluate and compare the developed linear regression models? Are there any issues with the models built?

Is there any way to improve the linear regression models for Data I, II, and III?

Data I          Data II         Data III        Data IV
x      y        x      y        x      y        x      y
10.0   8.04     10.0   9.14     10.0   7.46     8.0    6.58
8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
13.0   7.58     13.0   8.74     13.0   12.74    8.0    7.71
9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
11.0   8.33     11.0   9.26     11.0   7.81     8.0    8.47
14.0   9.96     14.0   8.10     14.0   8.84     8.0    7.04
6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
4.0    4.26     4.0    3.10     4.0    5.39     19.0   12.50
12.0   10.84    12.0   9.13     12.0   8.15     8.0    5.56
7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
5.0    5.68     5.0    4.74     5.0    5.73     8.0    6.89

Solutions

For the data above, we need to fit a simple linear regression to each data set and visualize the data.

We will use R for this task.

1) Enter each data set into Excel and save it as a .csv file with columns x and y.

2) Import the data with read.csv().

3) Perform the analysis as given below.
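
As an optional alternative to steps 1 and 2, the same four data sets ship with base R as the built-in `anscombe` data frame (columns x1..x4 and y1..y4), so no Excel or CSV step is strictly needed. The sketch below builds data1 to data4 with columns named x and y, matching the names used in the rest of this solution. The helper `get_set` is a hypothetical name introduced only for this illustration.

```r
# Sketch: build the four data frames from R's built-in `anscombe` dataset
# instead of importing CSV files. Columns are x1..x4 and y1..y4.
data(anscombe)

# Hypothetical helper (not part of the original solution): extract data set i
# as a data frame with columns named x and y.
get_set <- function(i) {
  data.frame(x = anscombe[[paste0("x", i)]],
             y = anscombe[[paste0("y", i)]])
}

data1 <- get_set(1)
data2 <- get_set(2)
data3 <- get_set(3)
data4 <- get_set(4)

str(data4)  # show the structure of the fourth data set
```

With the data frames created this way, the read.csv() lines in the code that follows can be skipped.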

---Below is the R code for our problem

A) For data1

data1<-read.csv(file="data1.csv", header = TRUE)   ## file with Data I
data1
mean_x=mean(data1$x)
mean_x
mean_y=mean(data1$y)
mean_y
var_x=var(data1$x)
var_x
var_y=var(data1$y)
var_y
correlation=cor(data1$x,data1$y)
correlation

## Data visualization
plot(data1$x,data1$y)
hist(data1$y)
boxplot(data1$y)

##Proceeding to linear regression

model1=lm(y~x,data1)   ## named model1 so the lm() function is not masked
model1
----Output----

> data1<-read.csv(file="data1.csv", header = TRUE)
> data1
    x     y
1  10  8.04
2   8  6.95
3  13  7.58
4   9  8.81
5  11  8.33
6  14  9.96
7   6  7.24
8   4  4.26
9  12 10.84
10  7  4.82
11  5  5.68
> mean_x=mean(data1$x)
> mean_x
[1] 9
> mean_y=mean(data1$y)
> mean_y
[1] 7.500909
> var_x=var(data1$x)
> var_x
[1] 11
> var_y=var(data1$y)
> var_y
[1] 4.127269
> correlation=cor(data1$x,data1$y)
> correlation
[1] 0.8164205
> plot(data1$x,data1$y) ## scatter plot
> hist(data1$y) ## histogram
> boxplot(data1$y) ## boxplot
> model1=lm(y~x,data1)
> model1

Call:
lm(formula = y ~ x, data = data1)

Coefficients:
(Intercept)            x
     3.0001       0.5001

Hence the fitted regression model is: y = 3.0001 + 0.5001x

B) For data2

R code and output

> data2<-read.csv(file="data2.csv", header = TRUE)
> data2
    x    y
1  10 9.14
2   8 8.14
3  13 8.74
4   9 8.77
5  11 9.26
6  14 8.10
7   6 6.13
8   4 3.10
9  12 9.13
10  7 7.26
11  5 4.74
> mean_x=mean(data2$x)
> mean_x
[1] 9
> mean_y=mean(data2$y)
> mean_y
[1] 7.500909
> var_x=var(data2$x)
> var_x
[1] 11
> var_y=var(data2$y)
> var_y
[1] 4.127629
> correlation=cor(data2$x,data2$y)
> correlation
[1] 0.8162365
> plot(data2$x,data2$y) ## scatter plot
> hist(data2$y) ## histogram
> boxplot(data2$y) ## boxplot
> model2=lm(y~x,data2)
> model2

Call:
lm(formula = y ~ x, data = data2)

Coefficients:
(Intercept)            x
      3.001        0.500

Hence the fitted regression model is: y = 3.001 + 0.500x

C) For data3

R code and output

> data3<-read.csv(file="data3.csv", header = TRUE)
> data3
    x     y
1  10  7.46
2   8  6.77
3  13 12.74
4   9  7.11
5  11  7.81
6  14  8.84
7   6  6.08
8   4  5.39
9  12  8.15
10  7  6.42
11  5  5.73
> mean_x=mean(data3$x)
> mean_x
[1] 9
> mean_y=mean(data3$y)
> mean_y
[1] 7.5
> var_x=var(data3$x)
> var_x
[1] 11
> var_y=var(data3$y)
> var_y
[1] 4.12262
> correlation=cor(data3$x,data3$y)
> correlation
[1] 0.8162867
> plot(data3$x,data3$y) ## scatter plot
> hist(data3$y) ## histogram
> boxplot(data3$y) ## boxplot
> model3=lm(y~x,data3)
> model3

Call:
lm(formula = y ~ x, data = data3)

Coefficients:
(Intercept)            x
     3.0025       0.4997

Hence the fitted regression model is: y = 3.0025 + 0.4997x

D) For data4

R code and output

> data4<-read.csv(file="data4.csv", header = TRUE)
> data4
    x     y
1   8  6.58
2   8  5.76
3   8  7.71
4   8  8.84
5   8  8.47
6   8  7.04
7   8  5.25
8  19 12.50
9   8  5.56
10  8  7.91
11  8  6.89
> mean_x=mean(data4$x)
> mean_x
[1] 9
> mean_y=mean(data4$y)
> mean_y
[1] 7.500909
> var_x=var(data4$x)
> var_x
[1] 11
> var_y=var(data4$y)
> var_y
[1] 4.123249
> correlation=cor(data4$x,data4$y)
> correlation
[1] 0.8165214
> plot(data4$x,data4$y) ## scatter plot
> hist(data4$y) ## histogram
> boxplot(data4$y) ## boxplot
> model4=lm(y~x,data4)
> model4

Call:
lm(formula = y ~ x, data = data4)

Coefficients:
(Intercept)            x
     3.0017       0.4999

Hence the fitted regression model is: y = 3.0017 + 0.4999x
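
The per-data-set steps above can also be run in one pass, which makes the comparison easier to see. The following is a minimal sketch, assuming the built-in `anscombe` data frame rather than the CSV files; `summarise_set` and `stats` are hypothetical names chosen for this example. It collects the mean, variance, correlation, and regression coefficients into one table, then draws the four scatter plots on a common scale with their fitted lines.

```r
data(anscombe)

# Hypothetical helper: one row of summary statistics for data set i
summarise_set <- function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)
  data.frame(set       = i,
             mean_x    = mean(x),
             mean_y    = mean(y),
             var_x     = var(x),
             var_y     = var(y),
             cor_xy    = cor(x, y),
             intercept = unname(coef(fit)[1]),
             slope     = unname(coef(fit)[2]))
}

stats <- do.call(rbind, lapply(1:4, summarise_set))
print(round(stats, 3))

# Four scatter plots on shared axes, each with its least-squares line
labels  <- c("I", "II", "III", "IV")
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  plot(x, y, xlim = c(3, 20), ylim = c(2, 14), pch = 19,
       main = paste("Data", labels[i]))
  abline(lm(y ~ x), col = "red")
}
par(old_par)  # restore the previous plot layout
```

The table should show nearly the same numbers in every row, while the four panels look entirely different from one another. That contrast is the point of Anscombe's example: the visualization reveals structure that the summary statistics hide.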

Conclusion:-

1) Issues with the models

a) The scatter plot shows that x and y in Data II do not have a linear relationship; the points follow a smooth curve. A simple linear regression therefore cannot give correct predictions for Data II.

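b) The histograms and boxplots describe only the distribution of the response variable y. The normality assumption of linear regression applies to the residuals, not to y itself, so these plots cannot validate it; with only 11 points per data set, residual plots are a more informative check. The unusually large y values in Data III (12.74) and Data IV (12.50) do stand out in these plots.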

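c) For Data II, polynomial regression (adding an x^2 term) gives a much better fit than the straight line.

d) For Data IV, every x value is 8 except for a single point at x = 19, so the slope is determined entirely by that one high-leverage point. Higher-order terms cannot help here; observations at other x values are needed before any regression on x can be trusted.

e) Data I is well described by simple linear regression. Data III is also linear apart from a single outlier (x = 13, y = 12.74), which pulls the fitted line away from the other ten points; investigating that point, or using a robust regression method, would improve the model.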

f) All of the above comparisons were made using the scatter plots, boxplots, and histograms.
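
The question also asks how to evaluate and compare the models in R. One possible approach, sketched below under the assumption that the built-in `anscombe` data frame is used, is to compare R-squared and the residual standard error, inspect residuals against fitted values, and then try the improvements mentioned in the list above. The object names (`fits`, `comparison`, `fit2_quad`, `fit3_trim`, `lev4`) are hypothetical. Dropping the outlier in Data III is shown only as an illustration; in practice a point should be removed only when there is a substantive reason to do so.

```r
data(anscombe)

# Straight-line fit for each of the four data sets
fits <- lapply(1:4, function(i) {
  lm(y ~ x, data = data.frame(x = anscombe[[paste0("x", i)]],
                              y = anscombe[[paste0("y", i)]]))
})

# Numerical comparison: R-squared and residual standard error
comparison <- data.frame(
  set       = 1:4,
  r_squared = sapply(fits, function(f) summary(f)$r.squared),
  sigma     = sapply(fits, function(f) summary(f)$sigma)
)
print(round(comparison, 3))

# Graphical comparison: residuals against fitted values
labels  <- c("I", "II", "III", "IV")
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  plot(fitted(fits[[i]]), resid(fits[[i]]), pch = 19,
       xlab = "Fitted value", ylab = "Residual",
       main = paste("Data", labels[i]))
  abline(h = 0, lty = 2)
}
par(old_par)

# Data II: add a quadratic term
d2 <- data.frame(x = anscombe$x2, y = anscombe$y2)
fit2_quad <- lm(y ~ x + I(x^2), data = d2)

# Data III: refit without the point that has the largest absolute residual
d3 <- data.frame(x = anscombe$x3, y = anscombe$y3)
outlier <- which.max(abs(resid(fits[[3]])))
fit3_trim <- lm(y ~ x, data = d3[-outlier, ])

# Data IV: leverage of each observation in the straight-line fit
lev4 <- hatvalues(fits[[4]])

# R-squared before and after the changes for Data II and Data III
print(c(linear_II   = summary(fits[[2]])$r.squared,
        quad_II     = summary(fit2_quad)$r.squared,
        linear_III  = summary(fits[[3]])$r.squared,
        trimmed_III = summary(fit3_trim)$r.squared))
print(round(lev4, 3))
```

The four straight-line fits have almost the same R-squared (about 0.67), so that number alone cannot rank them; the residual plots are what expose the curvature in Data II and the outlier in Data III. For Data IV, the leverage values show that the single point at x = 19 determines the fitted line on its own.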

