Data exploration through visualization is important because summary statistics alone may not tell the whole story. This was famously demonstrated by the English statistician Francis Anscombe in 1973, when he presented four data sets (now known as Anscombe's quartet) with nearly identical summary statistics. The four data sets are shown below.
Using RStudio, calculate the mean, variance, correlation, and linear regression for each data set (no data partition). Using base R or ggplot2, create a visual representation of this data. What does this visualization show? (Show the code for each data set.)
Using RStudio, how do you evaluate and compare the fitted linear regression models? Are there any issues with the models built?
Is there any way to improve the linear regression models for Data I, II, and III?
| Data I |       | Data II |       | Data III |       | Data IV |       |
| x      | y     | x       | y     | x        | y     | x       | y     |
| 10.0   | 8.04  | 10.0    | 9.14  | 10.0     | 7.46  | 8.0     | 6.58  |
| 8.0    | 6.95  | 8.0     | 8.14  | 8.0      | 6.77  | 8.0     | 5.76  |
| 13.0   | 7.58  | 13.0    | 8.74  | 13.0     | 12.74 | 8.0     | 7.71  |
| 9.0    | 8.81  | 9.0     | 8.77  | 9.0      | 7.11  | 8.0     | 8.84  |
| 11.0   | 8.33  | 11.0    | 9.26  | 11.0     | 7.81  | 8.0     | 8.47  |
| 14.0   | 9.96  | 14.0    | 8.10  | 14.0     | 8.84  | 8.0     | 7.04  |
| 6.0    | 7.24  | 6.0     | 6.13  | 6.0      | 6.08  | 8.0     | 5.25  |
| 4.0    | 4.26  | 4.0     | 3.10  | 4.0      | 5.39  | 19.0    | 12.50 |
| 12.0   | 10.84 | 12.0    | 9.13  | 12.0     | 8.15  | 8.0     | 5.56  |
| 7.0    | 4.82  | 7.0     | 7.26  | 7.0      | 6.42  | 8.0     | 7.91  |
| 5.0    | 5.68  | 5.0     | 4.74  | 5.0      | 5.73  | 8.0     | 6.89  |
Solution:
For the above data we have to perform simple linear regression and data visualization. We will use R for this task.
1) Enter the data into Excel and save each data set as a .csv file.
2) Import each file with read.csv().
3) Perform the analysis as given below.
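As a side note, the same four data sets also ship with base R as the built-in anscombe data frame (columns x1..x4 and y1..y4), so the CSV step can be skipped. A minimal sketch, assuming you want the same data1..data4 names used below:
# Optional shortcut: build the four data frames from R's built-in anscombe data set
data1 <- data.frame(x = anscombe$x1, y = anscombe$y1)
data2 <- data.frame(x = anscombe$x2, y = anscombe$y2)
data3 <- data.frame(x = anscombe$x3, y = anscombe$y3)
data4 <- data.frame(x = anscombe$x4, y = anscombe$y4)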
Below is the R code for our problem.
A) For data1
data1 <- read.csv(file = "data1.csv", header = TRUE)
data1
mean(data1$x)          # mean of x
mean(data1$y)          # mean of y
var(data1$x)           # variance of x
var(data1$y)           # variance of y
cor(data1$x, data1$y)  # correlation between x and y
## Data visualization
plot(data1$x, data1$y) # scatter plot
hist(data1$y)          # histogram
boxplot(data1$y)       # boxplot
## Proceeding to linear regression
fit1 <- lm(y ~ x, data = data1)
fit1
----Output----
> data1 <- read.csv(file = "data1.csv", header = TRUE)
> data1
x y
1 10 8.04
2 8 6.95
3 13 7.58
4 9 8.81
5 11 8.33
6 14 9.96
7 6 7.24
8 4 4.26
9 12 10.84
10 7 4.82
11 5 5.68
> mean(data1$x)
[1] 9
> mean(data1$y)
[1] 7.500909
> var(data1$x)
[1] 11
> var(data1$y)
[1] 4.127269
> cor(data1$x, data1$y)
[1] 0.8164205
> plot(data1$x, data1$y) ## scatter plot
> hist(data1$y) ## histogram
> boxplot(data1$y) ## boxplot
> fit1 <- lm(y ~ x, data = data1)
> fit1
Call:
lm(formula = y ~ x, data = data1)
Coefficients:
(Intercept) x
3.0001 0.5001
Hence the fitted regression model is: ŷ = 3.0001 + 0.5001x
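Since the question allows base R or ggplot2, here is a short ggplot2 sketch of the same scatter plot with the fitted line overlaid (this assumes the ggplot2 package is installed; the same template works for data2, data3, and data4 by swapping the data frame):
library(ggplot2)
ggplot(data1, aes(x = x, y = y)) +
  geom_point() +                           # scatter plot of the raw data
  geom_smooth(method = "lm", se = FALSE) + # overlay the least-squares line
  labs(title = "Data I", x = "x", y = "y")
The visualization shows that Data I scatters roughly evenly around a straight line, which is what simple linear regression assumes.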
B) For data2
R code and output
> data2 <- read.csv(file = "data2.csv", header = TRUE)
> data2
x y
1 10 9.14
2 8 8.14
3 13 8.74
4 9 8.77
5 11 9.26
6 14 8.10
7 6 6.13
8 4 3.10
9 12 9.13
10 7 7.26
11 5 4.74
> mean(data2$x)
[1] 9
> mean(data2$y)
[1] 7.500909
> var(data2$x)
[1] 11
> var(data2$y)
[1] 4.127629
> cor(data2$x, data2$y)
[1] 0.8162365
> plot(data2$x, data2$y) ## scatter plot
> hist(data2$y) ## histogram
> boxplot(data2$y) ## boxplot
> fit2 <- lm(y ~ x, data = data2)
> fit2
Call:
lm(formula = y ~ x, data = data2)
Coefficients:
(Intercept) x
3.001 0.500
Hence the fitted regression model is: ŷ = 3.001 + 0.500x
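The scatter plot for data2 is clearly curved rather than linear. As a sketch of the improvement discussed in the conclusion, a quadratic polynomial regression captures this curvature (fit2q is a name introduced here for illustration):
fit2q <- lm(y ~ poly(x, 2), data = data2) # quadratic polynomial fit
summary(fit2q)                            # R-squared should be close to 1 here
plot(data2$x, data2$y)
lines(sort(data2$x), fitted(fit2q)[order(data2$x)]) # overlay the fitted curve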
C) For data3
R code and output
> data3 <- read.csv(file = "data3.csv", header = TRUE)
> data3
x y
1 10 7.46
2 8 6.77
3 13 12.74
4 9 7.11
5 11 7.81
6 14 8.84
7 6 6.08
8 4 5.39
9 12 8.15
10 7 6.42
11 5 5.73
> mean(data3$x)
[1] 9
> mean(data3$y)
[1] 7.5
> var(data3$x)
[1] 11
> var(data3$y)
[1] 4.12262
> cor(data3$x, data3$y)
[1] 0.8162867
> plot(data3$x, data3$y) ## scatter plot
> hist(data3$y) ## histogram
> boxplot(data3$y) ## boxplot
> fit3 <- lm(y ~ x, data = data3)
> fit3
Call:
lm(formula = y ~ x, data = data3)
Coefficients:
(Intercept) x
3.0025 0.4997
Hence the fitted regression model is: ŷ = 3.0025 + 0.4997x
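Data3 follows a nearly perfect straight line except for one outlier at (13, 12.74). As a sketch, a robust fit with MASS::rlm (the MASS package ships with R) down-weights that point, or the model can simply be refit without it (fit3r and fit3b are names introduced here for illustration):
library(MASS)
fit3r <- rlm(y ~ x, data = data3)                # robust regression down-weights the outlier
fit3r
fit3b <- lm(y ~ x, data = data3[data3$y < 12, ]) # refit after dropping the outlier
fit3b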
D) For data4
R code and output
> data4 <- read.csv(file = "data4.csv", header = TRUE)
> data4
x y
1 8 6.58
2 8 5.76
3 8 7.71
4 8 8.84
5 8 8.47
6 8 7.04
7 8 5.25
8 19 12.50
9 8 5.56
10 8 7.91
11 8 6.89
> mean(data4$x)
[1] 9
> mean(data4$y)
[1] 7.500909
> var(data4$x)
[1] 11
> var(data4$y)
[1] 4.123249
> cor(data4$x, data4$y)
[1] 0.8165214
> plot(data4$x, data4$y) ## scatter plot
> hist(data4$y) ## histogram
> boxplot(data4$y) ## boxplot
> fit4 <- lm(y ~ x, data = data4)
> fit4
Call:
lm(formula = y ~ x, data = data4)
Coefficients:
(Intercept) x
3.0017 0.4999
Hence the fitted regression model is: ŷ = 3.0017 + 0.4999x
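To evaluate and compare the fitted models, summary() reports R-squared and coefficient tests, and calling plot() on a fitted lm object draws the standard residual diagnostics. A minimal sketch, assuming the fit1..fit4 objects from above:
# All four models report nearly identical coefficients and R-squared (about 0.67),
# even though the underlying relationships are completely different.
for (fit in list(fit1, fit2, fit3, fit4)) print(summary(fit)$r.squared)
par(mfrow = c(2, 2)) # 2x2 grid for the diagnostic plots
plot(fit1)           # residuals vs fitted, Q-Q, scale-location, leverage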
Conclusion:-
1) Issues with the models
a) All four models have nearly identical means, variances, correlations, and fitted coefficients, yet the scatter plots reveal completely different relationships. This is exactly Anscombe's point: summary statistics alone do not tell the whole story.
b) From the data visualization we can see that x and y in data2 do not actually have a linear relationship (the scatter plot is curved), so simple linear regression cannot give correct predictions for data2.
c) In data3 the points lie on an almost perfect straight line except for one outlier, which pulls the fitted line away from the rest of the data.
d) In data4 every x value is 8 except a single observation at x = 19, so the fitted line is determined entirely by that one high-leverage point and the model cannot be trusted.
2) Improving the models for Data I, II, and III
a) Data1 satisfies the linear regression assumptions reasonably well, so the simple linear model is already adequate.
b) For data2, polynomial regression with a quadratic term gives a far better fit, as sketched after the Data II output above.
c) For data3, removing the outlier or using robust regression improves the fit, as sketched after the Data III output above.
d) All of the above comparisons are made using the scatter plots, box plots, and histograms together with the model summaries.
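Finally, as a hedged sketch, the four sets can be drawn side by side in one faceted ggplot2 figure (assuming the data1..data4 data frames from above), which makes the contrast between identical statistics and different shapes immediate:
library(ggplot2)
# Stack the four sets with a label column, then facet one panel per set.
all4 <- rbind(
  cbind(data1, set = "Data I"),
  cbind(data2, set = "Data II"),
  cbind(data3, set = "Data III"),
  cbind(data4, set = "Data IV")
)
ggplot(all4, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) + # same fitted line in every panel
  facet_wrap(~ set)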