In: Statistics and Probability
Problem 2: (Revised 6.3) Magazine Advertising: In a study of revenue from advertising, data were collected for 41 magazines list as follows. The variables observed are number of pages of advertising and advertising revenue. The names of the magazines are listed as:
(use sas)
Adv Revenue
25 50
15 49.7
20 34
17 30.7
23 27
17 26.3
14 24.6
22 16.9
12 16.7
15 14.6
8 13.8
7 13.2
9 13.1
12 10.6
1 8.8
6 8.7
12 8.5
9 8.3
7 8.2
9 8.2
7 7.3
1 7
77 6.6
13 6.2
5 5.8
7 5.1
13 4.1
4 3.9
6 3.9
3 3.5
6 3.3
4 3
3 2.5
3 2.3
5 2.3
4 1.8
4 1.5
3 1.3
3 1.3
4 1
2 0.3
(a)
Loaded the data into magazines dataframe. Below command used to fit the linear regression on the data to predict revenue based on advertising pages.
model = lm(R~P, data = magazines)
> model
Call:
lm(formula = R ~ P, data = magazines)
Coefficients:
(Intercept) P
7.6041 0.3527
Getting the summary of the model R-Sq by below command, we see that R-Sq is 0.1263 which is very low. That is the model only explains 12.63% of variation of advertising revenue. So, the fit is poor.
> summary(model)
Call:
lm(formula = R ~ P, data = magazines)
Residuals:
Min 1Q Median 3Q Max
-28.162 -6.362 -2.773 2.322 36.805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.6041 2.4061 3.160 0.00304 **
P 0.3527 0.1486 2.374 0.02262 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’
1
Residual standard error: 11.6 on 39 degrees of freedom
Multiple R-squared: 0.1263, Adjusted
R-squared: 0.1039
F-statistic: 5.636 on 1 and 39 DF, p-value: 0.02262
By running the below command, we can further evaluate the fit. Residual vs Fitted shows a pattern (all data points are concentrated at one place), which shows that fit is bad and there could be a non-linear relationship between predictor variables and an outcome variable.
> op <- par(mfrow=c(2,2),mar=c(2,3,1.5,0.5))
> plot(model)
(b)
Plotting the advertising pages on x axis and advertising revenue on y-axis, we see that the relationship is not linear and advertising revenue rises sharply with advertising pages. Also, there are lot of outliers in the data.
plot(magazines$P,magazines$R, xlab = "Advertising Pages", ylab = "Advertising Revenue", pch = 16)
Transforming the variable P and R to log(P) and log(R), we get the r-squared of the model as 0.4203
> summary(lm(log(R)~log(P), data =
magazines))$r.squared
[1] 0.420323
Although the model is somewhat improved from part (a) but the fit is still average.
(c)
We will delete the outliers to further improve the model.
First, we will calculate the outlier for the advertising revenue (R).
> summary(magazines$R)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.30 3.30 7.30 11.36 13.80 50.00
> IQR = 13.80 - 3.30
> 13.80 + 1.5 * IQR
[1] 29.55
So, any advertising revenue greater than 29.55 is considered as outlier.
We will delete the below entries from the dataframe.
Magazine P R
1 Cosmopolitan 25 50.0
2 Redbook 15 49.7
3 Glamour 20 34.0
4 SouthernLiving 17 30.7
Similarly, we will delete the entry for the outlier of advertising pages.
23 TrueStory 77 6.6
We will get the data in new dataframe magazines.new
> magazines.new = magazines[c(-1:-4,-23),]
Running the regression after deleting the outliers, we get R-sq as 63.44% which is a good fit.
> summary(lm(R~P, data = magazines.new))$r.squared
[1] 0.6344904