Question

Problem 2: (Revised 6.3) Magazine Advertising: In a study of revenue from advertising, data were collected for 41 magazines, listed below. The variables observed are the number of pages of advertising and the advertising revenue. The names of the magazines are listed as:

(Use SAS)

Pages (P)   Revenue (R)
25          50
15          49.7
20          34
17          30.7
23          27
17          26.3
14          24.6
22          16.9
12          16.7
15          14.6
8           13.8
7           13.2
9           13.1
12          10.6
1           8.8
6           8.7
12          8.5
9           8.3
7           8.2
9           8.2
7           7.3
1           7
77          6.6
13          6.2
5           5.8
7           5.1
13          4.1
4           3.9
6           3.9
3           3.5
6           3.3
4           3
3           2.5
3           2.3
5           2.3
4           1.8
4           1.5
3           1.3
3           1.3
4           1
2           0.3

  1. Fit a linear regression equation relating advertising revenue to advertising pages. Verify that the fit is poor.
  2. Choose an appropriate transformation of the data and fit the model to the transformed data. Evaluate the fit.
  3. You should not be surprised by the presence of a large number of outliers because the magazines are highly heterogeneous and it is unrealistic to expect a single relationship to connect all of them. Find outliers and high leverage points. Delete the outliers and obtain an acceptable regression equation that relates advertising revenue to advertising pages.
  4. For the deleted data, check the homogeneity of the variance. Choose an appropriate transformation of the data and fit the model to the transformed data. Evaluate the fit.

Solutions

Expert Solution

(a)
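
The solution does not show how the data were loaded. A minimal sketch, assuming the columns are named P (advertising pages) and R (advertising revenue) as in the commands that follow, and leaving out the Magazine name column because the names are not given in the problem statement:

```r
# Advertising pages (P) and advertising revenue (R), in the order listed in the problem.
P <- c(25, 15, 20, 17, 23, 17, 14, 22, 12, 15, 8, 7, 9, 12, 1, 6, 12, 9, 7, 9, 7,
       1, 77, 13, 5, 7, 13, 4, 6, 3, 6, 4, 3, 3, 5, 4, 4, 3, 3, 4, 2)
R <- c(50, 49.7, 34, 30.7, 27, 26.3, 24.6, 16.9, 16.7, 14.6, 13.8, 13.2, 13.1,
       10.6, 8.8, 8.7, 8.5, 8.3, 8.2, 8.2, 7.3, 7, 6.6, 6.2, 5.8, 5.1, 4.1, 3.9,
       3.9, 3.5, 3.3, 3, 2.5, 2.3, 2.3, 1.8, 1.5, 1.3, 1.3, 1, 0.3)

# Assumed layout: one row per magazine, no name column.
magazines <- data.frame(P = P, R = R)

# Sanity checks: 41 magazines, and row 23 is the one with 77 pages.
stopifnot(nrow(magazines) == 41, magazines$P[23] == 77)
str(magazines)
```

With the rows kept in this order, the row numbers used later in the solution (1 to 4 and 23) refer to the same magazines.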

The data were loaded into a data frame named magazines, with columns P (advertising pages) and R (advertising revenue). The command below fits a linear regression that predicts revenue from advertising pages.

model = lm(R~P, data = magazines)

> model

Call:
lm(formula = R ~ P, data = magazines)

Coefficients:
(Intercept)            P
     7.6041       0.3527

The model summary below gives an R-squared of 0.1263, which is very low: the model explains only 12.63% of the variation in advertising revenue. So the fit is poor.

> summary(model)

Call:
lm(formula = R ~ P, data = magazines)

Residuals:
    Min      1Q  Median      3Q     Max
-28.162  -6.362  -2.773   2.322  36.805

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.6041     2.4061   3.160  0.00304 **
P             0.3527     0.1486   2.374  0.02262 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.6 on 39 degrees of freedom
Multiple R-squared: 0.1263,   Adjusted R-squared: 0.1039
F-statistic: 5.636 on 1 and 39 DF, p-value: 0.02262

The commands below produce the standard diagnostic plots, which let us evaluate the fit further. In the Residuals vs Fitted plot the points are not scattered evenly around zero: almost all of them are concentrated in one place, with a few extreme points far away. This pattern indicates a poor fit and suggests that the relationship between the predictor and the outcome variable may be non-linear.

> op <- par(mfrow=c(2,2),mar=c(2,3,1.5,0.5))
> plot(model)
> par(op)

(b)

Plotting advertising pages on the x-axis and advertising revenue on the y-axis shows that the relationship is not linear: advertising revenue rises sharply with advertising pages. There are also a lot of outliers in the data.

plot(magazines$P,magazines$R, xlab = "Advertising Pages", ylab = "Advertising Revenue", pch = 16)

Transforming the variables P and R to log(P) and log(R) gives a model with an R-squared of 0.4203.

> summary(lm(log(R)~log(P), data = magazines))$r.squared
[1] 0.420323

The model is improved over part (a), but the fit is still only moderate.
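
A slightly fuller evaluation of the log-log fit can be sketched as follows. The data frame is rebuilt from the values in the problem so that the snippet runs on its own, and the names fit_raw and fit_log are introduced here only for illustration. Note that the two R-squared values describe different response scales (R versus log(R)), so comparing them is only a rough guide.

```r
# Data from the problem statement (P = advertising pages, R = advertising revenue).
P <- c(25, 15, 20, 17, 23, 17, 14, 22, 12, 15, 8, 7, 9, 12, 1, 6, 12, 9, 7, 9, 7,
       1, 77, 13, 5, 7, 13, 4, 6, 3, 6, 4, 3, 3, 5, 4, 4, 3, 3, 4, 2)
R <- c(50, 49.7, 34, 30.7, 27, 26.3, 24.6, 16.9, 16.7, 14.6, 13.8, 13.2, 13.1,
       10.6, 8.8, 8.7, 8.5, 8.3, 8.2, 8.2, 7.3, 7, 6.6, 6.2, 5.8, 5.1, 4.1, 3.9,
       3.9, 3.5, 3.3, 3, 2.5, 2.3, 2.3, 1.8, 1.5, 1.3, 1.3, 1, 0.3)
magazines <- data.frame(P = P, R = R)

# Untransformed model from part (a) and the log-log model from part (b).
fit_raw <- lm(R ~ P, data = magazines)
fit_log <- lm(log(R) ~ log(P), data = magazines)

# R-squared of each model, rounded to four decimals.
r2 <- c(raw = summary(fit_raw)$r.squared,
        loglog = summary(fit_log)$r.squared)
print(round(r2, 4))

# Coefficient table of the log-log model: estimates, standard errors, t and p values.
print(coef(summary(fit_log)))

# Residual standard error on the log scale.
print(summary(fit_log)$sigma)
```

In the log-log model the slope is an elasticity: it estimates the percentage change in revenue associated with a one percent change in advertising pages.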

(c)

We will delete the outliers to further improve the model.

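First, we compute the outlier threshold for advertising revenue (R) using the 1.5 x IQR rule: values above the third quartile plus 1.5 times the interquartile range are flagged.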

> summary(magazines$R)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.30 3.30 7.30 11.36 13.80 50.00
> iqr = 13.80 - 3.30
> 13.80 + 1.5 * iqr
[1] 29.55

So any advertising revenue greater than 29.55 is considered an outlier.
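
The same threshold can be computed directly from the data instead of typing the quartiles by hand. The sketch below applies the rule to both variables; the helper name upper_fence is hypothetical, and the data frame is rebuilt from the problem statement so that the snippet runs on its own.

```r
# Data from the problem statement (P = advertising pages, R = advertising revenue).
P <- c(25, 15, 20, 17, 23, 17, 14, 22, 12, 15, 8, 7, 9, 12, 1, 6, 12, 9, 7, 9, 7,
       1, 77, 13, 5, 7, 13, 4, 6, 3, 6, 4, 3, 3, 5, 4, 4, 3, 3, 4, 2)
R <- c(50, 49.7, 34, 30.7, 27, 26.3, 24.6, 16.9, 16.7, 14.6, 13.8, 13.2, 13.1,
       10.6, 8.8, 8.7, 8.5, 8.3, 8.2, 8.2, 7.3, 7, 6.6, 6.2, 5.8, 5.1, 4.1, 3.9,
       3.9, 3.5, 3.3, 3, 2.5, 2.3, 2.3, 1.8, 1.5, 1.3, 1.3, 1, 0.3)
magazines <- data.frame(P = P, R = R)

# Hypothetical helper: third quartile plus 1.5 times the interquartile range.
# quantile() uses R's default method, the same one summary() reports.
upper_fence <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  q[2] + 1.5 * (q[2] - q[1])
}

fence_R <- upper_fence(magazines$R)  # 29.55
fence_P <- upper_fence(magazines$P)  # 26.5

# Row numbers of the observations above each fence.
rows_R <- which(magazines$R > fence_R)  # 1 2 3 4
rows_P <- which(magazines$P > fence_P)  # 23

print(magazines[sort(c(rows_R, rows_P)), ])
```

This is a one-variable-at-a-time screen. It does not look at the regression itself, so it can miss points that are unusual only relative to the fitted line.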

We will delete the entries below from the data frame.

Magazine P R
1 Cosmopolitan 25 50.0
2 Redbook 15 49.7
3 Glamour 20 34.0
4 SouthernLiving 17 30.7

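Similarly, we will delete the entry that is an outlier in advertising pages (P). With 77 pages it lies far from the other magazines in the predictor, so it is also a high leverage point.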

23 TrueStory 77 6.6
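
The question also asks for high leverage points, which the quartile rule above does not measure directly. A regression-based check can be sketched with base R's hatvalues() and rstandard(). The cutoffs used here (leverage above 2p/n and absolute standardized residual above 2) are common rules of thumb and an assumption of this sketch, not part of the original solution.

```r
# Data from the problem statement (P = advertising pages, R = advertising revenue).
P <- c(25, 15, 20, 17, 23, 17, 14, 22, 12, 15, 8, 7, 9, 12, 1, 6, 12, 9, 7, 9, 7,
       1, 77, 13, 5, 7, 13, 4, 6, 3, 6, 4, 3, 3, 5, 4, 4, 3, 3, 4, 2)
R <- c(50, 49.7, 34, 30.7, 27, 26.3, 24.6, 16.9, 16.7, 14.6, 13.8, 13.2, 13.1,
       10.6, 8.8, 8.7, 8.5, 8.3, 8.2, 8.2, 7.3, 7, 6.6, 6.2, 5.8, 5.1, 4.1, 3.9,
       3.9, 3.5, 3.3, 3, 2.5, 2.3, 2.3, 1.8, 1.5, 1.3, 1.3, 1, 0.3)
magazines <- data.frame(P = P, R = R)

fit <- lm(R ~ P, data = magazines)

h <- hatvalues(fit)     # leverage of each observation
r <- rstandard(fit)     # standardized (internally studentized) residuals
p <- length(coef(fit))  # number of estimated coefficients (intercept and slope)
n <- nrow(magazines)

# Rule-of-thumb cutoffs (assumed here): leverage > 2p/n, |standardized residual| > 2.
high_leverage <- unname(which(h > 2 * p / n))
large_resid   <- unname(which(abs(r) > 2))

print(high_leverage)  # 23, the magazine with 77 pages
print(large_resid)

# Leverage and standardized residual for every flagged row.
flagged <- sort(union(high_leverage, large_resid))
print(data.frame(row = flagged, magazines[flagged, ],
                 leverage = round(h[flagged], 3),
                 std_resid = round(r[flagged], 2)))
```

Row 23 is the only observation whose leverage exceeds 2p/n, which agrees with deleting it on the grounds of its advertising pages.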

We store the remaining data in a new data frame, magazines.new.

> magazines.new = magazines[-c(1:4, 23), ]

Rerunning the regression after deleting the outliers gives an R-squared of 63.45%, a much better fit than the 12.63% obtained in part (a).

> summary(lm(R~P, data = magazines.new))$r.squared
[1] 0.6344904
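
The posted solution stops at part (c). For part (d), one possible approach (a sketch, not the original author's answer) is to check whether the spread of the residuals from the reduced model grows with the fitted values, and then to refit on the log scale. The names fit_new and fit_new_log, and the use of a Spearman rank correlation as a rough check of constant variance, are assumptions made for illustration.

```r
# Data from the problem statement (P = advertising pages, R = advertising revenue).
P <- c(25, 15, 20, 17, 23, 17, 14, 22, 12, 15, 8, 7, 9, 12, 1, 6, 12, 9, 7, 9, 7,
       1, 77, 13, 5, 7, 13, 4, 6, 3, 6, 4, 3, 3, 5, 4, 4, 3, 3, 4, 2)
R <- c(50, 49.7, 34, 30.7, 27, 26.3, 24.6, 16.9, 16.7, 14.6, 13.8, 13.2, 13.1,
       10.6, 8.8, 8.7, 8.5, 8.3, 8.2, 8.2, 7.3, 7, 6.6, 6.2, 5.8, 5.1, 4.1, 3.9,
       3.9, 3.5, 3.3, 3, 2.5, 2.3, 2.3, 1.8, 1.5, 1.3, 1.3, 1, 0.3)
magazines <- data.frame(P = P, R = R)

# Reduced data: drop rows 1 to 4 (revenue outliers) and row 23 (pages outlier).
magazines.new <- magazines[-c(1:4, 23), ]
fit_new <- lm(R ~ P, data = magazines.new)
print(summary(fit_new)$r.squared)

# Rough check of constant variance: do the absolute residuals tend to grow
# with the fitted values? A clearly positive rank correlation suggests they do.
het_raw <- cor.test(abs(residuals(fit_new)), fitted(fit_new),
                    method = "spearman", exact = FALSE)
print(het_raw$estimate)
print(het_raw$p.value)

# Refit on the log scale and repeat the same check.
fit_new_log <- lm(log(R) ~ log(P), data = magazines.new)
print(summary(fit_new_log)$r.squared)

het_log <- cor.test(abs(residuals(fit_new_log)), fitted(fit_new_log),
                    method = "spearman", exact = FALSE)
print(het_log$estimate)
print(het_log$p.value)
```

A plot of residuals(fit_new) against fitted(fit_new) shows the same information graphically and is the more usual way to judge homogeneity of variance.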

