A motion picture industry analyst is studying movies based on epic novels. The following data were obtained for 10 Hollywood movies made in the past five years. Each movie was based on an epic novel. For these data, x1 = first-year box office receipts of the movie, x2 = total production costs of the movie, x3 = total promotional costs of the movie, and x4 = total book sales prior to movie release. All units are in millions of dollars.
x1 | x2 | x3 | x4 |
85.1 | 8.5 | 5.1 | 4.7 |
106.3 | 12.9 | 5.8 | 8.8 |
50.2 | 5.2 | 2.1 | 15.1 |
130.6 | 10.7 | 8.4 | 12.2 |
54.8 | 3.1 | 2.9 | 10.6 |
30.3 | 3.5 | 1.2 | 3.5 |
79.4 | 9.2 | 3.7 | 9.7 |
91.0 | 9.0 | 7.6 | 5.9 |
135.4 | 15.1 | 7.7 | 20.8 |
89.3 | 10.2 | 4.5 | 7.9 |
(a) Generate summary statistics, including the mean and standard deviation of each variable. Compute the coefficient of variation for each variable. (Use 2 decimal places.)
Variable | x̄ | s | CV |
x1 | | | % |
x2 | | | % |
x3 | | | % |
x4 | | | % |
Relative to its mean, which variable has the largest spread of data values?
x4
x3
x2
x1
Why would a variable with a large coefficient of variation be
expected to change a lot relative to its average value? Although
x1 has the largest standard deviation, it has
the smallest coefficient of variation. How does the mean of
x1 help explain this?
A variable with a large CV has large s relative to x̄. Here, x1 has a small CV because we divide by a small mean.
A variable with a large CV has large s relative to x̄. Here, x1 has a small CV because we divide by a large mean.
A variable with a large CV has small s relative to x̄. Here, x1 has a small CV because we divide by a large mean.
A variable with a large CV has small s relative to x̄. Here, x1 has a small CV because we divide by a small mean.
(b) For each pair of variables, generate the correlation
coefficient r. Compute the corresponding coefficient of
determination r². (Use 3 decimal places.)
Pair | r | r² |
x1, x2 | | |
x1, x3 | | |
x1, x4 | | |
x2, x3 | | |
x2, x4 | | |
x3, x4 | | |
Which of the three variables x2, x3, and x4 has the least influence on box office receipts?
x4
x3
x2
What percent of the variation in box office receipts can be
attributed to the corresponding variation in production costs? (Use
1 decimal place.)
_________________%
(c) Perform a regression analysis with x1 as
the response variable. Use x2,
x3, and x4 as explanatory
variables. Look at the coefficient of multiple determination. What
percentage of the variation in x1 can be
explained by the corresponding variations in
x2, x3, and
x4 taken together? (Use 1 decimal place.)
_____________ %
(d) Write out the regression equation. (Use 2 decimal places.)
x1 = ____ + ____ x2 + ____ x3 + ____ x4
Explain how each coefficient can be thought of as a slope.
If we hold all explanatory variables as fixed constants, the intercept can be thought of as a "slope."
If we look at all coefficients together, each one can be thought of as a "slope."
If we look at all coefficients together, the sum of them can be thought of as the overall "slope" of the regression line.
If we hold all other explanatory variables as fixed constants, then we can look at one coefficient as a "slope."
If x2 (production costs) and
x4 (book sales) were held fixed but
x3 (promotional costs) were increased by 0.6
million dollars, what would you expect for the corresponding change
in x1 (box office receipts)? (Use 2 decimal
places.)________________
(e) Test each coefficient in the regression equation to determine
if it is zero or not zero. Use level of significance 5%. (Use 2
decimal places for t and 3 decimal places for the
P-value.)
Coefficient | t | P-value |
β2 | | |
β3 | | |
β4 | | |
Conclusion
Reject the null for β3 and β4. Fail to reject the null for β2.
Reject the null for β2 and β3. Fail to reject the null for β4.
Reject the null for all tests.
Reject the null for β2 and β4. Fail to reject the null for β3.
Explain why book sales x4 probably are not
contributing much information in the regression model to forecast
box office receipts x1.
From the previous tests, we can conclude that the coefficient for x4 is different than 0. Thus it does not belong in the model.
From the previous tests, we can conclude that the coefficient for x4 is not different than 0. Thus it does not belong in the model.
From the previous tests, we can conclude that the coefficient for x4 is different than 0. Thus it belongs in the model.
From the previous tests, we can conclude that the coefficient for x4 is not different than 0. Thus it belongs in the model.
(f) Find a 90% confidence interval for each coefficient. (Use 2
decimal places.)
Coefficient | lower limit | upper limit |
β2 | | |
β3 | | |
β4 | | |
(g) Suppose a new movie (based on an epic novel) has just been
released. Production costs were x2 = 11.4
million; promotion costs were x3 = 4.7 million;
book sales were x4 = 8.1 million. Make a
prediction for x1 = first-year box office
receipts and find an 85% confidence interval for your prediction
(if your software supports prediction intervals). (Use 1 decimal
place.)
prediction | |
lower limit | |
upper limit |
(h) Construct a new regression model with x3 as
the response variable and x1,
x2, and x4 as explanatory
variables. (Use 2 decimal places.)
x3 = ____ + ____ x1 + ____ x2 + ____ x4
Suppose Hollywood is planning a new epic movie with projected box
office sales x1 = 100 million and production
costs x2 = 12 million. The book on which the
movie is based had sales of x4 = 9.2 million.
Forecast the dollar amount (in millions) that should be budgeted
for promotion costs x3 and find an 80%
confidence interval for your prediction.
prediction | |
lower limit | |
upper limit |
[Used R-Software]
(a)
(See R-commands for computation)
Variable | xbar | s | CV |
x1 | 85.24 | 33.79 | 39.64% |
x2 | 8.74 | 3.89 | 44.45% |
x3 | 4.90 | 2.48 | 50.62% |
x4 | 9.92 | 5.17 | 52.15% |
Relative to its mean, which variable has the largest spread of data values? The variable with the highest coefficient of variation, x4 (CV = 52.15%), has the largest spread of data values relative to its mean.
Why would a variable with a large coefficient of variation be
expected to change a lot relative to its average value? Although x1
has the largest standard deviation, it has the smallest coefficient
of variation. How does the mean of x1 help explain this?
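A variable with a large CV has large s relative to its mean x̄, so its
values typically vary by a large fraction of their average. Here,
x1 has a small CV because we divide by a large mean: s = 33.79 is the
largest standard deviation, but dividing by the mean 85.24 gives
CV = 39.64%, the smallest of the four.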
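As a compact alternative to the per-variable commands listed further down, the summary table for (a) can be sketched in one pass. The names `movies` and `cv_table` are illustrative choices, not part of the original answer; the data are copied from the problem statement.

```r
# Data from the problem statement (all values in millions of dollars)
movies <- data.frame(
  x1 = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x2 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x3 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x4 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)

# Mean and sample standard deviation of each column
cv_table <- data.frame(
  xbar = sapply(movies, mean),
  s    = sapply(movies, sd)
)

# Coefficient of variation as a percentage: CV = 100 * s / xbar
cv_table$cv_pct <- 100 * cv_table$s / cv_table$xbar

round(cv_table, 2)

# Variable with the largest spread relative to its mean
rownames(cv_table)[which.max(cv_table$cv_pct)]
```

Because CV divides s by the mean, x1 can have the largest s and still the smallest CV.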
(b)
For each pair of variables, generate the correlation coefficient r. Compute the corresponding coefficient of determination r². (See R-commands for computation)
Pair | r | r² |
x1, x2 | 0.917 | 0.842 |
x1, x3 | 0.930 | 0.865 |
x1, x4 | 0.475 | 0.225 |
x2, x3 | 0.790 | 0.624 |
x2, x4 | 0.429 | 0.184 |
x3, x4 | 0.299 | 0.089 |
Which of the three variables x2, x3, and x4 has the least influence on box office receipts? The correlation between x1 and x4 is the weakest of the three (r = 0.475, r² = 0.225). Therefore, x4 has the least influence on box office receipts.
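What percent of the variation in box office receipts can be attributed to the corresponding variation in production costs? (Use 1 decimal place.) 84.2%. This is the coefficient of determination for x1 and x2 expressed as a percentage, r² = 0.917² ≈ 0.842, not the correlation r = 0.917 itself. (See R-commands for computation)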
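All six pairwise values in (b) can also be read off a single correlation matrix. The sketch below assumes the data are held in a data frame called `movies` (an illustrative name); squaring the matrix elementwise gives every r² at once.

```r
# Data from the problem statement (all values in millions of dollars)
movies <- data.frame(
  x1 = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x2 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x3 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x4 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)

# Pearson correlation matrix; entry [i, j] is r for the pair (xi, xj)
r <- cor(movies)
round(r, 3)

# Elementwise square gives the coefficients of determination r^2
round(r^2, 3)

# Percent of variation in x1 associated with x2: use r^2, not r
pct_x1_x2 <- round(100 * r["x1", "x2"]^2, 1)
pct_x1_x2
```

The last line is the quantity asked for in the "percent of variation" question, which is why r is squared before converting to a percentage.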
(c)
Perform a regression analysis with x1 as the response variable. Use x2, x3, and x4 as explanatory variables. Look at the coefficient of multiple determination. What percentage of the variation in x1 can be explained by the corresponding variations in x2, x3, and x4 taken together? (Use 1 decimal place.)
The coefficient of multiple determination R² gives the percentage of the variation in x1 that can be explained by the corresponding variations in x2, x3, and x4 taken together. Here R² = 0.9668, so the required percentage is 96.7%. (See R-commands for computation)
(d)
Thus, the regression equation is: x1 = 7.68 + 3.66*x2 + 7.62*x3 + 0.83*x4 (See R-commands for computation)
Explain how each coefficient can be thought of as a slope.
If we hold all other explanatory variables as fixed
constants, then we can look at one coefficient as a
"slope": it is the expected change in x1 per one-unit
increase in that explanatory variable.
If x2 (production costs) and x4 (book sales) were held fixed but x3 (promotional costs) were increased by 0.6 million dollars, what would you expect for the corresponding change in x1 (box office receipts)? (Use 2 decimal places.)
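With x2 and x4 held fixed, the expected change in x1 is the x3 coefficient times the change in x3: 0.6 × 7.62 = 4.57. Box office receipts are therefore expected to rise by about 4.57 million dollars.
Refitting the model with x3 + 0.6 in place of x3 (see R-commands) leaves all the betas unchanged and only shifts the intercept from 7.68 to 3.10, a drop of the same 4.57.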
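The effect of raising x3 by 0.6 can be checked directly by predicting at two points that differ only in x3. This is a minimal sketch: the baseline values in `base` are hypothetical and chosen only for illustration, since the difference does not depend on them in a linear model.

```r
# Data from the problem statement (all values in millions of dollars)
movies <- data.frame(
  x1 = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x2 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x3 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x4 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)

fit <- lm(x1 ~ x2 + x3 + x4, data = movies)

# Hypothetical baseline movie (values are illustrative assumptions)
base   <- data.frame(x2 = 10, x3 = 5, x4 = 9)
# Same movie with promotional costs raised by 0.6; x2 and x4 held fixed
bumped <- transform(base, x3 = x3 + 0.6)

# Change in predicted box office receipts
change <- unname(predict(fit, newdata = bumped) - predict(fit, newdata = base))
round(change, 2)

# Same number obtained from the slope alone: 0.6 * beta3
round(0.6 * coef(fit)[["x3"]], 2)
```

Both lines should agree, because with the other variables fixed the x3 coefficient is the slope of x1 with respect to x3.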
R-commands and outputs:
x1=c(85.1,106.3,50.2,130.6,54.8,30.3,79.4,91.0,135.4,89.3)
x2=c(8.5,12.9,5.2,10.7,3.1,3.5,9.2,9.0,15.1,10.2)
x3=c(5.1,5.8,2.1,8.4,2.9,1.2,3.7,7.6,7.7,4.5)
x4=c(4.7,8.8,15.1,12.2,10.6,3.5,9.7,5.9,20.8,7.9)
#(a)
summary(x1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
30.30 60.95 87.20 85.24 102.47 135.40
mean(x1)
sd(x1)
cvx1=sd(x1)/mean(x1)
cvx1
summary(x2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.100 6.025 9.100 8.740 10.575 15.100
mean(x2)
sd(x2)
cvx2=sd(x2)/mean(x2)
cvx2
summary(x3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.20 3.10 4.80 4.90 7.15 8.40
mean(x3)
sd(x3)
cvx3=sd(x3)/mean(x3)
cvx3
summary(x4)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.50 6.40 9.25 9.92 11.80 20.80
mean(x4)
sd(x4)
cvx4=sd(x4)/mean(x4)
cvx4
xibar=c(mean(x1),mean(x2),mean(x3),mean(x4))
xibar
[1] 85.24 8.74 4.90 9.92
si=c(sd(x1),sd(x2),sd(x3),sd(x4))
si
[1] 33.786361 3.885357 2.480143 5.173393
round(si,2)
[1] 33.79 3.89 2.48 5.17
cvxi=si/xibar
cvxi
[1] 0.3963675 0.4445489 0.5061517 0.5215114
cvxip=cvxi*100
cvxip
[1] 39.63675 44.45489 50.61517 52.15114
round(cvxip,2)
[1] 39.64 44.45 50.62 52.15
#(b)
cor(x1,x2)
[1] 0.9174448
cor(x1,x3)
[1] 0.9299678
cor(x1,x4)
[1] 0.4746911
cor(x2,x3)
[1] 0.7899575
cor(x2,x4)
[1] 0.4291329
cor(x3,x4)
[1] 0.2987613
## Crosscheck:
sum((x3-mean(x3))*(x4-mean(x4))/(10-1))/(sd(x3)*sd(x4))
[1] 0.2987613
round(cor(x1,x2),3)
[1] 0.917
round(cor(x1,x3),3)
[1] 0.93
round(cor(x1,x4),3)
[1] 0.475
round(cor(x2,x3),3)
[1] 0.79
round(cor(x2,x4),3)
[1] 0.429
round(cor(x3,x4),3)
[1] 0.299
round(cor(x1,x2)^2,3)
[1] 0.842
round(cor(x1,x3)^2,3)
[1] 0.865
round(cor(x1,x4)^2,3)
[1] 0.225
round(cor(x2,x3)^2,3)
[1] 0.624
round(cor(x2,x4)^2,3)
[1] 0.184
round(cor(x3,x4)^2,3)
[1] 0.089
round(100*cor(x1,x2)^2,1)
[1] 84.2
#(c)
# x1=response
fit=lm(x1~x2+x3+x4)
fit
Call:
lm(formula = x1 ~ x2 + x3 + x4)
Coefficients:
(Intercept) x2 x3 x4
7.6760 3.6616 7.6211 0.8285
s=summary(fit)
s
Call:
lm(formula = x1 ~ x2 + x3 + x4)
Residuals:
Min 1Q Median 3Q Max
-12.4384 -3.1695 0.8499 3.5134 9.6207
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.6760 6.7602 1.135 0.2995
x2 3.6616 1.1178 3.276 0.0169 *
x3 7.6211 1.6573 4.598 0.0037 **
x4 0.8285 0.5394 1.536 0.1754
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.541 on 6 degrees of freedom
Multiple R-squared: 0.9668, Adjusted R-squared: 0.9502
F-statistic: 58.22 on 3 and 6 DF, p-value: 7.913e-05
# Extracting r^2 from summary:
Rsq=s$r.squared
Rsq
[1] 0.9667888
round(100*Rsq,1)
[1] 96.7
#(d)
# Regression equation:
beta=coef(fit)
beta
(Intercept) x2 x3 x4
7.6760280 3.6616044 7.6210501 0.8284682
beta=round(beta,2)
beta
(Intercept) x2 x3 x4
7.68 3.66 7.62 0.83
# Thus, regression equation is: x1 = 7.68 + 3.66*x2 + 7.62*x3 + 0.83*x4
# Check: shift every x3 by 0.6 and refit; only the intercept should change
newx3=x3+0.6
newx3
[1] 5.7 6.4 2.7 9.0 3.5 1.8 4.3 8.2 8.3 5.1
newfit=lm(x1~x2+newx3+x4)
newfit
Call:
lm(formula = x1 ~ x2 + newx3 + x4)
Coefficients:
(Intercept) x2 newx3 x4
3.1034 3.6616 7.6211 0.8285
news=summary(newfit)
news
Call:
lm(formula = x1 ~ x2 + newx3 + x4)
Residuals:
Min 1Q Median 3Q Max
-12.4384 -3.1695 0.8499 3.5134 9.6207
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.1034 6.9784 0.445 0.6721
x2 3.6616 1.1178 3.276 0.0169 *
newx3 7.6211 1.6573 4.598 0.0037 **
x4 0.8285 0.5394 1.536 0.1754
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.541 on 6 degrees of freedom
Multiple R-squared: 0.9668, Adjusted R-squared: 0.9502
F-statistic: 58.22 on 3 and 6 DF, p-value: 7.913e-05
newbeta=coef(newfit)
newbeta
(Intercept) x2 newx3 x4
3.1033980 3.6616044 7.6210501 0.8284682
nbeta=round(newbeta,2)
nbeta
(Intercept) x2 newx3 x4
3.10 3.66 7.62 0.83
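# Refit in terms of newx3: x1 = 3.10 + 3.66*x2 + 7.62*newx3 + 0.83*x4
# The slopes are unchanged; the intercept drops by 0.6*7.6211 = 4.57.
# Expected change in x1 for a 0.6 increase in x3 (x2, x4 fixed): 0.6*7.62 = 4.57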
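The commands above stop at part (d). Parts (e) through (h) could be approached along the following lines; this is a sketch under the usual linear-model assumptions, not the original poster's work, and the object names (`movies`, `tests`, `ci90`, `pred_g`, `fit_h`, `pred_h`) are illustrative. For (e), the t values and P-values are the ones already printed in the summary output above (x2: t = 3.28, P = 0.017; x3: t = 4.60, P = 0.004; x4: t = 1.54, P = 0.175), so at the 5% level the null is rejected for β2 and β3 but not for β4.

```r
# Data from the problem statement (all values in millions of dollars)
movies <- data.frame(
  x1 = c(85.1, 106.3, 50.2, 130.6, 54.8, 30.3, 79.4, 91.0, 135.4, 89.3),
  x2 = c(8.5, 12.9, 5.2, 10.7, 3.1, 3.5, 9.2, 9.0, 15.1, 10.2),
  x3 = c(5.1, 5.8, 2.1, 8.4, 2.9, 1.2, 3.7, 7.6, 7.7, 4.5),
  x4 = c(4.7, 8.8, 15.1, 12.2, 10.6, 3.5, 9.7, 5.9, 20.8, 7.9)
)
fit <- lm(x1 ~ x2 + x3 + x4, data = movies)

# (e) t statistic and two-sided P-value for H0: beta_i = 0
tests <- coef(summary(fit))[c("x2", "x3", "x4"), c("t value", "Pr(>|t|)")]
round(tests, 3)

# (f) 90% confidence intervals for the slope coefficients
ci90 <- confint(fit, parm = c("x2", "x3", "x4"), level = 0.90)
round(ci90, 2)

# (g) Prediction and 85% prediction interval for the new movie
new_movie <- data.frame(x2 = 11.4, x3 = 4.7, x4 = 8.1)
pred_g <- predict(fit, newdata = new_movie,
                  interval = "prediction", level = 0.85)
round(pred_g, 1)

# (h) New model with x3 as the response, then an 80% prediction interval
fit_h <- lm(x3 ~ x1 + x2 + x4, data = movies)
round(coef(fit_h), 2)

planned <- data.frame(x1 = 100, x2 = 12, x4 = 9.2)
pred_h <- predict(fit_h, newdata = planned,
                  interval = "prediction", level = 0.80)
round(pred_h, 2)
```

The `interval = "prediction"` option is used in (g) and (h) because the question asks about a single new movie rather than the mean response; whether the textbook expects that or `interval = "confidence"` is an assumption worth checking against the course software.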