Note that Walmart's fiscal year starts in the first week of February. When analyzing the data, week 26 is therefore calendar week 30 of 2002 (26 + 4 weeks for January), or the end of July 2002. Likewise, week 52 is calendar week 4 of 2003 (52 + 4 weeks for January 2002, minus 52 weeks for 2002), or the end of January 2003. As an example, the spike in sales (revenue) at week 75 falls in calendar week 27 of 2003 (75 + 4 weeks for January 2002, minus 52 weeks for 2002), or the first week of July 2003. This corresponds to sales for the July 4th holiday, when people buy barbecue-related items.
1. Identify spikes (outliers) in the data where extreme sales values occur, and correlate these spikes with actual calendar dates in 2002 or 2003 and with holidays or special events that may occur during these periods.
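The week arithmetic above can be sketched as a small helper. This is only an illustration: the function name and constants are hypothetical, and it assumes exactly 4 weeks of January and 52 weeks per calendar year, as the note does.

```python
# Assumptions (from the note above): fiscal week 1 is the first week of
# February 2002, January accounts for 4 weeks, and each year has 52 weeks.
JANUARY_WEEKS = 4
WEEKS_PER_YEAR = 52
FISCAL_START_YEAR = 2002


def fiscal_to_calendar(fiscal_week):
    """Return (calendar_year, calendar_week) for a fiscal week number."""
    shifted = fiscal_week + JANUARY_WEEKS
    # Subtract 1 before dividing so calendar week 52 stays in the same year.
    years_elapsed, week_index = divmod(shifted - 1, WEEKS_PER_YEAR)
    return FISCAL_START_YEAR + years_elapsed, week_index + 1


for week in (26, 52, 75):
    year, cal_week = fiscal_to_calendar(week)
    print(f"fiscal week {week} -> calendar week {cal_week} of {year}")
```

For weeks 26, 52 and 75 this reproduces the calendar weeks worked out in the note (week 30 of 2002, week 4 of 2003 and week 27 of 2003).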
2. Modeling the data linearly -
a. Generate a linear model for this data by choosing two points.
b. Generate a least squares linear regression model for this data.
c. How good is this regression model? Output and discuss the R² value.
d. What are the marginal sales (derivative, i.e. rate of change) for this department using the linear model with two data points and the regression model?
e. Compare the two models. Which do you feel is better?
f. Remove appropriate outliers as you deem necessary and rerun the linear regression model. What are the marginal sales? Discuss any improvements.
3. Modeling the data quadratically -
a. Generate a quadratic model for this data. Also output and discuss the R² value.
b. What are the marginal sales for this department using this model?
c. Calculate the model-generated relative max/min value. Show backup analytical work.
d. Compare the actual and model-generated relative max/min values.
e. Remove outliers and rerun the quadratic least squares model. What are the marginal sales? Discuss any improvements.
4. Comparing models
a. Based on all models run, which model do you feel best predicts future trends? Explain your rationale.
b. Based on the model selected, what type of seasonal adjustments, if any, would be required to meet customer needs?
Week | Sales in dollars |
26 | 15200 |
27 | 15600 |
28 | 16400 |
29 | 15600 |
30 | 14200 |
31 | 14400 |
32 | 16400 |
33 | 15200 |
34 | 14400 |
35 | 13800 |
36 | 15000 |
37 | 14100 |
38 | 14400 |
39 | 14000 |
40 | 15600 |
41 | 15000 |
42 | 14400 |
43 | 17800 |
44 | 15000 |
45 | 15200 |
46 | 15800 |
47 | 18600 |
48 | 15400 |
49 | 15500 |
50 | 16800 |
51 | 18700 |
52 | 21400 |
53 | 20900 |
54 | 18800 |
55 | 22400 |
56 | 19400 |
57 | 20000 |
58 | 18100 |
59 | 18000 |
60 | 19600 |
61 | 19000 |
62 | 19200 |
63 | 18000 |
64 | 17600 |
65 | 17200 |
66 | 19800 |
67 | 19600 |
68 | 19600 |
69 | 20000 |
70 | 20800 |
71 | 22800 |
72 | 23000 |
73 | 20800 |
74 | 25000 |
75 | 30600 |
76 | 24000 |
77 | 21200 |
1)
From the box plot of the sales values, we can conclude that 30600 (week 75) is an outlier. It corresponds to the week of the July 4th holiday in 2003.
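The box plot check can be reproduced numerically. The sketch below assumes the usual 1.5 × IQR whisker rule and inclusive quartiles; the tool that produced the original box plot may use a slightly different quartile definition, so treat the fences as approximate.

```python
import statistics

weeks = list(range(26, 78))
sales = [
    15200, 15600, 16400, 15600, 14200, 14400, 16400, 15200, 14400, 13800,  # weeks 26-35
    15000, 14100, 14400, 14000, 15600, 15000, 14400, 17800, 15000, 15200,  # weeks 36-45
    15800, 18600, 15400, 15500, 16800, 18700, 21400, 20900, 18800, 22400,  # weeks 46-55
    19400, 20000, 18100, 18000, 19600, 19000, 19200, 18000, 17600, 17200,  # weeks 56-65
    19800, 19600, 19600, 20000, 20800, 22800, 23000, 20800, 25000, 30600,  # weeks 66-75
    24000, 21200,                                                          # weeks 76-77
]

# Quartiles of the sales values (assumed quartile method: "inclusive").
q1, _median, q3 = statistics.quantiles(sales, n=4, method="inclusive")
iqr = q3 - q1

# Assumed whisker rule: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [(w, s) for w, s in zip(weeks, sales) if s < lower_fence or s > upper_fence]
print("fences:", lower_fence, upper_fence)
print("outliers (week, sales):", outliers)
```

Under these assumptions only week 75 falls outside the fences, which agrees with the box plot reading.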
2)
a)
Using the first and last data points, (26, 15200) and (77, 21200), we get the linear equation:
y-15200 = {(21200-15200)/(77-26)} (x-26)
so,
y = 117.65 x + 12141 or
Sales = 117.65 Week + 12141
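As a quick check of the arithmetic, the two-point line can be computed directly. The helper name below is hypothetical; the two points are the ones chosen above.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the straight line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)      # rise over run
    intercept = y1 - slope * x1        # solve y1 = slope * x1 + intercept
    return slope, intercept


two_point_slope, two_point_intercept = line_through((26, 15200), (77, 21200))
print(f"Sales = {two_point_slope:.2f} Week + {two_point_intercept:.2f}")
```

The slope is 6000/51, which rounds to the 117.65 quoted above.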
b)
Regression equation:
Sales = 181 Week + 8741.97
c)
The R² of the model is 0.6506 (65.06%), i.e., about 65% of the variation in sales (the dependent variable) is explained by the week (the independent variable).
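The regression equation and R² can be reproduced with the textbook least squares formulas. This is a minimal sketch in plain Python, not necessarily the tool used to produce the figures above; the function name is hypothetical.

```python
weeks = list(range(26, 78))
sales = [
    15200, 15600, 16400, 15600, 14200, 14400, 16400, 15200, 14400, 13800,  # weeks 26-35
    15000, 14100, 14400, 14000, 15600, 15000, 14400, 17800, 15000, 15200,  # weeks 36-45
    15800, 18600, 15400, 15500, 16800, 18700, 21400, 20900, 18800, 22400,  # weeks 46-55
    19400, 20000, 18100, 18000, 19600, 19000, 19200, 18000, 17600, 17200,  # weeks 56-65
    19800, 19600, 19600, 20000, 20800, 22800, 23000, 20800, 25000, 30600,  # weeks 66-75
    24000, 21200,                                                          # weeks 76-77
]


def least_squares(xs, ys):
    """Return (slope, intercept, r_squared) of the simple linear regression of ys on xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return slope, intercept, r_squared


slope, intercept, r_squared = least_squares(weeks, sales)
print(f"Sales = {slope:.2f} Week + {intercept:.2f}, R^2 = {r_squared:.4f}")
```

For a straight-line fit, the slope is also the marginal sales (the derivative of sales with respect to week), so it can be read off directly.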
d)
The rate of change (marginal sales) using the model from part a is 117.65 dollars per week.
The rate of change (marginal sales) using the model from part b is 181 dollars per week.
e)
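The model from part b (the least squares regression) is better, because it is fitted to all 52 data points rather than only the two endpoints.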
f)
After removing the outlier (week 75, sales of 30600):
Regression equation:
Sales = 163.2 Week + 9488.06
The R² of the model is 0.6913 (69.13%), i.e., about 69% of the variation in sales is explained by the week, up from 65.06% with the outlier included.
The rate of change (marginal sales) is 163.2 dollars per week.
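The rerun can be sketched by dropping week 75 and refitting with the same least squares formulas. The function name and the `excluded_weeks` set are hypothetical; which points to drop is a judgment call, and here only the single outlier identified in part 1 is removed.

```python
weeks = list(range(26, 78))
sales = [
    15200, 15600, 16400, 15600, 14200, 14400, 16400, 15200, 14400, 13800,  # weeks 26-35
    15000, 14100, 14400, 14000, 15600, 15000, 14400, 17800, 15000, 15200,  # weeks 36-45
    15800, 18600, 15400, 15500, 16800, 18700, 21400, 20900, 18800, 22400,  # weeks 46-55
    19400, 20000, 18100, 18000, 19600, 19000, 19200, 18000, 17600, 17200,  # weeks 56-65
    19800, 19600, 19600, 20000, 20800, 22800, 23000, 20800, 25000, 30600,  # weeks 66-75
    24000, 21200,                                                          # weeks 76-77
]


def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least squares line of ys on xs."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean, sxy ** 2 / (sxx * syy)


# Assumption: only the July 4th week (week 75) is treated as an outlier.
excluded_weeks = {75}
kept = [(w, s) for w, s in zip(weeks, sales) if w not in excluded_weeks]
kept_weeks = [w for w, _ in kept]
kept_sales = [s for _, s in kept]

slope_clean, intercept_clean, r2_clean = fit_line(kept_weeks, kept_sales)
print(f"Sales = {slope_clean:.1f} Week + {intercept_clean:.2f}, R^2 = {r2_clean:.4f}")
```

Removing a single high point near the end of the range lowers the slope and raises R², which is the improvement described above.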
PS: We are only allowed to answer 4 parts per question.