You wish to estimate as precisely as possible the slope β1 in the simple linear regression model yi = β0 + β1xi + ei, i = 1, ..., 4. Each pair of observations (xi, yi) costs $1.00 and your budget is $4.00. A data analyst proposes that you consider one of the following two options:
(a) Make two y-observations at x = 1 and a further two at x = 4;
(b) Make one y-observation at each of the points x = 1, 2, 3 and 4.
Which of the two options would give you the most bang for your bucks? Show the relevant calculation to justify your choice.
This is an experimental design problem.
We need to estimate the slope β1 as precisely as possible.
We are given two ways of spending the four observations and have to identify which of the two gives the better estimate.
The given regression equation is
yi = β0 + β1xi + ei, i = 1, ..., 4
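For a general regression equation as given above, the least-squares estimate of the slope β1 of the independent variable xi is given by
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \]
where xi is the i-th value of the independent variable, x̄ and ȳ are the sample means of the xi and yi, and N is the number of observations.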
The variation of the data points is measured by
\[ s^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]
where μ is the population mean.
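As a minimal sketch of the least-squares slope computation, the Python snippet below works from deviations about the sample means. The function name `ols_slope` and the y-values are made up for illustration; they are not part of the question.

```python
def ols_slope(x, y):
    """Least-squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Cross-deviation sum (numerator) and squared-deviation sum (denominator).
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data lying exactly on y = 2 + 3x, so the slope should be 3.
x = [1, 2, 3, 4]
y = [5, 8, 11, 14]
slope = ols_slope(x, y)
print(slope)
```

Note that the denominator depends only on the x values, which are the part of the experiment we get to choose.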
Now consider what the estimate of β1 and s² look like for the two experiments.
Note: both experiments cost the same $4, so cost does not separate them.
We need to find which experiment gives the most precise estimate of β1.
Let's compare!
In the first experiment, we observe y at only two distinct values of x, each repeated twice: x = 1, 1, 4, 4.
In the second experiment, we observe y once at each of x = 1, 2, 3 and 4.
In both cases the sample mean of x is the same, x̄ = 2.5, so the two designs do not differ in how well the mean of x is pinned down.
So think now!
What actually controls the precision of the slope estimate?
Under the usual assumptions (independent errors ei with common variance σ²), the variance of the least-squares slope is
\[ \mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{N} (x_i - \bar{x})^2} \]
so the more spread out the x values are about their mean, the smaller the variance of the slope.
Option (a): Σ(xi − x̄)² = 2(1 − 2.5)² + 2(4 − 2.5)² = 4.5 + 4.5 = 9, so Var = σ²/9.
Option (b): Σ(xi − x̄)² = (1 − 2.5)² + (2 − 2.5)² + (3 − 2.5)² + (4 − 2.5)² = 2.25 + 0.25 + 0.25 + 2.25 = 5, so Var = σ²/5.
Since σ²/9 < σ²/5, option (a) gives the more precise estimate of β1. Placing the observations at the two extremes maximizes the spread of x for the same $4 budget.
The price of option (a) is that, with only two distinct x values, it cannot reveal whether the relationship is actually linear, but the question asks only about the precision of the slope.
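As a numerical check on the comparison, the sketch below computes Σ(xi − x̄)² for the two designs. It assumes the standard result Var(β̂1) = σ²/Σ(xi − x̄)² for independent errors with common variance σ²; the names `sxx`, `design_a` and `design_b` are illustrative only.

```python
def sxx(x):
    """Sum of squared deviations of x about its sample mean."""
    xbar = sum(x) / len(x)
    return sum((xi - xbar) ** 2 for xi in x)


design_a = [1, 1, 4, 4]  # option (a): two observations at x = 1, two at x = 4
design_b = [1, 2, 3, 4]  # option (b): one observation at each of x = 1, 2, 3, 4

sxx_a = sxx(design_a)
sxx_b = sxx(design_b)

# Var(slope) = sigma^2 / Sxx, so sigma^2 cancels in the ratio of the
# two variances: Var_a / Var_b = Sxx_b / Sxx_a.
var_ratio = sxx_b / sxx_a

print(sxx_a, sxx_b, var_ratio)
```

The sums come out as 9 for option (a) and 5 for option (b), so the slope variance under (a) is 5/9 of the variance under (b), whatever the value of σ².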