Derivation, meaning and interpretation of estimated (parameter) coefficients.
Explain the factors that determine the quality of the regression equation.
Explain the nature and function of the OLS assumptions.
Explain the process of hypothesis testing for the parameters: tests for
estimated coefficients and for the overall equation.
Explain the source and derivation of the t-test, its use, and possible abuses.
A: The regression equation describes the relationship between two
variables and is given by the general format:
Formula 2.40
Y = a + bX + ε
Where: Y = dependent variable; X = independent variable;
a = intercept of regression line; b = slope of regression line;
ε = error term
In this format, given that Y is dependent on X, the slope b
indicates the unit changes in Y for every unit change in X. If b =
0.66, it means that every time X increases (or decreases) by a
certain amount, Y increases (or decreases) by 0.66 times that amount. The
intercept a indicates the value of Y at the point where X = 0. Thus
if X indicated market returns, the intercept would show how the
dependent variable performs when the market has a flat quarter
where returns are 0. In investment parlance, a manager is said to have a
positive alpha when a linear regression between the manager's
performance and the performance of the market has an intercept
a greater than 0.
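As a minimal sketch of how these coefficients might be estimated in practice, the following Python snippet fits an ordinary least-squares line to hypothetical quarterly return data (the numbers are assumptions for illustration only, not figures from this text):

```python
import numpy as np

# Hypothetical quarterly returns (%): market (X) and a fund (Y)
market = np.array([2.0, -1.5, 3.2, 0.0, 1.1, -0.4, 2.5, 1.8])
fund   = np.array([2.6, -1.2, 3.9, 0.3, 1.5, -0.1, 3.1, 2.2])

# Ordinary least-squares fit of: fund = a + b * market
b, a = np.polyfit(market, fund, deg=1)

print(f"slope b (beta)      = {b:.3f}")  # unit change in Y per unit change in X
print(f"intercept a (alpha) = {a:.3f}")  # expected Y when the market return is 0
```

A positive fitted intercept in a regression like this is what the text above describes as positive alpha.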
Linear Regression - Assumptions
Drawing conclusions about the dependent variable requires that we
make six assumptions, the classic assumptions in relation to the
linear regression model:
The relationship between the dependent variable Y and the
independent variable X is linear in the slope and intercept
parameters a and b. This requirement means that neither regression
parameter can be multiplied or divided by another regression
parameter (e.g. a/b), and that both parameters are raised to the
first power only. In other words, we could not construct a linear model
where the equation was Y = a + b²X + ε, as unit changes in X would
then have a b² effect on Y, and the relation would be
nonlinear.
The independent variable X is not random.
The expected value of the error term ε is 0. Assumptions #2 and
#3 allow the linear regression model to produce estimates for slope
b and intercept a.
The variance of the error term is constant for all observations.
Assumption #4 is known as the "homoskedasticity assumption". When a
linear regression is heteroskedastic, the variance of its error terms
differs across observations and the model may not be useful in
predicting values of the dependent variable.
The error term ε is uncorrelated across observations; in other
words, the covariance between the error term of one observation and
the error term of any other is assumed to be 0. This assumption is
necessary to estimate the variances of the parameters.
The distribution of the error terms is normal. Assumption #6 allows
hypothesis-testing methods to be applied to linear-regression
models.
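As an informal illustration of assumptions #3, #4 and #6 (not a substitute for formal diagnostic tests), the sketch below simulates data that satisfies the classic assumptions, fits a line, and inspects the residuals; the simulated data and the split-sample variance check are assumptions made only for this example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=x.size)  # simulated data meeting the assumptions

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

print("mean of residuals:", residuals.mean())                       # assumption #3: close to 0
first, second = residuals[:25], residuals[25:]
print("variance, 1st vs 2nd half:", first.var(), second.var())      # rough homoskedasticity check (#4)
print("normality (Shapiro-Wilk p-value):", stats.shapiro(residuals).pvalue)  # assumption #6
```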
Standard Error of Estimate
Abbreviated SEE, this measure gives an indication of how well a
linear regression model is working. It compares actual values in
the dependent variable Y to the predicted values that would have
resulted had Y followed exactly from the linear regression. For
example, take a case where a company's financial analyst has
developed a regression model relating annual GDP growth to company
sales growth by the equation Y = 1.4 + 0.8X.
Assume the following experience over a five-year period; the predicted
data are a function of the model and GDP growth, and the "actual" data
indicate what happened at the company:
Year | GDP growth (Xi) | Predicted co. growth (Ŷi) | Actual co. growth (Yi) | Residual (Yi - Ŷi) | Squared residual
1    |  5.1            | 5.5                       | 5.2                    | -0.3               | 0.09
2    |  2.1            | 3.1                       | 2.7                    | -0.4               | 0.16
3    | -0.9            | 0.7                       | 1.5                    |  0.8               | 0.64
4    |  0.2            | 1.6                       | 3.1                    |  1.5               | 2.25
5    |  6.4            | 6.5                       | 6.3                    | -0.2               | 0.04
To find the standard error of the estimate, we take the sum of all
squared residual terms and divide by (n - 2), and then take the
square root of the result. In this case, the sum of the squared
residuals is 0.09 + 0.16 + 0.64 + 2.25 + 0.04 = 3.18. With five
observations, n - 2 = 3, and SEE = (3.18/3)^(1/2) ≈ 1.03%.
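The same arithmetic can be reproduced with a short Python sketch using the predicted and actual values from the table above:

```python
import numpy as np

predicted = np.array([5.5, 3.1, 0.7, 1.6, 6.5])  # predicted company growth from the table (%)
actual    = np.array([5.2, 2.7, 1.5, 3.1, 6.3])  # actual company growth (%)

residuals = actual - predicted
sse = np.sum(residuals ** 2)                     # 0.09 + 0.16 + 0.64 + 2.25 + 0.04 = 3.18
see = np.sqrt(sse / (len(actual) - 2))           # divide by n - 2 = 3, then take the square root
print(round(see, 2))                             # 1.03
```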
The computation for standard error is relatively similar to that of
standard deviation for a sample (n - 2 is used instead of n - 1).
It gives some indication of the predictive quality of a regression
model, with lower SEE numbers indicating that more accurate
predictions are possible. However, the standard-error measure
doesn't indicate the extent to which the independent variable
explains variations in the dependent variable.
Coefficient of Determination
Like the standard error, this statistic gives an indication of how
well a linear-regression model serves as an estimator of values for
the dependent variable. It works by measuring the fraction of total
variation in the dependent variable that can be explained by
variation in the independent variable.
In this context, total variation is made up of two fractions:
1 = (explained variation / total variation) + (unexplained variation / total variation)
The coefficient of determination, or explained variation as a
percentage of total variation, is the first of these two terms. It
is sometimes expressed as 1 - (unexplained variation / total
variation).
For a simple linear regression with one independent variable, the
simple method for computing the coefficient of determination is
squaring the correlation coefficient between the dependent and
independent variables. Since the correlation coefficient is given
by r, the coefficient of determination is popularly known as R²,
or "R-squared". For example, if the correlation coefficient is 0.76,
the R-squared is (0.76)² = 0.578. R-squared terms are usually
expressed as percentages; thus 0.578 would be 57.8%. A second
method of computing this number would be to find the total
variation in the dependent variable Y as the sum of the squared
deviations from the sample mean. Next, find the unexplained variation
in Y as the sum of the squared residuals (the same squared-residual
terms used in computing the standard error of the estimate). The
coefficient of determination is then computed as
(total variation in Y - unexplained variation in Y) / total
variation in Y. This second method is necessary for multiple
regressions, where there is more than one independent variable, but
for our context we will be provided the r (correlation coefficient)
to calculate an R-squared.
What R² tells us is the proportion of the changes in the dependent
variable Y that are explained by changes in the independent variable X.
An R² of 57.8% tells us that 57.8% of the changes in Y are explained by
X; it also means that 1 - 57.8%, or 42.2%, of the changes in Y are
unexplained by X and are the result of other factors. So the higher the
R-squared, the better the predictive nature of the
linear-regression model.
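A short sketch showing both computations of the coefficient of determination follows; the r = 0.76 figure is the one used above, and the second method reuses the predicted/actual values from the SEE example (so the R² it prints belongs to that data set, not 57.8%):

```python
import numpy as np

# Method 1: square the correlation coefficient (single independent variable)
r = 0.76
print(round(r ** 2, 3))   # 0.578, i.e. 57.8% of the variation in Y is explained by X

# Method 2: (total variation - unexplained variation) / total variation,
# using the predicted/actual values from the SEE example above
predicted = np.array([5.5, 3.1, 0.7, 1.6, 6.5])
actual    = np.array([5.2, 2.7, 1.5, 3.1, 6.3])
total_variation       = np.sum((actual - actual.mean()) ** 2)
unexplained_variation = np.sum((actual - predicted) ** 2)
print(round((total_variation - unexplained_variation) / total_variation, 3))
```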
Regression Coefficients
For either regression coefficient (intercept a, or slope b), a
confidence interval can be determined with the following
information:
An estimated parameter value from a sample
Standard error of the estimated coefficient
Significance level for the t-distribution
Degrees of freedom (which is sample size - 2)
For a slope coefficient, the formula for the confidence interval is
given by b ± tc*sb, where sb is the standard error of the slope
estimate and tc is the critical t-value at our chosen significance level.
To illustrate, take a linear regression with a mutual fund's
returns as the dependent variable and the S&P 500 index as the
independent variable. For five years of quarterly returns, the
slope coefficient b is found to be 1.18, with a standard error of
0.147. Student's t-distribution for 18 degrees of
freedom (20 quarters - 2) at a 0.05 significance level is 2.101.
This data gives us a confidence interval of 1.18 ± (0.147)*(2.101),
or a range of 0.87 to 1.49. Our interpretation is that there is
only a 5% chance that the slope of the population is either less
than 0.87 or greater than 1.49 - we are 95% confident that this
fund is at least 87% as volatile as the S&P 500, but no more
than 149% as volatile, based on our five-year sample.
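The same confidence interval can be reproduced with scipy, using the sample slope and standard error quoted above:

```python
from scipy import stats

b  = 1.18           # estimated slope from the five-year sample
sb = 0.147          # standard error of the slope estimate
n  = 20             # quarterly observations

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)            # two-tailed critical value, ~2.101
low, high = b - t_crit * sb, b + t_crit * sb
print(round(t_crit, 3), round(low, 2), round(high, 2))  # ~2.101, 0.87, 1.49
```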
Hypothesis testing and Regression Coefficients
Regression coefficients are frequently tested using the
hypothesis-testing procedure. Depending on what the analyst is
intending to prove, we can test a slope coefficient to determine
whether it explains changes in the dependent variable, and the
extent to which it explains those changes. Betas (slope coefficients) can
be determined to be either above or below 1 (more volatile or less
volatile than the market). Alphas (the intercept coefficient) can
be tested on a regression between a mutual fund and the relevant
market index to determine whether there is evidence of a
sufficiently positive alpha (suggesting value added by the fund
manager).
The mechanics of hypothesis testing are similar to the examples we
have used previously. A null hypothesis is chosen based on a
not-equal-to, greater-than or less-than-case, with the alternative
satisfying all values not covered in the null case. Suppose in our
previous example where we regressed a mutual fund's returns on the
S&P 500 for 20 quarters our hypothesis is that this mutual fund
is more volatile than the market. A fund equal in volatility to the
market will have slope b of 1.0, so for this hypothesis test, we
state the null hypothesis (H0) as the case where the slope is less than
or equal to 1.0 (i.e. H0: b ≤ 1.0). The alternative hypothesis
Ha has b > 1.0. We know that this is a greater-than case (i.e.
one-tailed) - if we assume a 0.05 significance level, t is equal to
1.734 at degrees of freedom = n - 2 = 18.
Example: Interpreting a Hypothesis Test
From our sample, we had estimated b of 1.18 and standard error of
0.147. Our test statistic is computed with this formula: t =
(estimated coefficient - hypothesized coefficient) / standard error =
(1.18 - 1.0)/0.147 = 0.18/0.147, or t = 1.224.
For this example, our calculated test statistic is below the
rejection level of 1.734, so we are not able to reject the null
hypothesis that the fund is no more volatile than the market.
Interpretation: the hypothesis that b > 1 for this fund probably
needs more observations (degrees of freedom) to be proven with
statistical significance. Also, with 1.18 only slightly above 1.0,
it is quite possible that this fund is actually not as volatile as
the market, and we were correct to not reject the null
hypothesis.
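A minimal sketch of this one-tailed test, using the same numbers, is shown below:

```python
from scipy import stats

b, b0, sb, n = 1.18, 1.0, 0.147, 20

t_stat = (b - b0) / sb                           # (1.18 - 1.0) / 0.147 ≈ 1.224
t_crit = stats.t.ppf(1 - 0.05, df=n - 2)         # one-tailed critical value, ~1.734

print(round(t_stat, 3), round(t_crit, 3))
print("reject H0 (b <= 1.0)?", t_stat > t_crit)  # False: cannot conclude the fund is more volatile
```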
Example: Interpreting a regression coefficient
The CFA exam is likely to give the summary statistics of a linear
regression and ask for interpretation. To illustrate, assume the
following statistics for a regression between a small-cap growth
fund and the Russell 2000 index:
Correlation coefficient 0.864
Intercept -0.417
Slope 1.317
What do each of these numbers tell us?
About 75% of the variation in the fund is explained by changes in the
Russell 2000 index. This is true because the square of the
correlation coefficient, (0.864)² = 0.746, gives us the coefficient
of determination, or R-squared.
The fund will slightly underperform the index when index returns
are flat. This results from the value of the intercept being
-0.417. When X = 0 in the regression equation, the dependent
variable is equal to the intercept.
The fund will on average be more volatile than the index. This fact
follows from the slope of the regression line of 1.317 (i.e. for
every 1% change in the index, we expect the fund's return to change
by 1.317%).
The fund will outperform in strong market periods, and underperform
in weak markets. This fact follows from the regression. Additional
risk is compensated with additional reward, with the reverse being
true in down markets. Predicted values of the fund's return, given
a return for the market, can be found by solving for Y = -0.417 +
1.317X (X = Russell 2000 return).
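As a small illustration, the estimated equation can be wrapped in a helper function (a hypothetical helper, named only for this example) to generate predicted fund returns for different index returns:

```python
def predicted_fund_return(russell_2000_return: float) -> float:
    """Predicted small-cap fund return (%) given a Russell 2000 return (%)."""
    return -0.417 + 1.317 * russell_2000_return

# Underperforms when the index is flat or weak, outperforms in strong markets
for index_return in (-5.0, 0.0, 5.0):
    print(index_return, round(predicted_fund_return(index_return), 2))
```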
Analysis of Variance (ANOVA)
Analysis of variance, or ANOVA, is a procedure in which the total
variability of a random variable is subdivided into components so
that it can be better understood, or attributed to each of the
various sources that cause the number to vary.
Applied to regression parameters, ANOVA techniques are used to
determine the usefulness of a regression model, and the degree to
which changes in an independent variable X can be used to explain
changes in a dependent variable Y. For example, we can conduct a
hypothesis-testing procedure to determine whether slope
coefficients are equal to zero (i.e. the variables are unrelated),
or if there is statistical meaning to the relationship (i.e. the
slope b is different from zero). An F-test can be used for this
process.
F-Test
The formula for F-statistic in a regression with one independent
variable is given by the following:
Formula 2.41
F = mean regression sum of squares / mean squared error
= (RSS/1) / [SSE/(n - 2)]
The two abbreviations to understand are RSS and SSE:
RSS, or the regression sum of squares, is the amount of total
variation in the dependent variable Y that is explained in the
regression equation. The RSS is calculated by computing each
deviation between a predicted Y value and the mean Y value,
squaring the deviation and adding up all terms. If an independent
variable explains none of the variations in a dependent variable,
then the predicted values of Y are equal to the average value, and
RSS = 0.
SSE, or the sum of squared errors (the residual sum of squares), is
calculated by finding the deviation between each predicted Y and the
actual Y, squaring the result and adding up all terms.
TSS, or total variation, is the sum of RSS and SSE. In other
words, this ANOVA process breaks variance into two parts: one that
is explained by the model and one that is not. Essentially, for a
regression equation to have high predictive quality, we need to see
a high RSS and a low SSE, which will make the ratio (RSS/1)/[SSE/(n
- 2)] high and (based on a comparison with a critical F-value)
statistically meaningful. The critical value is taken from the
F-distribution and is based on degrees of freedom.
For example, with 20 observations, the degrees of freedom would be 1 and
n - 2 = 18, giving a critical value (from the F-table at the 0.05 level)
of about 4.41. If RSS were 2.5 and SSE were 1.8, then the computed test
statistic would be F = (2.5/1)/(1.8/18) = 25, which is above the
critical value, indicating that the regression equation has predictive
quality (b is different from 0).
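The F-statistic and the critical value for df = (1, 18) can be reproduced with scipy:

```python
from scipy import stats

rss, sse, n = 2.5, 1.8, 20

f_stat = (rss / 1) / (sse / (n - 2))             # (2.5/1) / (1.8/18) = 25
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)     # critical value for df = (1, 18), ~4.41

print(round(f_stat, 2), round(f_crit, 2))
print("slope significantly different from 0?", f_stat > f_crit)
```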
Estimating Economic Statistics with Regression Models
Regression models are frequently used to estimate economic
statistics such as inflation and GDP growth.
A linear regression line is usually determined quantitatively by a
best-fit procedure such as least
squares (i.e. the distance between the regression line and every
observation is minimized). In linear regression, one variable is
plotted on the X axis and the other on the Y. The X variable is
said to be the independent variable, and the Y is said to be the
dependent variable. When analyzing two random variables, you must
choose which variable is independent and which is dependent. The
choice of independent and dependent follows from the hypothesis -
for many examples, this distinction should be intuitive. The most
popular use of regression analysis is on investment returns, where
the market index is independent while the individual security or
mutual fund is dependent on the market. In essence, regression
analysis formulates a hypothesis that the movement in one variable
(Y) depends on the movement in the other (X).
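A minimal sketch of the least-squares calculation, using the closed-form formulas for slope and intercept on hypothetical market/fund return data (the numbers are assumptions for illustration only), is shown below:

```python
import numpy as np

# Hypothetical paired observations: X = market index return (%), Y = fund return (%)
x = np.array([2.0, -1.5, 3.2, 0.0, 1.1, -0.4, 2.5, 1.8])
y = np.array([2.6, -1.2, 3.9, 0.3, 1.5, -0.1, 3.1, 2.2])

# Closed-form least-squares estimates: minimize the sum of squared vertical distances
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(round(a, 3), round(b, 3))   # intercept and slope of the best-fit line
```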
What assumptions are made when conducting a t-test?
A:
The common assumptions made when conducting a t-test include those regarding the scale of measurement, random sampling, normality of the data distribution, adequacy of the sample size, and equality of variance.
The T-Test
The t-test was developed by a chemist working for the Guinness brewing company as a simple way to measure the consistent quality of stout. It was further developed and adapted, and now refers to any test of a statistical hypothesis in which the statistic being tested for is expected to correspond to a t-distribution if the null hypothesis is supported.
A t-test is an analysis of two population means through the use of statistical examination; a t-test with two samples is commonly used with small sample sizes, testing the difference between the samples when the variances of the two normal distributions are not known.
T-distribution is basically any continuous probability distribution that arises from an estimation of the mean of a normally distributed population using a small sample size and an unknown standard deviation for the population. The null hypothesis is the default assumption that no relationship exists between two different measured phenomena.
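A small two-sample t-test sketch using scipy follows; the two simulated samples are assumptions for illustration only, and Welch's variant is used since equal variances are not assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample_a = rng.normal(loc=5.0, scale=1.0, size=12)   # small samples drawn from two populations
sample_b = rng.normal(loc=5.8, scale=1.2, size=12)

# Two-sample t-test of H0: the two population means are equal
# (equal_var=False gives Welch's test, which does not assume equal variances)
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(round(t_stat, 3), round(p_value, 3))
```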
There are many variables that affect a single variable. For example,
in agriculture, humidity, fertilizer, pesticide, type of seed,
quantity of water, rainfall, weather conditions, the number of
tractors, and many other variables all have an effect. When you are
looking at the effect of one variable on another, you have to assume
ceteris paribus, that is, all other things held constant.
A variable can be strongly dependent on another, or it can be weakly
dependent; it depends on what your regression tells you. I have not
come across any range that defines weak, moderate or strong
dependence. In your case, however, the coefficient between the 2
variables is around 0.45, so you may call the relationship a strong
one, but do you really need to characterize the type of relationship
using statistical techniques?
Despite its popularity, interpretation of the regression coefficients of any but the simplest models is sometimes, well, difficult.
So let’s interpret the coefficients of a continuous and a categorical variable. Although the example here is a linear regression model, the approach works for interpreting coefficients from any regression model without interactions, including logistic and proportional hazards models.
A linear regression model with two predictor variables can be expressed with the following equation:
Y = B0 + B1*X1 + B2*X2 + e.
The variables in the model are:
Y, the response variable;
X1, the first predictor variable;
X2, the second predictor variable; and
e, the residual error, which is an unmeasured variable.
The parameters in the model are:
B0, the Y-intercept;
B1, the first regression coefficient; and
B2, the second regression coefficient.
One example would be a model of the height of a shrub (Y) based on the amount of bacteria in the soil (X1) and whether the plant is located in partial or full sun (X2).
Height is measured in cm, bacteria is measured in thousand per ml of soil, and type of sun = 0 if the plant is in partial sun and type of sun = 1 if the plant is in full sun.
Let’s say it turned out that the regression equation was estimated as follows:
Y = 42 + 2.3*X1 + 11*X2
Interpreting the Intercept
B0, the Y-intercept, can be interpreted as the value you would predict for Y if both X1 = 0 and X2 = 0.
We would expect an average height of 42 cm for shrubs in partial sun with no bacteria in the soil. However, this is only a meaningful interpretation if it is reasonable that both X1 and X2 can be 0, and if the data set actually included values for X1 and X2 that were near 0.
If neither of these conditions is true, then B0 really has no meaningful interpretation; it just anchors the regression line in the right place. In our case, it is easy to see that X2 is sometimes 0, but if X1, our bacteria level, never comes close to 0, then our intercept has no real interpretation.
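As a small illustration, the estimated shrub-height equation can be written as a function (a hypothetical helper, named only for this example) to show that the intercept is simply the prediction at X1 = 0 and X2 = 0:

```python
def predicted_height(bacteria: float, full_sun: int) -> float:
    """Predicted shrub height (cm); bacteria in thousands per ml, full_sun is 0 or 1."""
    return 42 + 2.3 * bacteria + 11 * full_sun

print(predicted_height(bacteria=0, full_sun=0))   # 42.0: the intercept B0
print(predicted_height(bacteria=10, full_sun=1))  # 76.0
```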