Derivation, meaning and interpretation of estimated (parameter) coefficients.
Explain the factors that determine the quality of the regression equation.
Explain the nature and function of the OLS assumptions.
Explain the process of hypothesis testing for the parameters: tests for
estimated coefficients and for the overall equation.
Explain the source and derivation of the t-test, its use, and possible abuses.
A: The regression equation describes the relationship between two
variables and is given by the general format:
Formula 2.40
Y = a + bX + ε
Where: Y = dependent variable; X = independent variable;
a = intercept of regression line; b = slope of regression line;
ε = error term
In this format, given that Y is dependent on X, the slope b
indicates the unit changes in Y for every unit change in X. If b =
0.66, it means that every time X increases (or decreases) by a
certain amount, Y increases (or decreases) by 0.66 times that amount. The
intercept a indicates the value of Y at the point where X = 0. Thus
if X indicated market returns, the intercept would show how the
dependent variable performs when the market has a flat quarter
where returns are 0. In investment parlance, a manager is said to have a
positive alpha when a linear regression between the manager's
performance and the performance of the market has an intercept
a greater than 0.
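As a minimal sketch of how these coefficients might be estimated in practice, the following Python snippet fits an ordinary least-squares line to hypothetical quarterly return data (the numbers are assumptions for illustration only, not figures from this text):

```python
import numpy as np

# Hypothetical quarterly returns (%): market (X) and a fund (Y)
market = np.array([2.0, -1.5, 3.2, 0.0, 1.1, -0.4, 2.5, 1.8])
fund   = np.array([2.6, -1.2, 3.9, 0.3, 1.5, -0.1, 3.1, 2.2])

# Ordinary least-squares fit of: fund = a + b * market
b, a = np.polyfit(market, fund, deg=1)

print(f"slope b (beta)      = {b:.3f}")  # unit change in Y per unit change in X
print(f"intercept a (alpha) = {a:.3f}")  # expected Y when the market return is 0
```

A positive fitted intercept in a regression like this is what the text above describes as positive alpha.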
Linear Regression - Assumptions
Drawing conclusions about the dependent variable requires that we
make six assumptions, the classic assumptions in relation to the
linear regression model:
The relationship between the dependent variable Y and the
independent variable X is linear in the slope and intercept
parameters a and b. This requirement means that neither regression
parameter can be multiplied or divided by another regression
parameter (e.g. a/b), and that both parameters are raised to the
first power only. In other words, we could not construct a linear model
where the equation was Y = a + b²X + ε, as unit changes in X would
then have a b² effect on Y, and the relation would be
nonlinear.
The independent variable X is not random.
The expected value of the error term ε is 0. Assumptions #2 and
#3 allow the linear regression model to produce estimates for slope
b and intercept a.
The variance of the error term is constant for all observations.
Assumption #4 is known as the "homoskedasticity assumption". When a
linear regression is heteroskedastic, the variance of its error terms
differs across observations and the model may not be useful in
predicting values of the dependent variable.
The error term ε is uncorrelated across observations; in other
words, the covariance between the error term of one observation and
the error term of any other is assumed to be 0. This assumption is
necessary to estimate the variances of the parameters.
The distribution of the error terms is normal. Assumption #6 allows
hypothesis-testing methods to be applied to linear-regression
models.
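As an informal illustration of assumptions #3, #4 and #6 (not a substitute for formal diagnostic tests), the sketch below simulates data that satisfies the classic assumptions, fits a line, and inspects the residuals; the simulated data and the split-sample variance check are assumptions made only for this example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=x.size)  # simulated data meeting the assumptions

b, a = np.polyfit(x, y, deg=1)
residuals = y - (a + b * x)

print("mean of residuals:", residuals.mean())                       # assumption #3: close to 0
first, second = residuals[:25], residuals[25:]
print("variance, 1st vs 2nd half:", first.var(), second.var())      # rough homoskedasticity check (#4)
print("normality (Shapiro-Wilk p-value):", stats.shapiro(residuals).pvalue)  # assumption #6
```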
Standard Error of Estimate
Abbreviated SEE, this measure gives an indication of how well a
linear regression model is working. It compares actual values in
the dependent variable Y to the predicted values that would have
resulted had Y followed exactly from the linear regression. For
example, take a case where a company's financial analyst has
developed a regression model relating annual GDP growth to company
sales growth by the equation Y = 1.4 + 0.8X.
Assume the following experience over a five-year period; the predicted
data are a function of the model and GDP growth, and the "actual" data
indicate what happened at the company:
Year | GDP growth (Xi) | Predicted co. growth (Ŷi) | Actual co. growth (Yi) | Residual (Yi - Ŷi) | Squared residual
1    |  5.1            | 5.5                       | 5.2                    | -0.3               | 0.09
2    |  2.1            | 3.1                       | 2.7                    | -0.4               | 0.16
3    | -0.9            | 0.7                       | 1.5                    |  0.8               | 0.64
4    |  0.2            | 1.6                       | 3.1                    |  1.5               | 2.25
5    |  6.4            | 6.5                       | 6.3                    | -0.2               | 0.04
To find the standard error of the estimate, we take the sum of all
squared residual terms and divide by (n - 2), and then take the
square root of the result. In this case, the sum of the squared
residuals is 0.09 + 0.16 + 0.64 + 2.25 + 0.04 = 3.18. With five
observations, n - 2 = 3, and SEE = (3.18/3)^(1/2) ≈ 1.03%.
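The same arithmetic can be reproduced with a short Python sketch using the predicted and actual values from the table above:

```python
import numpy as np

predicted = np.array([5.5, 3.1, 0.7, 1.6, 6.5])  # predicted company growth from the table (%)
actual    = np.array([5.2, 2.7, 1.5, 3.1, 6.3])  # actual company growth (%)

residuals = actual - predicted
sse = np.sum(residuals ** 2)                     # 0.09 + 0.16 + 0.64 + 2.25 + 0.04 = 3.18
see = np.sqrt(sse / (len(actual) - 2))           # divide by n - 2 = 3, then take the square root
print(round(see, 2))                             # 1.03
```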
The computation for standard error is relatively similar to that of
standard deviation for a sample (n - 2 is used instead of n - 1).
It gives some indication of the predictive quality of a regression
model, with lower SEE numbers indicating that more accurate
predictions are possible. However, the standard-error measure
doesn't indicate the extent to which the independent variable
explains variations in the dependent variable.
Coefficient of Determination
Like the standard error, this statistic gives an indication of how
well a linear-regression model serves as an estimator of values for
the dependent variable. It works by measuring the fraction of total
variation in the dependent variable that can be explained by
variation in the independent variable.
In this context, total variation is made up of two fractions:
1 = (explained variation / total variation) + (unexplained variation / total variation)
The coefficient of determination, or explained variation as a
percentage of total variation, is the first of these two terms. It
is sometimes expressed as 1 - (unexplained variation / total
variation).
For a simple linear regression with one independent variable, the
simple method for computing the coefficient of determination is
squaring the correlation coefficient between the dependent and
independent variables. Since the correlation coefficient is given
by r, the coefficient of determination is popularly known as R²,
or "R-squared". For example, if the correlation coefficient is 0.76,
the R-squared is (0.76)² = 0.578. R-squared terms are usually
expressed as percentages; thus 0.578 would be 57.8%. A second
method of computing this number would be to find the total
variation in the dependent variable Y as the sum of the squared
deviations from the sample mean. Next, find the unexplained variation
in Y as the sum of the squared residuals (the same squared-residual
terms used in computing the standard error of the estimate). The
coefficient of determination is then computed as
(total variation in Y - unexplained variation in Y) / total
variation in Y. This second method is necessary for multiple
regressions, where there is more than one independent variable, but
for our context we will be provided the r (correlation coefficient)
to calculate an R-squared.
What R² tells us is the proportion of the changes in the dependent
variable Y that are explained by changes in the independent variable X.
An R² of 57.8% tells us that 57.8% of the changes in Y are explained by
X; it also means that 1 - 57.8%, or 42.2%, of the changes in Y are
unexplained by X and are the result of other factors. So the higher the
R-squared, the better the predictive nature of the
linear-regression model.
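A short sketch showing both computations of the coefficient of determination follows; the r = 0.76 figure is the one used above, and the second method reuses the predicted/actual values from the SEE example (so the R² it prints belongs to that data set, not 57.8%):

```python
import numpy as np

# Method 1: square the correlation coefficient (single independent variable)
r = 0.76
print(round(r ** 2, 3))   # 0.578, i.e. 57.8% of the variation in Y is explained by X

# Method 2: (total variation - unexplained variation) / total variation,
# using the predicted/actual values from the SEE example above
predicted = np.array([5.5, 3.1, 0.7, 1.6, 6.5])
actual    = np.array([5.2, 2.7, 1.5, 3.1, 6.3])
total_variation       = np.sum((actual - actual.mean()) ** 2)
unexplained_variation = np.sum((actual - predicted) ** 2)
print(round((total_variation - unexplained_variation) / total_variation, 3))
```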
Regression Coefficients
For either regression coefficient (intercept a, or slope b), a
confidence interval can be determined with the following
information:
An estimated parameter value from a sample
Standard error of the estimated coefficient
Significance level for the t-distribution
Degrees of freedom (which is sample size - 2)
For a slope coefficient, the formula for the confidence interval is
given by b ± tc*sb, where sb is the standard error of the slope
estimate and tc is the critical t-value at our chosen significance level.
To illustrate, take a linear regression with a mutual fund's
returns as the dependent variable and the S&P 500 index as the
independent variable. For five years of quarterly returns, the
slope coefficient b is found to be 1.18, with a standard error of
0.147. Student's t-distribution for 18 degrees of
freedom (20 quarters - 2) at a 0.05 significance level is 2.101.
This data gives us a confidence interval of 1.18 ± (0.147)*(2.101),
or a range of 0.87 to 1.49. Our interpretation is that there is
only a 5% chance that the slope of the population is either less
than 0.87 or greater than 1.49 - we are 95% confident that this
fund is at least 87% as volatile as the S&P 500, but no more
than 149% as volatile, based on our five-year sample.
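The same confidence interval can be reproduced with scipy, using the sample slope and standard error quoted above:

```python
from scipy import stats

b  = 1.18           # estimated slope from the five-year sample
sb = 0.147          # standard error of the slope estimate
n  = 20             # quarterly observations

t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - 2)            # two-tailed critical value, ~2.101
low, high = b - t_crit * sb, b + t_crit * sb
print(round(t_crit, 3), round(low, 2), round(high, 2))  # ~2.101, 0.87, 1.49
```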
Hypothesis testing and Regression Coefficients
Regression coefficients are frequently tested using the
hypothesis-testing procedure. Depending on what the analyst is
intending to prove, we can test a slope coefficient to determine
whether it explains changes in the dependent variable, and the
extent to which it explains those changes. Betas (slope coefficients) can
be determined to be either above or below 1 (more volatile or less
volatile than the market). Alphas (the intercept coefficient) can
be tested on a regression between a mutual fund and the relevant
market index to determine whether there is evidence of a
sufficiently positive alpha (suggesting value added by the fund
manager).
The mechanics of hypothesis testing are similar to the examples we
have used previously. A null hypothesis is chosen based on a
not-equal-to, greater-than or less-than-case, with the alternative
satisfying all values not covered in the null case. Suppose in our
previous example where we regressed a mutual fund's returns on the
S&P 500 for 20 quarters our hypothesis is that this mutual fund
is more volatile than the market. A fund equal in volatility to the
market will have slope b of 1.0, so for this hypothesis test, we
state the null hypothesis (H0) as the case where the slope is less than
or equal to 1.0 (i.e. H0: b ≤ 1.0). The alternative hypothesis
Ha has b > 1.0. We know that this is a greater-than case (i.e.
one-tailed) - if we assume a 0.05 significance level, t is equal to
1.734 at degrees of freedom = n - 2 = 18.
Example: Interpreting a Hypothesis Test
From our sample, we had estimated b of 1.18 and standard error of
0.147. Our test statistic is computed with this formula: t =
(estimated coefficient - hypothesized coefficient) / standard error =
(1.18 - 1.0)/0.147 = 0.18/0.147, or t = 1.224.
For this example, our calculated test statistic is below the
rejection level of 1.734, so we are not able to reject the null
hypothesis that the fund is no more volatile than the market.
Interpretation: the hypothesis that b > 1 for this fund probably
needs more observations (degrees of freedom) to be proven with
statistical significance. Also, with 1.18 only slightly above 1.0,
it is quite possible that this fund is actually not as volatile as
the market, and we were correct to not reject the null
hypothesis.
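A minimal sketch of this one-tailed test, using the same numbers, is shown below:

```python
from scipy import stats

b, b0, sb, n = 1.18, 1.0, 0.147, 20

t_stat = (b - b0) / sb                           # (1.18 - 1.0) / 0.147 ≈ 1.224
t_crit = stats.t.ppf(1 - 0.05, df=n - 2)         # one-tailed critical value, ~1.734

print(round(t_stat, 3), round(t_crit, 3))
print("reject H0 (b <= 1.0)?", t_stat > t_crit)  # False: cannot conclude the fund is more volatile
```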
Example: Interpreting a regression coefficient
The CFA exam is likely to give the summary statistics of a linear
regression and ask for interpretation. To illustrate, assume the
following statistics for a regression between a small-cap growth
fund and the Russell 2000 index:
Correlation coefficient 0.864
Intercept -0.417
Slope 1.317
What do each of these numbers tell us?
About 75% of the variation in the fund is explained by changes in the
Russell 2000 index. This is true because the square of the
correlation coefficient, (0.864)² = 0.746, gives us the coefficient
of determination, or R-squared.
The fund will slightly underperform the index when index returns
are flat. This results from the value of the intercept being
-0.417. When X = 0 in the regression equation, the dependent
variable is equal to the intercept.
The fund will on average be more volatile than the index. This fact
follows from the slope of the regression line of 1.317 (i.e. for
every 1% change in the index, we expect the fund's return to change
by 1.317%).
The fund will outperform in strong market periods, and underperform
in weak markets. This fact follows from the regression. Additional
risk is compensated with additional reward, with the reverse being
true in down markets. Predicted values of the fund's return, given
a return for the market, can be found by solving for Y = -0.417 +
1.317X (X = Russell 2000 return).
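As a small illustration, the estimated equation can be wrapped in a helper function (a hypothetical helper, named only for this example) to generate predicted fund returns for different index returns:

```python
def predicted_fund_return(russell_2000_return: float) -> float:
    """Predicted small-cap fund return (%) given a Russell 2000 return (%)."""
    return -0.417 + 1.317 * russell_2000_return

# Underperforms when the index is flat or weak, outperforms in strong markets
for index_return in (-5.0, 0.0, 5.0):
    print(index_return, round(predicted_fund_return(index_return), 2))
```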
Analysis of Variance (ANOVA)
Analysis of variance, or ANOVA, is a procedure in which the total
variability of a random variable is subdivided into components so
that it can be better understood, or attributed to each of the
various sources that cause the number to vary.
Applied to regression parameters, ANOVA techniques are used to
determine the usefulness of a regression model, and the degree to
which changes in an independent variable X can be used to explain
changes in a dependent variable Y. For example, we can conduct a
hypothesis-testing procedure to determine whether slope
coefficients are equal to zero (i.e. the variables are unrelated),
or if there is statistical meaning to the relationship (i.e. the
slope b is different from zero). An F-test can be used for this
process.
F-Test
The formula for F-statistic in a regression with one independent
variable is given by the following:
Formula 2.41
F = mean regression sum of squares / mean squared error
= (RSS/1) / [SSE/(n - 2)]
The two abbreviations to understand are RSS and SSE:
RSS, or the regression sum of squares, is the amount of total
variation in the dependent variable Y that is explained in the
regression equation. The RSS is calculated by computing each
deviation between a predicted Y value and the mean Y value,
squaring the deviation and adding up all terms. If an independent
variable explains none of the variations in a dependent variable,
then the predicted values of Y are equal to the average value, and
RSS = 0.
SSE, or the sum of squared errors (the residual sum of squares), is
calculated by finding the deviation between each predicted Y and the
actual Y, squaring the result and adding up all terms.
TSS, or total variation, is the sum of RSS and SSE. In other
words, this ANOVA process breaks variance into two parts: one that
is explained by the model and one that is not. Essentially, for a
regression equation to have high predictive quality, we need to see
a high RSS and a low SSE, which will make the ratio (RSS/1)/[SSE/(n
- 2)] high and (based on a comparison with a critical F-value)
statistically meaningful. The critical value is taken from the
F-distribution and is based on degrees of freedom.
For example, with 20 observations, the degrees of freedom would be 1 and
n - 2 = 18, giving a critical value (from the F-table at the 0.05 level)
of about 4.41. If RSS were 2.5 and SSE were 1.8, then the computed test
statistic would be F = (2.5/1)/(1.8/18) = 25, which is above the
critical value, indicating that the regression equation has predictive
quality (b is different from 0).
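The F-statistic and the critical value for df = (1, 18) can be reproduced with scipy:

```python
from scipy import stats

rss, sse, n = 2.5, 1.8, 20

f_stat = (rss / 1) / (sse / (n - 2))             # (2.5/1) / (1.8/18) = 25
f_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)     # critical value for df = (1, 18), ~4.41

print(round(f_stat, 2), round(f_crit, 2))
print("slope significantly different from 0?", f_stat > f_crit)
```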
Estimating Economic Statistics with Regression Models
Regression models are frequently used to estimate economic
statistics such as inflation and GDP growth.
A linear regression line is usually determined quantitatively by a
best-fit procedure such as least
squares (i.e. the distance between the regression line and every
observation is minimized). In linear regression, one variable is
plotted on the X axis and the other on the Y. The X variable is
said to be the independent variable, and the Y is said to be the
dependent variable. When analyzing two random variables, you must
choose which variable is independent and which is dependent. The
choice of independent and dependent follows from the hypothesis -
for many examples, this distinction should be intuitive. The most
popular use of regression analysis is on investment returns, where
the market index is independent while the individual security or
mutual fund is dependent on the market. In essence, regression
analysis formulates a hypothesis that the movement in one variable
(Y) depends on the movement in the other (X).
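A minimal sketch of the least-squares calculation, using the closed-form formulas for slope and intercept on hypothetical market/fund return data (the numbers are assumptions for illustration only), is shown below:

```python
import numpy as np

# Hypothetical paired observations: X = market index return (%), Y = fund return (%)
x = np.array([2.0, -1.5, 3.2, 0.0, 1.1, -0.4, 2.5, 1.8])
y = np.array([2.6, -1.2, 3.9, 0.3, 1.5, -0.1, 3.1, 2.2])

# Closed-form least-squares estimates: minimize the sum of squared vertical distances
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print(round(a, 3), round(b, 3))   # intercept and slope of the best-fit line
```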
What assumptions are made when conducting a t-test?
A:
The common assumptions made when conducting a t-test include those regarding the scale of measurement, random sampling, normality of the data distribution, adequacy of the sample size, and equality of variance.
The T-Test
The t-test was developed by a chemist working for the Guinness brewing company as a simple way to measure the consistent quality of stout. It was further developed and adapted, and now refers to any test of a statistical hypothesis in which the statistic being tested for is expected to correspond to a t-distribution if the null hypothesis is supported.
A t-test is an analysis of two population means through the use of statistical examination; a t-test with two samples is commonly used with small sample sizes, testing the difference between the samples when the variances of the two normal distributions are not known.
T-distribution is basically any continuous probability distribution that arises from an estimation of the mean of a normally distributed population using a small sample size and an unknown standard deviation for the population. The null hypothesis is the default assumption that no relationship exists between two different measured phenomena.
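A small two-sample t-test sketch using scipy follows; the two simulated samples are assumptions for illustration only, and Welch's variant is used since equal variances are not assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample_a = rng.normal(loc=5.0, scale=1.0, size=12)   # small samples drawn from two populations
sample_b = rng.normal(loc=5.8, scale=1.2, size=12)

# Two-sample t-test of H0: the two population means are equal
# (equal_var=False gives Welch's test, which does not assume equal variances)
t_stat, p_value = stats.ttest_ind(sample_a, sample_b, equal_var=False)
print(round(t_stat, 3), round(p_value, 3))
```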
There are many variables that affect a single variable. For example,
in agriculture, humidity, fertilizer, pesticide, type of seed,
quantity of water, rainfall, weather conditions, the number of
tractors, and many other variables all have an effect. When you are
looking at the effect of one variable on another, you have to assume
ceteris paribus, that is, all other things held constant.
A variable can be strongly dependent on another, or it can be weakly
dependent; it depends on what your regression tells you. I have not
come across any range that defines weak, moderate or strong
dependence. In your case, however, the coefficient between the 2
variables is around 0.45, so you may call the relationship a strong
one, but do you really need to characterize the type of relationship
using statistical techniques?
Despite its popularity, interpretation of the regression coefficients of any but the simplest models is sometimes, well, difficult.
So let’s interpret the coefficients of a continuous and a categorical variable. Although the example here is a linear regression model, the approach works for interpreting coefficients from any regression model without interactions, including logistic and proportional hazards models.
A linear regression model with two predictor variables can be expressed with the following equation:
Y = B0 + B1*X1 + B2*X2 + e.
The variables in the model are:
Y, the response variable;
X1, the first predictor variable;
X2, the second predictor variable; and
e, the residual error, which is an unmeasured variable.
The parameters in the model are:
B0, the Y-intercept;
B1, the first regression coefficient; and
B2, the second regression coefficient.
One example would be a model of the height of a shrub (Y) based on the amount of bacteria in the soil (X1) and whether the plant is located in partial or full sun (X2).
Height is measured in cm, bacteria is measured in thousand per ml of soil, and type of sun = 0 if the plant is in partial sun and type of sun = 1 if the plant is in full sun.
Let’s say it turned out that the regression equation was estimated as follows:
Y = 42 + 2.3*X1 + 11*X2
Interpreting the Intercept
B0, the Y-intercept, can be interpreted as the value you would predict for Y if both X1 = 0 and X2 = 0.
We would expect an average height of 42 cm for shrubs in partial sun with no bacteria in the soil. However, this is only a meaningful interpretation if it is reasonable that both X1 and X2 can be 0, and if the data set actually included values for X1 and X2 that were near 0.
If neither of these conditions is true, then B0 really has no meaningful interpretation; it just anchors the regression line in the right place. In our case, it is easy to see that X2 is sometimes 0, but if X1, our bacteria level, never comes close to 0, then our intercept has no real interpretation.
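As a small illustration, the estimated shrub-height equation can be written as a function (a hypothetical helper, named only for this example) to show that the intercept is simply the prediction at X1 = 0 and X2 = 0:

```python
def predicted_height(bacteria: float, full_sun: int) -> float:
    """Predicted shrub height (cm); bacteria in thousands per ml, full_sun is 0 or 1."""
    return 42 + 2.3 * bacteria + 11 * full_sun

print(predicted_height(bacteria=0, full_sun=0))   # 42.0: the intercept B0
print(predicted_height(bacteria=10, full_sun=1))  # 76.0
```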