Hello,
Regression analysis is a statistical technique for analysing and modelling the relationship between a dependent variable and one or more independent variables. The technique uses a mathematical equation to describe the relationship between the variables. It is a predictive modelling technique used for forecasting and for finding cause-and-effect (causal) relationships between the variables.
The equation of a straight line relating two variables x and y is given by y = a + bx.
The difference between the observed value of y and the fitted straight line is a statistical error ε. It is a random variable that accounts for the failure of the model to fit the data exactly.
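As a rough illustration of the line y = a + bx and the error ε described above, here is a minimal sketch in Python. It assumes the line is fitted by ordinary least squares (the usual choice, though the method is not named above), and the data values and the helper name fit_line are made up for the example.

```python
import numpy as np

def fit_line(x, y):
    """Return (a, b) for the least-squares line y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the line through the point (x_mean, y_mean)
    a = y_mean - b * x_mean
    return a, b

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(x, y)

# Residuals: observed y minus the fitted line, the sample counterpart of ε
residuals = np.asarray(y) - (a + b * np.asarray(x))

print(f"a = {a:.2f}, b = {b:.2f}")
print("residuals:", np.round(residuals, 2))
```

The residuals are what is left over after the line has explained as much of y as it can; they are the quantities the assumptions about ε are checked against.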
The major assumptions of regression analysis are as follows [4]:
i. The relationship between the response y and the regressor x is linear, at least approximately.
ii. The error term ε has zero mean.
iii. The error term ε has constant variance σ².
iv. The errors are uncorrelated.
v. The errors are normally distributed.
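The assumptions listed above are usually examined through the residuals of a fitted line. The sketch below is one informal way to do that, not a formal test procedure: it simulates data that satisfy the assumptions (the true line y = 1 + 2x, the sample size, and the seed are all arbitrary choices for the example) and computes a simple numeric summary for assumptions ii to v.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Simulated (hypothetical) data: true line y = 1 + 2x plus normal errors
n = 2000
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# Least-squares fit; np.polyfit returns the highest-degree coefficient first
b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)

# (ii) Zero mean: holds automatically for a least-squares fit with an intercept
mean_resid = resid.mean()

# (iii) Constant variance: compare residual spread in the low-x and high-x halves
order = np.argsort(x)
low = resid[order[: n // 2]]
high = resid[order[n // 2:]]
variance_ratio = high.var(ddof=1) / low.var(ddof=1)

# (iv) Uncorrelated errors: lag-1 correlation of residuals in observation order
lag1_corr = np.corrcoef(resid[:-1], resid[1:])[0, 1]

# (v) Normality: sample skewness should be close to 0 for normal errors
skewness = np.mean((resid - resid.mean()) ** 3) / resid.std() ** 3

print(f"mean residual  = {mean_resid:.2e}")
print(f"variance ratio = {variance_ratio:.2f}")
print(f"lag-1 corr     = {lag1_corr:.3f}")
print(f"skewness       = {skewness:.3f}")
```

A variance ratio far from 1, a lag-1 correlation far from 0, or a large skewness would each suggest that the corresponding assumption deserves a closer look, typically with residual plots.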
Outliers
Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier.
Each type of outlier is depicted graphically in the scatterplots below.
Influential Points
An influential point is an outlier that greatly affects the slope of the regression line. One way to test the influence of an outlier is to compute the regression equation with and without the outlier.
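The with-and-without comparison described above can be sketched in a few lines of Python. The data here are hypothetical (six points lying on the line y = 34 - 4x plus one extreme point) and are not the data behind any chart in this answer; the flag array is_outlier is simply set by hand.

```python
import numpy as np

# Hypothetical data: six points on y = 34 - 4x, plus one extreme point at x = 20
x = np.array([1, 2, 3, 4, 5, 6, 20], dtype=float)
y = np.array([30, 26, 22, 18, 14, 10, -20], dtype=float)
is_outlier = np.array([False] * 6 + [True])  # marked by hand for this example

# np.polyfit with degree 1 returns (slope, intercept) of the least-squares line
slope_with, intercept_with = np.polyfit(x, y, 1)
slope_without, intercept_without = np.polyfit(x[~is_outlier], y[~is_outlier], 1)

print(f"slope with outlier:    {slope_with:.2f}")
print(f"slope without outlier: {slope_without:.2f}")
```

If the two slopes differ substantially, as they do for this made-up data, the point is influential. How large a change counts as "substantial" is a judgment call that depends on the application.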
This type of analysis is illustrated below. The scatterplots are identical, except that one plot includes an outlier. When the outlier is present, the slope is flatter (-3.32 with the outlier vs. -4.10 without); so this outlier would be considered an influential point.
The charts below compare regression statistics for another data set with and without an outlier. Here, one chart has a single outlier, located at the high end of the X axis (where x = 24). As a result of that single outlier, the slope of the regression line changes greatly, from -2.5 to -1.6; so the outlier would be considered an influential point.
Sometimes an influential point will make the coefficient of determination bigger; sometimes, smaller. In the first example above, the coefficient of determination is smaller when the influential point is present (0.55 with the point vs. 0.94 without). In the second example, it is bigger (0.52 with the point vs. 0.46 without).
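To see that the coefficient of determination can move in either direction, here is a small sketch with two hypothetical data sets (they are invented for illustration and will not reproduce the values quoted above). The helper r_squared is an illustrative name; it computes R² as 1 minus the ratio of residual to total sum of squares for a least-squares line.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for the least-squares line of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b, a = np.polyfit(x, y, 1)
    ss_res = np.sum((y - (a + b * x)) ** 2)  # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    return 1.0 - ss_res / ss_tot

# Case A (hypothetical): points on a line, plus a far point that breaks the trend
xa = [1, 2, 3, 4, 5, 6]
ya = [30, 26, 22, 18, 14, 10]
r2_a_without = r_squared(xa, ya)
r2_a_with = r_squared(xa + [20], ya + [40])

# Case B (hypothetical): weak trend, plus a far point that lines up with it
xb = [1, 2, 3, 4]
yb = [2, 1, 2, 1]
r2_b_without = r_squared(xb, yb)
r2_b_with = r_squared(xb + [20], yb + [-10])

print(f"Case A: R^2 without = {r2_a_without:.2f}, with = {r2_a_with:.2f}")
print(f"Case B: R^2 without = {r2_b_without:.2f}, with = {r2_b_with:.2f}")
```

In case A the extra point contradicts the pattern and R² drops; in case B the extra point stretches the range of x in the direction of the existing trend and R² rises, even though the underlying points are no better described than before.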
If your data set includes an influential point, here are some things to consider.