What are the various components of the output and how is the output of value to understanding data in regression analysis?
Solution:
Judging model fit is tricky. The metrics used to assess it can take very different values depending on the type of data, so regression output has to be interpreted with care. The following metrics can be used to evaluate a regression model:
R Square (Coefficient of Determination) - This
metric gives the proportion of variance in the response that is
explained by the covariates in the model. For a least-squares fit
with an intercept it ranges between 0 and 1. Higher values are
usually desirable, but what counts as good depends on the data
quality and the domain. For example, if the data is noisy, you'd be
happy to accept a model with a low R². It's good practice to
consider adjusted R² rather than R² when judging model fit.
Adjusted R² - The problem with R² is that it
never decreases as you add variables to the model, regardless of
whether the new variable actually adds new information. To
overcome that, we use adjusted R², which penalizes the number of
predictors. It increases only when the newly added variable
improves the fit by more than would be expected by chance;
otherwise it stays about the same or decreases.
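
To make the difference between the two metrics concrete, here is a minimal sketch using the standard formulas R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). The function names and the data are made up for illustration and do not come from the regression output discussed in this answer.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical observations and predictions (illustrative only)
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(y_true, y_pred)
# Same R^2, but pretend it came from models with 1 and 2 predictors
adj_r2_one = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=1)
adj_r2_two = adjusted_r_squared(r2, n_obs=len(y_true), n_predictors=2)

print(r2, adj_r2_one, adj_r2_two)
```

For the same R², the adjusted value drops as the assumed number of predictors grows, which is the penalty described above.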
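F Statistic - It evaluates the overall
significance of the model. It is the ratio of the variance
explained by the model (per predictor) to the unexplained variance
(per residual degree of freedom). It compares the full model with
an intercept-only (no predictors) model. Its value can range from
zero to an arbitrarily large number. A higher F statistic is
stronger evidence that the predictors jointly explain the
response, but its significance depends on the degrees of freedom,
so check the associated p value.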
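
As a rough illustration, the overall F statistic can be written in terms of R² as F = (R² / p) / ((1 - R²) / (n - p - 1)), where n is the number of observations and p the number of predictors. The sketch below assumes that formula; the function name and the input values are hypothetical and not taken from the output discussed here.

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F statistic of a regression, computed from R^2."""
    explained = r2 / n_predictors                        # explained variance per predictor
    unexplained = (1 - r2) / (n_obs - n_predictors - 1)  # per residual degree of freedom
    return explained / unexplained


# Hypothetical model: R^2 = 0.75 with 3 predictors and 50 observations
f_value = f_statistic(r2=0.75, n_obs=50, n_predictors=3)
print(f_value)
```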
RMSE / MSE / MAE - The error metric is a crucial
evaluation number we must check. Since all of these measure error,
the lower the number, the better the model. Let's look at them one by one:
MSE - This is the mean
squared error. Because the errors are squared, it amplifies the
impact of outliers on the model's measured accuracy. For example,
suppose the actual y is 10 and the predicted y is 30; the squared
error for that observation would be (30-10)² = 400.
MAE - This is the mean
absolute error. It is less sensitive to outliers than MSE. Using
the previous example, the absolute error would be |30-10| = 20.
RMSE - This is the root mean
squared error. It can be interpreted as the typical distance of
the residuals from zero. Taking the square root undoes the squaring
in MSE and gives the result in the same units as the response.
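
The three error metrics can be sketched in a few lines. The helper names and the data below are illustrative assumptions; the first observation reuses the actual 10 / predicted 30 example from above so the outlier effect is visible.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical data: the first prediction is a large miss (10 vs 30)
y_true = [10.0, 12.0, 15.0, 11.0]
y_pred = [30.0, 12.5, 14.0, 11.5]

mse_value = mse(y_true, y_pred)
mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)

print(mse_value, mae_value, rmse_value)
```

The single large miss dominates MSE and RMSE far more than MAE, which is why MAE is often preferred when outliers should not drive the evaluation.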
Let's understand the regression output in detail:
Intercept - This is the β₀ value. It's the
prediction made by the model when all the independent variables
are set to zero.
Estimate - This represents the regression
coefficient for the respective variable, i.e. its slope.
Let's interpret it for Chord_Length, whose estimate is -35.69. We
can say that when Chord_Length is increased by 1 unit, holding the
other variables constant, Sound_pressure_level decreases by 35.69.
Std. Error - This measures the level of
variability associated with each estimate. The smaller the
standard error, the more precisely the coefficient is estimated.
t value - The t statistic is the estimate
divided by its standard error. It is generally used to determine
variable significance, i.e. whether a variable is significantly
adding information to the model. As a rule of thumb, |t| > 2
suggests the variable is significant at roughly the 5% level. I
treat it as optional, since the same information can be read from
the p value.
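p value - It's the probability of observing a
t value at least as extreme as the one obtained if the true
coefficient were zero. It indicates the significance of the
respective variable in the model. By convention, p value < 0.05
is taken as significant.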
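
To show how the columns of the coefficient table relate to each other, here is a minimal sketch for a regression with a single predictor. Everything in it is an assumption for illustration: the function name, the data, and in particular the p value, which uses a normal approximation from the standard library instead of the t distribution with n - 2 degrees of freedom that statistical software normally uses. The approximation is rough for small samples like this one.

```python
import math
from statistics import NormalDist


def simple_ols_summary(x, y):
    """Estimate, std. error, t value and approximate p value for y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual variance with n - 2 degrees of freedom
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)

    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1 / n + mean_x ** 2 / sxx))

    summary = {}
    for name, est, se in (("Intercept", intercept, se_intercept),
                          ("x", slope, se_slope)):
        t_value = est / se
        # ASSUMPTION: normal approximation to the two-sided t-test p value
        p_value = 2 * (1 - NormalDist().cdf(abs(t_value)))
        summary[name] = {"estimate": est, "std_error": se,
                         "t_value": t_value, "p_value": p_value}
    return summary


# Hypothetical data, roughly y = 2x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

summary = simple_ols_summary(x, y)
for name, row in summary.items():
    print(name, row)
```

In this made-up example the slope has a large t value and a tiny p value, while the intercept is close to zero relative to its standard error, so it would not be judged significant.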