In: Statistics and Probability
How does the principle of parsimony apply to model selection?
Parsimonious models are simple models with great explanatory and
predictive power. They explain the data with a minimum number of
parameters. According to the principle of parsimony, model
selection methods should value both descriptive accuracy and
simplicity. The general principle of parsimonious data modeling
states that if two models adequately describe a given set of data,
the one with fewer parameters will have better predictive ability
on new data. This concept is of particular interest in multivariate
calibration, since several new non-linear modeling techniques have
become available.
There is a tradeoff between goodness of fit and parsimony: low
parsimony models (i.e., models with many parameters) tend to fit the
data better than high parsimony models. Adding more parameters
generally improves the fit to the data at hand, but that same model
will likely be useless for predicting other data sets.
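To make the tradeoff concrete, here is a minimal sketch in Python
(using NumPy; the quadratic trend, noise level, and polynomial
degrees are illustrative assumptions, not from any particular
study). A 16-parameter polynomial fits the training points better
than a 3-parameter quadratic, but predicts a fresh sample from the
same process worse:

import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Underlying quadratic trend (an assumed ground truth for illustration).
    return 1.0 + 2.0 * x - 3.0 * x**2

x_train = np.linspace(0, 1, 20)
x_test = np.linspace(0, 1, 200)
y_train = f(x_train) + rng.normal(scale=0.2, size=x_train.size)
y_test = f(x_test) + rng.normal(scale=0.2, size=x_test.size)

for degree in (2, 15):  # degree 15 is the low-parsimony model (may warn about conditioning)
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")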
Striking the right balance between parsimony and goodness of fit
can be challenging. Popular methods include Akaike's Information
Criterion (AIC), the Bayesian Information Criterion (BIC), Bayes
Factors, and Minimum Description Length (MDL), each of which
compares the quality of a set of candidate models. The AIC ranks
each candidate from best to worst; the most parsimonious model is
the one that neither under-fits nor over-fits. One downside is that
the AIC says nothing about absolute quality: if you input a series
of poor models, the AIC will simply choose the best of that
poor-quality set.
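As a hedged sketch of how such a ranking might be computed: for
least-squares fits with Gaussian errors, the AIC reduces (up to an
additive constant shared by all candidates) to n*ln(RSS/n) + 2k,
where RSS is the residual sum of squares and k the number of
estimated parameters. The polynomial candidates and simulated data
below are assumptions made purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)
n = x.size

def aic_for_degree(degree):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = degree + 1  # coefficient count; the shared error-variance term is omitted
    return n * np.log(rss / n) + 2 * k

scores = {d: aic_for_degree(d) for d in range(1, 8)}
for d, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"degree {d}: AIC = {score:.2f}")  # lowest AIC ranks best

On data like these, the quadratic (degree 2) typically wins:
higher-degree fits lower the RSS slightly but pay more in the 2k
penalty.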
The Bayesian Information Criterion (BIC) is computed much like the
AIC, although it tends to favor models with fewer parameters
because its penalty grows with sample size. It is similar in spirit
to the Likelihood Ratio Test, but the models being compared do not
have to be nested. Model selection based on Bayes Factors, which
compare models using prior distributions, is approximately
equivalent to BIC model selection; however, the BIC doesn't require
specifying priors, so it is often preferred.
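Continuing the same illustrative setup (again an assumption for
demonstration, not a prescribed procedure), the BIC swaps the AIC's
2k penalty for k*ln(n), which exceeds the AIC penalty once n is
greater than about 7, so the BIC leans toward models with fewer
parameters:

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.2, size=x.size)
n = x.size

for degree in range(1, 8):
    coeffs = np.polyfit(x, y, degree)
    rss = np.sum((np.polyval(coeffs, x) - y) ** 2)
    k = degree + 1
    aic = n * np.log(rss / n) + 2 * k          # AIC penalty: 2 per parameter
    bic = n * np.log(rss / n) + k * np.log(n)  # BIC penalty: ln(50), about 3.9 per parameter
    print(f"degree {degree}: AIC = {aic:.2f}, BIC = {bic:.2f}")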
Minimum Description Length (MDL) is commonly used in computer and
information science. It works on the principle that regularities in
the data can be used to compress it: the best model is the one that
gives the shortest total description of the model plus the data,
which in regression settings tends to reduce the number of
predictor variables.
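The following loose sketch illustrates the compression intuition,
using zlib as a stand-in compressor (the quantization scheme and
byte costs are ad hoc assumptions, not a rigorous MDL code). A
model that captures the data's regularity leaves residuals that
compress well, so its total description, parameters plus compressed
residuals, is short; an over-parameterized model pays more for its
parameters than it saves on the residuals:

import zlib
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 200)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.05, size=x.size)

def description_length(degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    # Quantize residuals coarsely, then compress the resulting byte string.
    quantized = np.clip(np.round(residuals * 100) + 128, 0, 255).astype(np.uint8)
    data_cost = len(zlib.compress(quantized.tobytes()))
    model_cost = 8 * len(coeffs)  # 8 bytes per float64 coefficient (a crude cost)
    return model_cost + data_cost

for degree in (0, 2, 15):
    print(f"degree {degree:2d}: total description ~ {description_length(degree)} bytes")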