For this problem you will be creating your own linear model for data of your choosing. Please do the following:
• Find a data set that you believe to be linear. You can measure and collect your own data or search the internet. Be sure to cite where you get your data.
• Plot the data on a coordinate plane. Include a link to this graph in your submission of this assignment.
• Approximate a slope and intercept for your data, then write the equation for the line. You can plot the line on the graph of your data (see the sketch after this list).
• Reflect on the process and tell me about your solution. Is it accurate? Can it predict unknown values? What are its limitations?
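If it helps to check a hand-drawn fit, here is a minimal Python sketch of the plot-and-fit step, assuming numpy and matplotlib are available; the (x, y) values are placeholders rather than real data, so substitute your own collected or cited measurements.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data -- replace with your own collected or cited data set
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

# Approximate the slope and intercept with a least-squares line (degree-1 fit)
slope, intercept = np.polyfit(x, y, deg=1)
print(f"y = {slope:.2f}x + {intercept:.2f}")

# Plot the data and overlay the fitted line
plt.scatter(x, y, label="data")
plt.plot(x, slope * x + intercept, color="red", label="fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()
```

Comparing the printed slope and intercept with your eyeballed values is a quick sanity check on your hand-fit equation.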
Model specification is the process of determining which independent variables to include in and exclude from a regression equation. How do you choose the best regression model? The world is complicated, and trying to explain it with a small sample doesn’t help. In this post, I’ll show you how to select the correct model. I’ll cover statistical methods and the difficulties that can arise, and provide practical suggestions for selecting your model. Often, the variable selection process is a mixture of statistics, theory, and practical knowledge.
The need for model selection often begins when a researcher wants to mathematically define the relationship between independent variables and the dependent variable. Typically, investigators measure many variables but include only some in the model. Analysts try to exclude independent variables that are not related and include only those that have an actual relationship with the dependent variable. During the specification process, analysts typically try different combinations of variables and various forms of the model. For example, they can try different terms that explain interactions between variables and curvature in the data.
Analysts need to strike a Goldilocks balance by including the correct number of independent variables in the regression equation.
Too few: Underspecified models tend to be biased (the short simulation after this list illustrates why).
Too many: Overspecified models tend to be less precise.
Just right: Models with the correct terms are unbiased and the most precise.
To avoid biased results, your regression equation should contain any independent variables that you are specifically testing as part of the study, plus other variables that affect the dependent variable.
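To make the underspecification problem concrete, here is a small Python simulation sketch, assuming numpy and statsmodels; the data-generating coefficients are invented for illustration. Omitting a predictor that is correlated with an included one biases the included coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)              # x2 is correlated with x1
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)    # true effect of x1 is 2

# Full model includes both predictors; the underspecified model omits x2
full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
under = sm.OLS(y, sm.add_constant(x1)).fit()

print("Full model coefficient on x1:", round(full.params[1], 2))        # near 2
print("Underspecified coefficient on x1:", round(under.params[1], 2))   # inflated, near 4
```

The underspecified fit attributes part of x2's effect to x1, which is exactly the bias the "too few" case warns about.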
Statistical Methods for Model Specification
You can use statistical assessments during the model specification process. Various metrics and algorithms can help you determine which independent variables to include in your regression equation. I review some standard approaches to model selection below, but please click the links to read my more detailed posts about them.
Adjusted R-squared and Predicted R-squared: Typically, you want to select models that have larger adjusted and predicted R-squared values. These statistics can help you avoid the fundamental problem with regular R-squared—it always increases when you add an independent variable. This property tempts you into specifying a model that is too complex, which can produce misleading results.
Adjusted R-squared increases only when a new variable improves the model by more than chance. Low-quality variables can cause it to decrease.
Predicted R-squared is a cross-validation method that can also decrease. Cross-validation partitions your data to determine whether the model is generalizable outside of your dataset.
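As a rough illustration of how these statistics behave, here is a minimal Python sketch assuming numpy and statsmodels; the simulated data include one noise predictor, and predicted R-squared is computed from the PRESS statistic, which is one common way to obtain it.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)                          # genuinely related predictor
x2 = rng.normal(size=n)                          # pure noise predictor
y = 3 + 2 * x1 + rng.normal(scale=1.5, size=n)

X = sm.add_constant(np.column_stack([x1, x2]))
model = sm.OLS(y, X).fit()

# Predicted R-squared from the PRESS statistic:
# PRESS = sum((residual / (1 - leverage))^2), pred R^2 = 1 - PRESS / SS_total
influence = model.get_influence()
press = np.sum((model.resid / (1 - influence.hat_matrix_diag)) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
pred_r2 = 1 - press / ss_total

print(f"R-squared:           {model.rsquared:.3f}")
print(f"Adjusted R-squared:  {model.rsquared_adj:.3f}")
print(f"Predicted R-squared: {pred_r2:.3f}")
```

Adding the noise predictor nudges regular R-squared upward, while the adjusted and predicted versions are penalized for it.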
P-values for the independent variables: In regression, p-values less than the significance level indicate that the term is statistically significant. “Reducing the model” is the process of including all candidate variables in the model and then repeatedly removing the single term with the highest non-significant p-value until your model contains only significant terms.
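Here is one way that reduction loop might look in Python, as a sketch assuming pandas and statsmodels; the function name, the 0.05 significance level, and the DataFrame layout (a response column plus candidate predictor columns) are assumptions for illustration.

```python
import pandas as pd
import statsmodels.api as sm

def reduce_model(df: pd.DataFrame, response: str, alpha: float = 0.05):
    """Backward elimination: drop the least significant term until all remain significant."""
    predictors = [c for c in df.columns if c != response]
    while predictors:
        X = sm.add_constant(df[predictors])
        fit = sm.OLS(df[response], X).fit()
        pvalues = fit.pvalues.drop("const")      # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:              # every remaining term is significant
            return fit, predictors
        predictors.remove(worst)                 # remove the least significant term and refit
    return None, []
```

Calling reduce_model(df, "y") would return the final fitted model and the surviving predictors, or (None, []) if nothing remains significant.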
Stepwise regression and Best subsets regression: These two automated model selection procedures pick the variables to include in your regression equation. They can be helpful when you have many independent variables and need some help in the investigative stages of the variable selection process. Both procedures can provide the Mallows’ Cp statistic, which helps you balance the tradeoff between precision and bias.
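Best subsets regression can be sketched as a brute-force search scored by Mallows’ Cp; the Python code below is an assumption about how you might implement that with statsmodels for a modest number of candidates, since it refits every subset and does not scale to many predictors.

```python
from itertools import combinations
import pandas as pd
import statsmodels.api as sm

def best_subsets_cp(df: pd.DataFrame, response: str):
    """Fit every subset of candidate predictors and rank them by Mallows' Cp."""
    candidates = [c for c in df.columns if c != response]
    y = df[response]

    # The full model's mean squared error is the reference for Mallows' Cp
    full_fit = sm.OLS(y, sm.add_constant(df[candidates])).fit()
    mse_full = full_fit.mse_resid
    n = len(df)

    results = []
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            fit = sm.OLS(y, sm.add_constant(df[list(subset)])).fit()
            p = k + 1                                  # parameters, including the intercept
            cp = fit.ssr / mse_full - (n - 2 * p)      # Mallows' Cp
            results.append((subset, cp))

    # Subsets whose Cp is close to p tend to balance bias against precision
    return sorted(results, key=lambda item: item[1])
```

Whichever procedure you use, treat its output as a shortlist to weigh against theory and practical knowledge rather than as the final answer.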