Question

In: Statistics and Probability

Write down the regression model form with two quantitative inputs and one qualitative input with three discrete levels. For each discrete level, we wish to have a complete second order model of the quantitative variables.

Solutions

Expert Solution
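
As a direct answer to the question above (a sketch, assuming 0/1 dummy coding; the symbols below are my notation, not part of the original question): let $x_1$ and $x_2$ be the quantitative inputs, and let $d_1$ and $d_2$ be indicator variables for the second and third levels of the qualitative input, so level 1 is the baseline with $d_1 = d_2 = 0$. A complete second-order model in $x_1$ and $x_2$ at each level can then be written as

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2
    + d_1\,(\gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 x_1^2 + \gamma_4 x_2^2 + \gamma_5 x_1 x_2)
    + d_2\,(\delta_0 + \delta_1 x_1 + \delta_2 x_2 + \delta_3 x_1^2 + \delta_4 x_2^2 + \delta_5 x_1 x_2)
    + \varepsilon
$$

Each level gets its own intercept, linear, pure quadratic, and interaction terms (18 parameters in total), so the fitted second-order surface is free to differ completely across the three levels. The overview below puts this OLS-style model in the broader context of regression types.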


Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data.

I’ll provide an overview along with information to help you choose. I organize the types of regression by the different kinds of dependent variable. If you’re not sure which procedure to use, determine which type of dependent variable you have, and then focus on that section in this post. This process should help narrow the choices! I’ll cover regression models that are appropriate for dependent variables that measure continuous, categorical, and count data.

Related post: Guide to Data Types and How to Graph Them

Regression Analysis with Continuous Dependent Variables

Regression analysis with a continuous dependent variable is probably the first type that comes to mind. While this is the primary case, you still need to decide which one to use.

Continuous variables are a measurement on a continuous scale, such as weight, time, and length.

Linear regression

OLS produces the fitted line that minimizes the sum of the squared differences between the data points and the line.

Linear regression, also known as ordinary least squares (OLS) and linear least squares, is the real workhorse of the regression world. Use linear regression to understand the mean change in a dependent variable given a one-unit change in each independent variable. You can also use polynomials to model curvature and include interaction effects. Despite the term “linear model,” this type can model curvature.

This analysis estimates parameters by minimizing the sum of the squared errors (SSE). Linear models are the most common and most straightforward to use. If you have a continuous dependent variable, linear regression is probably the first type you should consider.
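
As a minimal sketch of how the model form requested in the question could be fit with OLS (the data and the column names x1, x2, level below are my own illustration, not from the original post), a statsmodels formula can generate the dummy variables and second-order terms automatically:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: two quantitative inputs (x1, x2) and a three-level factor (level).
    rng = np.random.default_rng(0)
    n = 90
    df = pd.DataFrame({
        "x1": rng.normal(size=n),
        "x2": rng.normal(size=n),
        "level": rng.choice(["A", "B", "C"], size=n),
    })
    df["y"] = 2 + df["x1"] - 0.5 * df["x2"] ** 2 + rng.normal(size=n)

    # C(level) * (...) crosses the level dummies with every second-order term,
    # so each level gets its own complete second-order surface (18 parameters).
    model = smf.ols("y ~ C(level) * (x1 + x2 + I(x1**2) + I(x2**2) + x1:x2)",
                    data=df).fit()
    print(model.summary())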

There are some special options available for linear regression.

  • Fitted line plots: If you have one independent variable and the dependent variable, use a fitted line plot to display the data along with the fitted regression line and essential regression output. These graphs make understanding the model more intuitive.

  • Stepwise regression and Best subsets regression: These automated methods can help identify candidate variables early in the model specification process.

Advanced types of linear regression

Linear models are the oldest type of regression, designed so that statisticians could do the calculations by hand. However, OLS has several weaknesses, including sensitivity to both outliers and multicollinearity, and it is prone to overfitting. To address these problems, statisticians have developed several advanced variants:

  • Ridge regression allows you to analyze data even when severe multicollinearity is present and helps prevent overfitting. This type of model reduces the large, problematic variance that multicollinearity causes by introducing a slight bias in the estimates. The procedure trades away much of the variance in exchange for a little bias, which produces more useful coefficient estimates when multicollinearity is present (a brief fitting sketch follows this list).
  • Lasso regression (least absolute shrinkage and selection operator) performs variable selection that aims to increase prediction accuracy by identifying a simpler model. It is similar to Ridge regression but adds variable selection.
  • Partial least squares (PLS) regression is useful when you have very few observations compared to the number of independent variables or when your independent variables are highly correlated. PLS reduces the independent variables to a smaller number of uncorrelated components, similar to Principal Components Analysis. Then, the procedure performs linear regression on these components rather than on the original data. PLS emphasizes developing predictive models and is not used for screening variables. Unlike OLS, you can include multiple continuous dependent variables. PLS uses the correlation structure to identify smaller effects and to model multivariate patterns in the dependent variables.
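
Not from the original post, but as a minimal sketch of how the ridge and lasso penalties change the fit in practice (using scikit-learn; the data below are made up to mimic severe multicollinearity):

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge, Lasso

    # Made-up predictors: the first two columns are nearly identical,
    # which mimics severe multicollinearity.
    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=n), rng.normal(size=n)])
    y = 3.0 * x1 + rng.normal(size=n)

    for name, est in [("OLS", LinearRegression()),
                      ("Ridge", Ridge(alpha=1.0)),   # shrinks coefficients toward zero
                      ("Lasso", Lasso(alpha=0.1))]:  # shrinks and can zero out coefficients
        est.fit(X, y)
        print(name, np.round(est.coef_, 2))

The OLS coefficients on the two near-duplicate columns are typically large and unstable, while the penalized fits keep them small, which is the variance-for-bias trade described above.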

Nonlinear regression

Nonlinear regression also requires a continuous dependent variable, but it provides greater flexibility for fitting curves than linear regression.

Like OLS, nonlinear regression estimates the parameters by minimizing the SSE. However, nonlinear models use an iterative algorithm rather than the linear approach of solving directly with matrix equations. What this means for you is that you need to worry about which algorithm to use, how to specify good starting values, and the possibility of either not converging on a solution or converging on a local minimum rather than the global minimum SSE. And that's in addition to specifying the correct functional form!
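
As a hedged illustration (the functional form and data below are invented for the sketch, not the electron-mobility model referenced below), nonlinear least squares requires you to supply the functional form and starting values yourself:

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical functional form: y = a * (1 - exp(-b * x))
    def f(x, a, b):
        return a * (1.0 - np.exp(-b * x))

    rng = np.random.default_rng(2)
    x = np.linspace(0, 10, 50)
    y = f(x, 5.0, 0.7) + 0.2 * rng.normal(size=x.size)

    # p0 supplies the starting values; poor choices can stall the iterative fit
    # or leave it at a local minimum of the SSE.
    popt, pcov = curve_fit(f, x, y, p0=[1.0, 0.1])
    print(popt)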

Nonlinear model of electron mobility by density.

Most nonlinear models have one continuous independent variable, but it is possible to have more than one. When you have one independent variable, you can graph the results using a fitted line plot.

My advice is to fit a model using linear regression first and then determine whether the linear model provides an adequate fit by checking the residual plots. If you can’t obtain a good fit using linear regression, then try a nonlinear model because it can fit a wider variety of curves. I always recommend that you try OLS first because it is easier to perform and interpret.

I’ve written quite a bit about the differences between linear and nonlinear models. Read the following posts to learn the differences between these two types, how to choose which one is best for your data, and how to interpret the results.

  • What is the Difference Between Linear and Nonlinear Models?
  • How to Choose Between Linear and Nonlinear Regression?
  • Curve Fitting with Linear and Nonlinear Regression

Regression Analysis with Categorical Dependent Variables

So far, we've looked at models that require a continuous dependent variable. Next, let's move on to categorical dependent variables. A categorical variable has values that you can put into a countable number of distinct groups based on a characteristic. Logistic regression transforms the dependent variable and then uses Maximum Likelihood Estimation, rather than least squares, to estimate the parameters.

Logistic regression describes the relationship between a set of independent variables and a categorical dependent variable. Choose the type of logistic model based on the type of categorical dependent variable you have.

Binary Logistic Regression

Use binary logistic regression to understand how changes in the independent variables are associated with changes in the probability of an event occurring. This type of model requires a binary dependent variable. A binary variable has only two possible values, such as pass and fail.
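
A sketch of the underlying model, with $p$ denoting the probability of the event and $x_1, \dots, x_k$ the independent variables:

$$
\ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
$$

The log-odds (logit) is linear in the parameters, and the coefficients are estimated by maximum likelihood rather than least squares.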

Example: Political scientists assess the odds of the incumbent U.S. President winning reelection based on stock market performance.

Read my post about a binary logistic model that estimates the probability of House Republicans belonging to the Freedom Caucus.

Ordinal Logistic Regression

Ordinal logistic regression models the relationship between a set of predictors and an ordinal response variable. An ordinal response has at least three groups which have a natural order, such as hot, medium, and cold.

Example: Market analysts want to determine which variables influence the decision to buy large, medium, or small popcorn at the movie theater.

Nominal Logistic Regression

Nominal logistic regression models the relationship between a set of independent variables and a nominal dependent variable. A nominal variable has at least three groups which do not have a natural order, such as scratch, dent, and tear.

Example: A quality analyst studies the variables that affect the odds of the type of product defects: scratches, dents, and tears.

Regression Analysis with Count Dependent Variables

If your dependent variable is a count of items, events, results, or activities, you might need to use a different type of regression model. Counts are nonnegative integers (0, 1, 2, etc.). Count data with higher means tend to be approximately normally distributed, and you can often use OLS. However, count data with smaller means can be skewed, and linear regression might have a hard time fitting these data. For these cases, there are several types of models you can use.

Poisson regression

Count data frequently follow the Poisson distribution, which makes Poisson regression a good possibility. Poisson variables are a count of something over a constant amount of time, area, or another consistent length of observation. With a Poisson variable, you can calculate and assess a rate of occurrence. A classic example of a Poisson dataset was provided by Ladislaus Bortkiewicz, a Russian economist, who analyzed annual deaths caused by horse kicks in the Prussian Army from 1875 to 1894.

Use Poisson regression to model how changes in the independent variables are associated with changes in the counts. Poisson models are similar to logistic models because they use Maximum Likelihood Estimation and transform the dependent variable using the natural log. Poisson models can be suitable for rate data, where the rate is a count of events divided by a measure of that unit’s exposure (a consistent unit of observation). For example, homicides per month.
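
A sketch of the log-link form, with $\mu$ the expected count; for rate data, the exposure enters as an offset:

$$
\ln(\mu) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \ln(\text{exposure})
$$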

Example: An analyst uses Poisson regression to model the number of calls that a call center receives daily.

Alternatives to Poisson regression for count data

Not all count data follow the Poisson distribution because this distribution has some stringent restrictions. Fortunately, there are alternative analyses you can perform when you have count data.

Negative binomial regression: Poisson regression assumes that the variance equals the mean. When the variance is greater than the mean, your model has overdispersion. A negative binomial model, also known as NB2, can be more appropriate when overdispersion is present.
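
A sketch of the variance assumptions that separate the two models, with $\mu$ the mean count and $\alpha$ the NB2 dispersion parameter:

$$
\text{Poisson: } \operatorname{Var}(Y) = \mu, \qquad \text{NB2: } \operatorname{Var}(Y) = \mu + \alpha\mu^2
$$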

Zero-inflated models: Your count data might have too many zeros to follow the Poisson distribution. In other words, there are more zeros than the Poisson regression predicts. Zero-inflated models assume that two separate processes work together to produce the excessive zeros. One process determines whether there are zero events or more than zero events. The other is the Poisson process that determines how many events occur, some of which can be zero. An example makes this clearer!

Suppose park rangers count the number of fish caught by each park visitor as they exit the park. A zero-inflated model might be appropriate for this scenario because there are two processes for catching zero fish:

  • Some park visitors catch zero fish because they did not go fishing.
  • Other visitors went fishing, and some of these people caught zero fish.
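
A sketch of how the two processes combine in a zero-inflated Poisson model, with $\pi$ the probability of a structural zero (did not fish) and $\mu$ the Poisson mean for those who did:

$$
P(Y = 0) = \pi + (1-\pi)\,e^{-\mu}, \qquad P(Y = k) = (1-\pi)\,\frac{\mu^k e^{-\mu}}{k!} \ \text{ for } k \ge 1
$$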

Whew! That’s many different types of regression analysis! If you’re trying to figure out which one to choose, I hope you will use this information to point yourself in the right direction!

If you’re learning regression and like the approach I use in my blog, check out my eBook!

Logistic regression is a powerful statistical way of modeling a binomial outcome (a variable that takes the value 0 or 1, such as having or not having a disease) with one or more explanatory variables.

ADVANTAGES

I can see two main advantages of logistic regression over the chi-squared or Fisher's exact test. The first is that you can include more than one explanatory (independent) variable, and those can be dichotomous, ordinal, or continuous. The second is that logistic regression provides a quantified value for the strength of the association after adjusting for other variables (it removes confounding effects). The exponentials of the coefficients correspond to odds ratios for the given factors.
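
For a coefficient $\beta_j$ on variable $x_j$, that relationship is

$$
\text{OR}_j = e^{\beta_j},
$$

i.e., a one-unit increase in $x_j$ multiplies the odds of the outcome by $e^{\beta_j}$, holding the other variables fixed.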

DISADVANTAGE

1) You need enough participants for each possible combination of explanatory variables. Adding interaction terms or factors that are rare therefore considerably reduces the power of the analysis. This has to be considered carefully at the planning phase to make sure the sample size is large enough.

2) If you are using an explanatory variable that is not dichotomous, you need to test the assumption of linearity (on the logit scale) before including it in the model. This can be done by first creating dummy variables for each value of an ordinal variable, or by cutting a continuous variable into categories and then using those as dummy variables. A likelihood ratio test can then be used to check whether the model that assumes linearity fits as well as the one that does not. Keeping the linear form, when it is justified, has the major advantage of increasing the power of your analysis, although it can require some transformation of the variable.
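
A minimal sketch of that check (the data, the 4-level ordinal predictor dose, and the outcome y below are hypothetical), fitting nested logistic models with statsmodels and comparing them with a likelihood ratio test:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import chi2

    # Hypothetical data: a binary outcome and an ordinal predictor with levels 0-3.
    rng = np.random.default_rng(3)
    n = 400
    dose = rng.integers(0, 4, size=n)
    p = 1.0 / (1.0 + np.exp(-(-1.0 + 0.6 * dose)))
    df = pd.DataFrame({"y": rng.binomial(1, p), "dose": dose})

    linear = smf.logit("y ~ dose", data=df).fit(disp=0)      # assumes linearity in the logit
    dummies = smf.logit("y ~ C(dose)", data=df).fit(disp=0)  # one dummy per level

    # Likelihood ratio test: does relaxing the linearity assumption improve the fit?
    lr_stat = 2 * (dummies.llf - linear.llf)
    df_diff = dummies.df_model - linear.df_model
    print("LR statistic:", lr_stat, "p-value:", chi2.sf(lr_stat, df_diff))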

3) Logistic regression can combine both the binomial and normal distributions (for example, when random effects are included), and this can sometimes cause numerical problems. A quadrature check can be used to verify that these problems did not occur; relative differences must be below 0.01 (1%) for all parameters.

4) Defining the variables to enter in the model, and adding or removing explanatory variables, can be complicated and must be carefully planned. Avoid strong collinearity between variables, as this will cause over-adjustment. Identify potential candidates using univariate analysis with a p-value threshold above the one you wish to use in the final model, as negative confounding can occur. When necessary, consider introducing interaction terms if you believe some factors might modify the effects of others on your outcome.

