You are a data science consultant! In each of the following cases, decide whether you would suggest a flexible regression model or an inflexible one. Provide your reasons as clearly as possible.
(a) In the study of breast cancer, a scientist is trying to find the genes associated with breast cancer. The total number of genes in the study is 50,000 and the number of patients is 120.
(b) The Ministry of Education in a certain country wants to identify students who need extra help. They wish to design a system which estimates student performance in the final 8th grade math exam based on their math, science and history grades in the 7th grade. To do this, they want to run a regression on the data from all the students who have graduated from the 8th grade in the last 10 years.
(c) Kelly is a very hardworking chemistry student and she has run an experiment to find a mathematical expression that relates the speed of corrosion of iron to the humidity and temperature of the environment, and the percentage of different elements in the alloy. Unfortunately, the lab that she is working in was established in 1967 and the equipment has not been changed since then. This has caused measurements to vary significantly between different experimental runs, even when the parameters were the same. She is skeptical about the quality of her measurements of the speed of corrosion.
(d) Kelly’s advisor won the Nobel Prize in chemistry and used the prize money to outfit the lab with the most modern equipment. Kelly ran her experiments again with the new equipment and now she can trust her numbers. However, her advisor believes that she should not expect the true relationship to be linear.
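(a) An inflexible regression model is the better suggestion. With 50,000 genes as predictors and only 120 patients, the number of predictors far exceeds the number of observations. A flexible model has enough freedom to fit the 120 observations almost perfectly by chance, so it would overfit, and the gene associations it reports would be unlikely to hold for new patients. A constrained model, for example a linear model combined with variable selection or regularization, keeps the variance under control and makes it easier to point to specific genes.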
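(b) A flexible regression model is a reasonable suggestion. There are only three predictors (the 7th grade math, science and history grades), and ten years of graduating students provide a very large sample. With that much data and so few predictors, a flexible model can capture a nonlinear relationship between the grades and the exam score with little risk of overfitting. An inflexible model, such as a linear regression, remains acceptable if the Ministry values an easily explained rule over prediction accuracy and if its assumptions (linearity of the relationship between the independent and dependent variables, homoskedasticity, normality of errors) look plausible in the data.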
(c) An inflexible regression model is the safer suggestion. The problem states that, because the equipment is old, the measurements vary significantly between experimental runs even when the parameters are the same. This means the error term has a large variance. A flexible model would follow that noise rather than the underlying relationship and would give a noticeably different fit each time the experiment is repeated. An inflexible model averages over the noise and gives a more stable estimate of how the speed of corrosion depends on humidity, temperature and alloy composition.
(d) A flexible regression model is appropriate here. With the new equipment the measurement errors are small, so there is little risk of the model fitting noise, and the advisor expects the relationship between the dependent and independent variables to be nonlinear. An inflexible regression model assumes that this relationship is linear, so it would be systematically biased and is not appropriate in this situation.