Question

In this assignment, we will use the dataset collected in Baystate Medical Center, Springfield, Mass (1986), featured in Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: Wiley. To download this dataset, go into R and run the following code to save the data as the object “data1”.

install.packages("MASS")

library(MASS)

data1 <- birthwt

Description of birthwt Data

low indicator of birth weight less than 2.5 kg.

age mother's age in years.

lwt mother's weight in pounds at the last menstrual period.

race mother's race (1 = white, 2 = black, 3 = others).

smoke smoking status during pregnancy.

ptl the number of previous premature labours.

ht history of hypertension.

ui presence of uterine irritability.

ftv the number of physician visits during the first trimester.

bwt birth weight in grams.

(a) Use R to help construct a final (best fit) model for the birth weight data, with bwt as the Y variable.
(i) Write down the model equation.
(ii) Use R to plot studentized residuals against predicted values and X variables. Discuss what you see.

(b) Verify that regression model assumptions are met. Do you think there is a need to do any adjustments/transformations? If yes, what do you suggest?

Expert Solution

a)

i)

Final Regression Equation:

bwt = 3586.50 - 1139.20 * low - 97.34 * race - 157.42 * smoke - 303.19 * ui
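As a quick check of how the equation is used, the sketch below refits the final model and predicts the birth weight for one hypothetical mother. The object name new_mother and its values (low = 0, race = 1, smoke = 1, ui = 0) are illustrative assumptions, not rows from the data. By hand, the rounded equation gives 3586.50 - 97.34 - 157.42, or about 3332 g, for this case.

```r
library(MASS)  # provides the birthwt data set

# Refit the final model reported in the equation
model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# Coefficients rounded to two decimals, in the order used in the equation
round(coef(model2), 2)

# Hypothetical mother (illustrative values, not taken from the data):
# baby not low birth weight, race coded 1, smoker, no uterine irritability
new_mother <- data.frame(low = 0, race = 1, smoke = 1, ui = 0)

# Predicted birth weight in grams from the fitted equation
predicted_bwt <- predict(model2, newdata = new_mother)
predicted_bwt
```

Using predict() rather than typing the coefficients avoids rounding differences between the printed equation and the stored estimates.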

ii)

Ideally, the residual plot shows no systematic pattern: the smoothed red trend line should be approximately horizontal at zero. A visible pattern may indicate a problem with some aspect of the linear model.

In our example, the residual plot does show some pattern, which suggests that a linear relationship between the predictors and the outcome variable cannot be assumed.
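The plots requested in part (ii) could be produced along the following lines. This is a minimal sketch: it uses base R's rstudent() for externally studentized residuals (MASS::studres is an alternative), and the panel layout and axis labels are choices made here rather than part of the original solution.

```r
library(MASS)  # provides the birthwt data set

model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# Externally studentized residuals, one per observation
stud_res <- rstudent(model2)

predictors <- c("low", "race", "smoke", "ui")

# One panel for the fitted values plus one per predictor
old_par <- par(mfrow = c(2, 3))

plot(fitted(model2), stud_res,
     xlab = "Fitted bwt (g)", ylab = "Studentized residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)  # reference line at zero

for (v in predictors) {
  plot(birthwt[[v]], stud_res,
       xlab = v, ylab = "Studentized residual",
       main = paste("Residuals vs", v))
  abline(h = 0, lty = 2)
}

par(old_par)  # restore the previous plotting layout
```

Because all four predictors take only a few distinct values, the predictor panels appear as vertical strips; what matters is whether the centre and spread of the residuals differ between strips.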

b)

The diagnostic plots show residuals in four different ways:

Residuals vs Fitted. Used to check the linearity assumption. A horizontal line without distinct patterns indicates a linear relationship, which is good.
Normal Q-Q. Used to examine whether the residuals are normally distributed. It is good if the residual points follow the straight dashed line.
Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis.
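The four plots described above can be drawn on one page with plot() on the fitted model, as sketched below. The Cook's distance cutoff of 4/n used to list possibly influential cases is a common rule of thumb assumed here for illustration; it is not part of the original solution.

```r
library(MASS)  # provides the birthwt data set

model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# 2 x 2 grid: residuals vs fitted (1), normal Q-Q (2),
# scale-location (3), residuals vs leverage (5)
old_par <- par(mfrow = c(2, 2))
plot(model2, which = c(1, 2, 3, 5))
par(old_par)  # restore the previous plotting layout

# Numeric companion to the leverage plot: cases whose Cook's distance
# exceeds 4/n (rule-of-thumb threshold, an assumption of this sketch)
cooks_d <- cooks.distance(model2)
influential <- which(cooks_d > 4 / nobs(model2))
influential
```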

Normal Q-Q

The Normal Q-Q plot shows that the residuals are not normally distributed.
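A formal test can complement the visual reading of the Q-Q plot. The sketch below applies the Shapiro-Wilk test to the residuals; choosing this particular test is an assumption of the example, and the plot remains the primary diagnostic.

```r
library(MASS)  # provides the birthwt data set

model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# Q-Q plot of standardized residuals with a reference line
std_res <- rstandard(model2)
qqnorm(std_res, main = "Normal Q-Q of standardized residuals")
qqline(std_res, lty = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(residuals(model2))
sw
```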

Homogeneity of variance

This assumption can be checked by examining the scale-location plot, also known as the spread-location plot.

plot(model2, 3)

It can be seen that the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (heteroscedasticity).
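The visual impression of non-constant variance can be checked numerically. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R: the squared residuals are regressed on the predictors, and n times the R-squared of that auxiliary regression is compared with a chi-squared distribution. The hand-rolled calculation is an illustration; lmtest::bptest is a packaged alternative if that package is installed.

```r
library(MASS)  # provides the birthwt data set

model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# Auxiliary regression of squared residuals on the same predictors
sq_res <- residuals(model2)^2
aux <- lm(sq_res ~ low + race + smoke + ui, data = birthwt)

# Test statistic: n * R^2, chi-squared with one df per predictor
bp_stat <- nobs(aux) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(statistic = bp_stat, df = bp_df, p.value = bp_p)
```

A small p-value would support the reading of the scale-location plot that the error variance is not constant.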

A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).
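The suggested transformations can be tried directly, as in the sketch below, which refits the model with log(bwt) and sqrt(bwt) and compares the scale-location plots side by side. The Box-Cox profile from MASS::boxcox is added here as one further way to choose a power transformation; it is an assumption of this example and not part of the original answer.

```r
library(MASS)  # provides birthwt and boxcox()

model2     <- lm(bwt ~ low + race + smoke + ui, data = birthwt)
model_log  <- lm(log(bwt) ~ low + race + smoke + ui, data = birthwt)
model_sqrt <- lm(sqrt(bwt) ~ low + race + smoke + ui, data = birthwt)

# Scale-location plots for the original and transformed responses
old_par <- par(mfrow = c(1, 3))
plot(model2, which = 3, main = "bwt")
plot(model_log, which = 3, main = "log(bwt)")
plot(model_sqrt, which = 3, main = "sqrt(bwt)")
par(old_par)  # restore the previous plotting layout

# Box-Cox profile over the default grid of lambda values (-2 to 2);
# lambda near 0 points to a log, near 0.5 to a square root,
# and near 1 to no transformation
bc <- boxcox(bwt ~ low + race + smoke + ui, data = birthwt, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

Whichever transformation is chosen, the diagnostic plots should be re-examined on the refitted model, since a transformation that stabilizes the variance can also change the shape of the residual distribution.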

R Code

library(MASS)

data("birthwt")

data1 <- birthwt

head(data1)

summary(data1)

model = lm(bwt~.,data=data1)

summary(model)

## Backward stepwise elimination (by AIC) to select a reduced model

step(model, direction = "backward",trace = FALSE)

model2 = lm(bwt~low + race + smoke + ui,data=data1)

summary(model2)

plot(model2)
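Rather than retyping the selected formula, the result of step() can be stored and inspected, as in the sketch below. The object names full_model and selected are introduced here for illustration. Note that step() performs a greedy search using AIC by default, so it does not examine every combination of predictors.

```r
library(MASS)  # provides the birthwt data set

# Full model with every other column as a predictor
full_model <- lm(bwt ~ ., data = birthwt)

# Backward elimination by AIC; the returned object is the selected model
selected <- step(full_model, direction = "backward", trace = FALSE)
formula(selected)

# Refitting the selected formula by hand should give the same coefficients
model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)
all.equal(coef(selected), coef(model2))
```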

Output

> library(MASS)

> data("birthwt")

> data1 <- birthwt

> head(data1)

   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622

> summary(data1)

      low              age             lwt             race           smoke
 Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000   Min.   :0.0000
 1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000   1st Qu.:0.0000
 Median :0.0000   Median :23.00   Median :121.0   Median :1.000   Median :0.0000
 Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847   Mean   :0.3915
 3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000   3rd Qu.:1.0000
 Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000   Max.   :1.0000

      ptl               ht                ui              ftv              bwt
 Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   : 709
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2414
 Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000   Median :2977
 Mean   :0.1958   Mean   :0.06349   Mean   :0.1481   Mean   :0.7937   Mean   :2945
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3487
 Max.   :3.0000   Max.   :1.00000   Max.   :1.0000   Max.   :6.0000   Max.   :4990

>

> model = lm(bwt~.,data=data1)

> summary(model)

Call:

lm(formula = bwt ~ ., data = data1)

Residuals:
    Min      1Q  Median      3Q     Max
-991.22 -300.96   -5.39  277.74 1637.80

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3612.508    229.457  15.744  < 2e-16 ***
low         -1131.217     73.957 -15.296  < 2e-16 ***
age            -6.245      6.347  -0.984 0.326416
lwt             1.051      1.133   0.927 0.355085
race         -100.905     38.544  -2.618 0.009605 **
smoke        -174.116     72.000  -2.418 0.016597 *
ptl            81.340     68.552   1.187 0.236980
ht           -181.955    137.661  -1.322 0.187934
ui           -336.776     93.314  -3.609 0.000399 ***
ftv            -7.578     30.992  -0.245 0.807118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 433.7 on 179 degrees of freedom
Multiple R-squared: 0.6632, Adjusted R-squared: 0.6462
F-statistic: 39.16 on 9 and 179 DF, p-value: < 2.2e-16

>

> ## Backward stepwise elimination (by AIC) to select a reduced model

> step(model, direction = "backward",trace = FALSE)

Call:

lm(formula = bwt ~ low + race + smoke + ui, data = data1)

Coefficients:

(Intercept)          low         race        smoke           ui
    3586.50     -1139.20       -97.34      -157.42      -303.19

>

> model2 = lm(bwt~low + race + smoke + ui,data=data1)

> summary(model2)

Call:

lm(formula = bwt ~ low + race + smoke + ui, data = data1)

Residuals:
    Min      1Q  Median      3Q     Max
-1025.8  -351.0    30.8   285.8  1500.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3586.50      86.68  41.379  < 2e-16 ***
low         -1139.20      71.12 -16.019  < 2e-16 ***
race          -97.34      37.37  -2.605 0.009942 **
smoke        -157.42      70.38  -2.237 0.026510 *
ui           -303.19      90.02  -3.368 0.000922 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 432.5 on 184 degrees of freedom
Multiple R-squared: 0.6557, Adjusted R-squared: 0.6482
F-statistic: 87.59 on 4 and 184 DF, p-value: < 2.2e-16

orchestra answered 2 years ago
