Question

In this assignment, we will use the dataset collected in Baystate Medical Center, Springfield, Mass (1986), featured in Hosmer, D.W. and Lemeshow, S. (1989) Applied Logistic Regression. New York: Wiley. To download this dataset, go into R and run the following code to save the data as the object “data1”.

install.packages("MASS")

library(MASS)

data1 <- birthwt

Description of birthwt Data

low indicator of birth weight less than 2.5 kg.

age mother's age in years.

lwt mother's weight in pounds at the last menstrual period.

race mother's race (1 = white, 2 = black, 3 = others).

smoke smoking status during pregnancy.

ptl the number of previous premature labours.

ht history of hypertension.

ui presence of uterine irritability.

ftv the number of physician visits during the first trimester.

bwt birth weight in grams.

(a) Use R to help construct a final (best fit) model for the birth weight data, with bwt as the Y variable.
(i) Write down the model equation.
(ii) Use R to plot studentized residuals against predicted values and X variables. Discuss what you see.

(b) Verify that regression model assumptions are met. Do you think there is a need to do any adjustments/transformations? If yes, what do you suggest? (

Solutions

Expert Solution

a)

i)

Final Regression Equation:

bwt = 3586.50 - 1139.20 * low - 97.34 * race - 157.42 * smoke - 303.19 * ui
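
As a quick check of how the equation is used, the sketch below evaluates it by hand for two hypothetical mothers. The helper name predict_bwt and the example covariate values are illustrative assumptions, not part of the original solution; the coefficients are the rounded estimates from the equation above, and race enters as a numeric code exactly as in the fit.

```r
# Rounded coefficients copied from the final regression equation
coefs <- c(intercept = 3586.50, low = -1139.20, race = -97.34,
           smoke = -157.42, ui = -303.19)

# Hypothetical helper: predicted birth weight in grams
predict_bwt <- function(low, race, smoke, ui) {
  unname(coefs["intercept"] + coefs["low"] * low + coefs["race"] * race +
           coefs["smoke"] * smoke + coefs["ui"] * ui)
}

# White non-smoker, no uterine irritability, not flagged as low weight
predict_bwt(low = 0, race = 1, smoke = 0, ui = 0)  # 3489.16

# Black smoker with uterine irritability, flagged as low weight
predict_bwt(low = 1, race = 2, smoke = 1, ui = 1)  # 1792.01
```

With the fitted object available, predict(model2, newdata = ...) would give the same values up to rounding of the coefficients.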

ii)

Ideally, the residual plot shows no systematic pattern: the red smoothing line should be approximately horizontal at zero. A visible pattern may indicate a problem with some aspect of the linear model.

In our example, the residual plot shows some pattern. This suggests that a linear relationship between the predictors and the outcome variable cannot be assumed.
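
The question asks for studentized residuals against the predicted values and each X variable. A minimal sketch of one way to draw those plots is given here, assuming the final model from the solution; the object names stud_res and fit_vals and the 2 x 3 panel layout are illustrative choices.

```r
library(MASS)  # provides the birthwt data

# Final model from the solution
model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# Externally studentized residuals and fitted values
stud_res <- rstudent(model2)
fit_vals <- fitted(model2)

# One panel for the fitted values plus one per predictor
op <- par(mfrow = c(2, 3))
plot(fit_vals, stud_res, xlab = "Fitted bwt (g)",
     ylab = "Studentized residual", main = "vs fitted")
abline(h = 0, lty = 2)
for (v in c("low", "race", "smoke", "ui")) {
  plot(birthwt[[v]], stud_res, xlab = v,
       ylab = "Studentized residual", main = paste("vs", v))
  abline(h = 0, lty = 2)
}
par(op)  # restore the previous plotting layout
```

Points far from the dashed zero line, or a spread that changes across the x-axis, are the features to look for in each panel.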

b)

The diagnostic plots show residuals in four different ways:

  1. Residuals vs Fitted. Used to check the linearity assumption. A horizontal line without distinct patterns indicates a linear relationship, which is good.
  2. Normal Q-Q. Used to examine whether the residuals are normally distributed. It is good if the residual points follow the straight dashed line.
  3. Scale-Location (or Spread-Location). Used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a good indication of homoscedasticity. This is not the case in our example, where we have a heteroscedasticity problem.
  4. Residuals vs Leverage. Used to identify influential cases, that is, extreme values that might influence the regression results when included in or excluded from the analysis.
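
The four plots listed above can be drawn on one page, and the leverage plot can be complemented with Cook's distance. The following is a sketch only: the 4/n cutoff is a common rule of thumb assumed here for illustration, not something the original solution uses.

```r
library(MASS)  # provides the birthwt data

model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# The four default diagnostic plots in a 2 x 2 grid
op <- par(mfrow = c(2, 2))
plot(model2)
par(op)

# Cook's distance for every observation
cd <- cooks.distance(model2)

# Assumed rule-of-thumb cutoff: flag observations with distance above 4/n
flagged <- which(cd > 4 / nrow(birthwt))
head(sort(cd[flagged], decreasing = TRUE))
```

Flagged observations are candidates for a closer look, not automatic exclusions.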

Normal Q-Q

The Normal Q-Q plot shows that the residuals are not normally distributed.
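
The visual impression from the Q-Q plot can be supplemented with a formal test. The sketch below redraws the Q-Q plot from the standardized residuals and runs a Shapiro-Wilk test; using shapiro.test here is an added suggestion, not part of the original solution, and no particular p-value is claimed.

```r
library(MASS)  # provides the birthwt data

model2 <- lm(bwt ~ low + race + smoke + ui, data = birthwt)

# Standardized residuals, as used in the default Q-Q diagnostic plot
res <- rstandard(model2)

# Q-Q plot with a dashed reference line
qqnorm(res)
qqline(res, lty = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(res)
sw$p.value
```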

Homogeneity of variance

This assumption can be checked by examining the scale-location plot, also known as the spread-location plot.

plot(model2, 3)

It can be seen that the variability (variance) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variance in the residual errors (heteroscedasticity).

A possible solution to reduce the heteroscedasticity problem is to use a log or square root transformation of the outcome variable (y).
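
A sketch of how the suggested transformations could be tried is given here. It refits the same predictors with log(bwt) and sqrt(bwt) as the response and draws the three scale-location plots side by side; the object names model_raw, model_log, and model_sqrt are illustrative, and whether either transformation actually helps has to be judged from the plots.

```r
library(MASS)  # provides the birthwt data

# Same predictors, three versions of the response
model_raw  <- lm(bwt ~ low + race + smoke + ui, data = birthwt)
model_log  <- lm(log(bwt) ~ low + race + smoke + ui, data = birthwt)
model_sqrt <- lm(sqrt(bwt) ~ low + race + smoke + ui, data = birthwt)

# Scale-location plot (which = 3) for each fit
op <- par(mfrow = c(1, 3))
plot(model_raw,  which = 3, main = "bwt")
plot(model_log,  which = 3, main = "log(bwt)")
plot(model_sqrt, which = 3, main = "sqrt(bwt)")
par(op)
```

A flatter red line with more even vertical spread in a transformed panel would support using that transformation. Coefficients of the transformed models are on the log or square-root scale, so they are not directly comparable to the original equation.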

R Code

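# Load the MASS package, which ships the birthwt data set
library(MASS)
data("birthwt")
data1 <- birthwt

# Inspect the first rows and the variable summaries
head(data1)
summary(data1)

# Full model: regress bwt on all other variables
model <- lm(bwt ~ ., data = data1)
summary(model)

# Backward elimination by AIC, starting from the full model
step(model, direction = "backward", trace = FALSE)

# Refit the model selected by step()
model2 <- lm(bwt ~ low + race + smoke + ui, data = data1)
summary(model2)

# Diagnostic plots: residuals vs fitted, Q-Q, scale-location, leverage
plot(model2)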
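
The solution retypes the selected formula by hand after calling step(). Since step() returns the selected model as a fitted lm object, the result can instead be stored and inspected directly. This is a sketch of that variant; the names full and best are illustrative.

```r
library(MASS)  # provides the birthwt data

# Full model with every available predictor
full <- lm(bwt ~ ., data = birthwt)

# step() returns the selected model, so keep it instead of refitting
best <- step(full, direction = "backward", trace = FALSE)

formula(best)   # formula of the selected model
best$anova      # which terms were dropped at each step, with AIC
summary(best)   # coefficient table of the selected model
```

Storing the result avoids transcription errors when the selected formula is long.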

Output

> library(MASS)
> data("birthwt")
> data1 <- birthwt
> head(data1)
   low age lwt race smoke ptl ht ui ftv  bwt
85   0  19 182    2     0   0  0  1   0 2523
86   0  33 155    3     0   0  0  0   3 2551
87   0  20 105    1     1   0  0  0   1 2557
88   0  21 108    1     1   0  0  1   2 2594
89   0  18 107    1     1   0  0  1   0 2600
91   0  21 124    3     0   0  0  0   0 2622

> summary(data1)
      low              age             lwt             race           smoke
Min.   :0.0000   Min.   :14.00   Min.   : 80.0   Min.   :1.000   Min.   :0.0000
1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0   1st Qu.:1.000   1st Qu.:0.0000
Median :0.0000   Median :23.00   Median :121.0   Median :1.000   Median :0.0000
Mean   :0.3122   Mean   :23.24   Mean   :129.8   Mean   :1.847   Mean   :0.3915
3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0   3rd Qu.:3.000   3rd Qu.:1.0000
Max.   :1.0000   Max.   :45.00   Max.   :250.0   Max.   :3.000   Max.   :1.0000
      ptl               ht                ui              ftv              bwt
Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   : 709
1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2414
Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000   Median :2977
Mean   :0.1958   Mean   :0.06349   Mean   :0.1481   Mean   :0.7937   Mean   :2945
3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3487
Max.   :3.0000   Max.   :1.00000   Max.   :1.0000   Max.   :6.0000   Max.   :4990

>

> model = lm(bwt~.,data=data1)
> summary(model)

Call:
lm(formula = bwt ~ ., data = data1)

Residuals:
    Min      1Q  Median      3Q     Max
-991.22 -300.96   -5.39  277.74 1637.80

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  3612.508    229.457  15.744  < 2e-16 ***
low         -1131.217     73.957 -15.296  < 2e-16 ***
age            -6.245      6.347  -0.984 0.326416
lwt             1.051      1.133   0.927 0.355085
race         -100.905     38.544  -2.618 0.009605 **
smoke        -174.116     72.000  -2.418 0.016597 *
ptl            81.340     68.552   1.187 0.236980
ht           -181.955    137.661  -1.322 0.187934
ui           -336.776     93.314  -3.609 0.000399 ***
ftv            -7.578     30.992  -0.245 0.807118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 433.7 on 179 degrees of freedom
Multiple R-squared: 0.6632, Adjusted R-squared: 0.6462
F-statistic: 39.16 on 9 and 179 DF, p-value: < 2.2e-16

>

> ## To find best model for all the combination of regression independent variables
> step(model, direction = "backward",trace = FALSE)

Call:
lm(formula = bwt ~ low + race + smoke + ui, data = data1)

Coefficients:
(Intercept)          low         race        smoke           ui
    3586.50     -1139.20       -97.34      -157.42      -303.19

>

> model2 = lm(bwt~low + race + smoke + ui,data=data1)
> summary(model2)

Call:
lm(formula = bwt ~ low + race + smoke + ui, data = data1)

Residuals:
    Min      1Q  Median      3Q     Max
-1025.8  -351.0    30.8   285.8  1500.8

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3586.50      86.68  41.379  < 2e-16 ***
low         -1139.20      71.12 -16.019  < 2e-16 ***
race          -97.34      37.37  -2.605 0.009942 **
smoke        -157.42      70.38  -2.237 0.026510 *
ui           -303.19      90.02  -3.368 0.000922 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 432.5 on 184 degrees of freedom
Multiple R-squared: 0.6557, Adjusted R-squared: 0.6482
F-statistic: 87.59 on 4 and 184 DF, p-value: < 2.2e-16

