Using SAS is preferred but not required. Please provide SAS code and the relevant output if SAS is used.
• Link to referenced FEV.CSV: https://drive.google.com/open?id=1t1CRIbnTE7xL_OE9Bmajb564RXcg8nDo
For all hypothesis testing problems:
• state the null and alternative hypotheses,
• calculate the value of the test statistic,
• determine if the results are statistically significant (using rejection region or p-value approaches),
• then state your conclusion in terms of the problem.
1. FEV (forced expiratory volume) is an index of pulmonary function that measures the volume of air expelled after one second of constant effort. The data fev.csv contains determinations of FEV on 654 children ages 3-19 who were seen in the Childhood Respiratory Disease Study in East Boston, Massachusetts. The variables in the data include:
ID: subject ID number
Age: age in years
FEV: FEV in liters
Height: height in inches
Sex: Male or Female
Smoker: non = nonsmoker, Current = current smoker
a. Make a boxplot of the FEV for children with age 3-8 years, 9-12 years, and 13 years or above. Does it appear that the FEV is the same for children from these three age groups?
b. Is FEV the same across the three age groups? Perform a hypothesis test to answer the question. Use alpha = 0.05.
c. Rank the three age groups using the multiple comparison approach.
d. What assumptions are made with regard to the analysis in part b? Check whether these assumptions are violated.
e. Is FEV more strongly related to sex or smoking status? Carry out appropriate statistical analysis to answer the question.
f. The investigator is also interested in how height is associated with age. Construct the scatter plot of height against age. What is the relationship between height and age?
g. Regardless of what you observed in f, fit the regression model with height as the response and age as the independent variable. What is the fitted regression equation?
h. Test whether there is a positive correlation between age and height. Perform the hypothesis test using alpha = 0.05.
i. Is it appropriate to use the above regression model? Why or why not?
a. Make a boxplot of the FEV for children with age 3-8 years, 9-12 years, and 13 years or above. Does it appear that the FEV is the same for children from these three age groups?
I used Excel to construct box plots of FEV for the three age groups: 3-8 years, 9-12 years, and 13 years or above. First, sort the data from youngest to oldest and divide it into the three age groups. Then choose Insert > Box & Whisker.
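The grouping step can also be sketched in code. This is a minimal Python illustration, not the Excel procedure used here: the helper name `age_group` and the rows in `toy_rows` are assumptions introduced for the example, and the toy values are made up rather than read from fev.csv.

```python
from collections import defaultdict


def age_group(age):
    """Map an age in years to one of the three groups used in part a."""
    if age <= 8:
        return "3-8"
    if age <= 12:
        return "9-12"
    return ">=13"


# Made-up (age, FEV) rows for illustration only; NOT taken from fev.csv.
toy_rows = [(5, 1.4), (8, 2.0), (9, 2.3), (12, 3.1), (13, 3.4), (17, 4.2)]

# Collect the FEV values under their age-group label.
fev_by_group = defaultdict(list)
for age, fev in toy_rows:
    fev_by_group[age_group(age)].append(fev)

for label in ("3-8", "9-12", ">=13"):
    print(label, fev_by_group[label])
```

With the real data, each list in `fev_by_group` would supply the values for one box in the plot.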
No, FEV does not appear to be the same for children in these three age groups.
Findings from boxplot:
b. Is FEV the same across the three age groups? Perform a hypothesis test to answer the question. Use alpha = 0.05.
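To test for a difference in mean FEV among the three age groups, we performed a one-way ANOVA on the data.
The null and alternative hypotheses:
H0: the mean FEV is the same in all three age groups.
Ha: the mean FEV differs in at least one age group.
Calculate the value of the test statistic:
Excel > Data > Data Analysis (add-in) > Anova: Single Factor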
Anova: Single Factor
SUMMARY
Groups | Count | Sum | Average | Variance
FEV Age 3-8 | 215 | 399.519 | 1.858228 | 0.176462
FEV Age 9-12 | 322 | 903.802 | 2.806839 | 0.410088
FEV Age >=13 | 117 | 421.133 | 3.599427 | 0.633303
ANOVA
Source of Variation | SS | df | MS | F | P-value | F crit
Between Groups | 248.0558 | 2 | 124.0279 | 332.4582 | 3.2E-100 | 3.00956
Within Groups | 242.8641 | 651 | 0.373063
Total | 490.9198 | 653
F(2,651) = 332.4582, p = 3.2E-100
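The F statistic can be cross-checked from the SUMMARY table alone. The following Python sketch is a hand check under the usual one-way ANOVA formulas, not part of the Excel workflow; the variable names are chosen for the example, and the inputs are the rounded counts, means, and variances reported by Excel.

```python
# Group summaries copied from the Excel SUMMARY table: (count, mean, variance).
groups = {
    "3-8": (215, 1.858228, 0.176462),
    "9-12": (322, 2.806839, 0.410088),
    ">=13": (117, 3.599427, 0.633303),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between-group SS: weighted squared distance of each group mean from the grand mean.
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
# Within-group SS: each sample variance times its degrees of freedom.
ss_within = sum((n - 1) * var for n, _, var in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between},{df_within}) = {f_stat:.2f}")
```

Because the inputs are rounded, the result agrees with the Excel value only to a few decimal places.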
Determine if the results are statistically significant (using rejection region or p-value approaches):
The p-value (3.2E-100) is far below alpha = 0.05, and the F statistic (332.4582) far exceeds the critical value of 3.00956, so the result is statistically significant under both approaches.
State your conclusion in terms of the problem:
Since the ANOVA shows a significant between-group difference, we reject the null hypothesis and conclude that mean FEV differs among the three age groups.
c. Rank the three age groups using the multiple comparison approaches.
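Excel has no built-in function for post hoc tests such as Tukey's, so I performed a t test for each pair of groups.
3-8 vs 9-12 p = 8.64E-45
9-12 vs >=13 p = 2.47E-23
3-8 vs >=13 p = 2.57E-50
All three pairwise differences are significant. Ranked by mean FEV from highest to lowest, the groups are >=13 (3.60), 9-12 (2.81), and 3-8 (1.86).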
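Running several pairwise tests inflates the overall error rate, so a multiple-comparison adjustment is usually applied. The Python sketch below uses a Bonferroni adjustment, which is an assumption of this example and not the method used in the answer. The p-values and means are copied from the output above.

```python
alpha = 0.05

# Pairwise p-values as reported above.
pairwise_p = {
    ("3-8", "9-12"): 8.64e-45,
    ("9-12", ">=13"): 2.47e-23,
    ("3-8", ">=13"): 2.57e-50,
}

# Bonferroni: split alpha evenly across the comparisons.
alpha_adjusted = alpha / len(pairwise_p)
significant = {pair: p < alpha_adjusted for pair, p in pairwise_p.items()}

# Group means from the ANOVA SUMMARY table, used to order the groups.
means = {"3-8": 1.858228, "9-12": 2.806839, ">=13": 3.599427}
ranking = sorted(means, key=means.get, reverse=True)

print(f"adjusted alpha = {alpha_adjusted:.4f}")
print("all pairs significant:", all(significant.values()))
print("ranking by mean FEV:", " > ".join(ranking))
```

Every reported p-value is many orders of magnitude below the adjusted threshold, so the adjustment does not change the ranking.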
d. What assumptions are made with regard to the analysis in part b? Check whether these assumptions are violated.
One-way ANOVA makes three main assumptions:
• Normality: the dependent variable is normally distributed in each group being compared.
• Homogeneity of variances: the population variances of the groups are equal.
• Independence of observations: this is mostly a study design issue, so you need to judge from the design whether the observations could be dependent.
We carried out a descriptive analysis of FEV in the three age groups. The results are as follows:
Statistic | FEV Age 3-8 | FEV Age 9-12 | FEV Age >=13
Mean | 1.858228 | 2.806839 | 3.59942735
Standard Error | 0.028649 | 0.035687 | 0.07357204
Median | 1.79 | 2.756 | 3.519
Mode | 1.624 | 2.352 | 3.297
Standard Deviation | 0.420073 | 0.640381 | 0.795803284
Sample Variance | 0.176462 | 0.410088 | 0.633302868
Kurtosis | -0.0875 | 0.804245 | -0.227353716
Skewness | 0.242282 | 0.693698 | 0.411303102
Range | 2.202 | 3.766 | 3.595
Minimum | 0.791 | 1.458 | 2.198
Maximum | 2.993 | 5.224 | 5.793
Sum | 399.519 | 903.802 | 421.133
Count | 215 | 322 | 117
Confidence Level(95.0%) | 0.05647 | 0.07021 | 0.145718695
Shapiro-Wilk Test
Statistic | FEV Age 3-8 | FEV Age 9-12 | FEV Age >=13
W-stat | 0.988854391 | 0.97202266 | 0.977644226
p-value | 0.093440645 | 6.72808E-06 | 0.047848341
alpha | 0.05 | 0.05 | 0.05
normal | yes | no | no
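We also ran the Shapiro-Wilk test using the RealStats add-in in Excel. It shows that FEV is not normally distributed in two of the age groups (9-12 and >=13), so the normality assumption is violated for those groups.
The sample variances of the three groups also differ (0.176, 0.410, and 0.633), which casts doubt on the homogeneity-of-variances assumption.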
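The two checks above reduce to simple comparisons, sketched here in Python with the values copied from the tables. The largest-to-smallest variance ratio is an informal screen introduced for this example, not a formal test such as Levene's.

```python
alpha = 0.05

# Shapiro-Wilk p-values and sample variances copied from the tables above.
shapiro_p = {"3-8": 0.093440645, "9-12": 6.72808e-06, ">=13": 0.047848341}
variances = {"3-8": 0.176462, "9-12": 0.410088, ">=13": 0.633303}

# Normality is not rejected when the p-value is at least alpha.
looks_normal = {group: p >= alpha for group, p in shapiro_p.items()}

# Informal spread check: how many times larger is the biggest variance?
variance_ratio = max(variances.values()) / min(variances.values())

for group, ok in looks_normal.items():
    print(f"{group}: normality {'not rejected' if ok else 'rejected'}")
print(f"largest/smallest variance ratio = {variance_ratio:.2f}")
```

A ratio this far from 1 suggests following up with a formal homogeneity test before relying on the equal-variance assumption.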
e. Is FEV is more strongly related to sex or smoking status? Carry out appropriate statistical analysis to answer the question.
Using Excel > Add-ins > RealStats > Data Analysis > Corr > Correlation test, we ran correlation tests between FEV and sex and between FEV and smoking status.
RESULT of correlation test on FEV and SMOKING
Correlation Coefficients
Pearson | -0.245424571
Spearman | -0.258349236
Kendall | -0.211145277
Statistic | Pearson's coeff (t test) | Pearson's coeff (Fisher)
Rho | | 0
Alpha | 0.05 | 0.05
Tails | 2 | 2
corr | -0.245424571 | -0.245424571
std err | 0.037965248 | 0.039133024
test statistic | t = -6.464453173 | z = -6.392409034
p-value | 1.99285E-10 | 1.63292E-10
lower | -0.319973478 | -0.316142406
upper | -0.170875664 | -0.171994479
RESULT of correlation test on FEV and SEX
Correlation Coefficients
Pearson | -0.20841
Spearman | -0.14364
Kendall | -0.11739
Statistic | Pearson's coeff (t test) | Pearson's coeff (Fisher)
Rho | | 0
Alpha | 0.05 | 0.05
Tails | 2 | 2
corr | -0.20841 | -0.208414959
std err | 0.038303 | 0.039133024
test statistic | t = -5.44121 | z = -5.39671039
p-value | 7.5E-08 | 6.78738E-08
lower | -0.28363 | -0.28059776
upper | -0.1332 | -0.13388797
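The t statistics in the two correlation tables follow from the Pearson coefficients and the sample size. The Python sketch below recomputes them and compares the absolute correlations. The helper name `t_statistic` is an assumption of this example, and n = 654 is taken from the problem statement.

```python
import math

n = 654  # number of children, per the problem statement

# Pearson coefficients copied from the two RealStats outputs.
pearson_r = {"smoking": -0.245424571, "sex": -0.208414959}


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


for name, r in pearson_r.items():
    print(f"{name}: |r| = {abs(r):.3f}, t = {t_statistic(r, n):.2f}")

# The variable with the larger absolute correlation is the more strongly related one.
stronger = max(pearson_r, key=lambda name: abs(pearson_r[name]))
print("more strongly related to FEV:", stronger)
```

Since sex and smoking status are binary, each Pearson coefficient here is a point-biserial correlation, and its sign depends only on how the two categories were coded.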
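Conclusion: Both correlations are statistically significant (p < 0.05). The correlation with smoking status is larger in absolute value than the correlation with sex (Pearson |r| = 0.245 vs 0.208), so FEV is more strongly related to smoking status.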
f. The investigator is also interested in how height is associated with age. Construct the scatter plot of height against age. What is the relationship between height and age?
Height and age have a positive, approximately linear relationship, with an R square of 0.6272.
g. Regardless of what you observed in f, fit the regression model with height as the response and age as the independent variable. What is the fitted regression equation?
Regression Analysis of height as response and Age as the independent variable
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.791943602
R Square | 0.627174669
Adjusted R Square | 0.626602851
Standard Error | 3.485201717
Observations | 654
ANOVA
Source | df | SS | MS | F | Significance F
Regression | 1 | 13322.52461 | 13322.52461 | 1096.808 | 8.0598E-142
Residual | 652 | 7919.603418 | 12.14663101
Total | 653 | 21242.12803
Term | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95%
Intercept | 45.95779737 | 0.478358112 | 96.07404205 | 0 | 45.01848904 | 46.89710571
Age | 1.529099387 | 0.046171116 | 33.1180949 | 8.1E-142 | 1.438437365 | 1.619761409
The fitted regression equation is
height = 45.96 + 1.53*Age
Thus, for every one-year increase in age, predicted height increases by about 1.53 inches. The regression is significant with p < 0.00000001. The coefficient of determination, R square = 0.627, implies that about 63% of the variance in height is explained by age.
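The quantities in the regression output are linked by simple identities, which makes the fit easy to sanity-check. The Python sketch below recomputes the slope's t statistic, R square, and F from the reported values and evaluates the fitted line at one age. The helper `predict_height` and the example age of 10 are assumptions added for illustration.

```python
# Values copied from the Excel regression output.
intercept, slope = 45.95779737, 1.529099387
slope_se = 0.046171116
ss_regression, ss_total = 13322.52461, 21242.12803
ms_residual = 12.14663101


def predict_height(age_years):
    """Predicted height in inches from the fitted line."""
    return intercept + slope * age_years


t_slope = slope / slope_se             # coefficient divided by its standard error
r_squared = ss_regression / ss_total   # share of total variation explained
f_stat = ss_regression / ms_residual   # regression df = 1, so MS equals SS

print(f"t = {t_slope:.2f}, R^2 = {r_squared:.4f}, F = {f_stat:.1f}")
print(f"predicted height at age 10: {predict_height(10):.2f} inches")
```

In a simple regression with one predictor, F equals the square of the slope's t statistic, which gives one more consistency check.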
h. Test whether there is a positive correlation between age and height. Perform the hypothesis test using alpha = 0.05.
Correlation test on height and age
Correlation Coefficients
Pearson | 0.791973295
Spearman | 0.818796185
Kendall | 0.660161965
Statistic | Pearson's coeff (t test) | Pearson's coeff (Fisher)
Rho | | 0
Alpha | 0.05 | 0.05
Tails | 2 | 2
corr | 0.791973295 | 0.791973295
std err | 0.023929566 | 0.039163022
test statistic | t = 33.09601623 | z = 27.450651
p-value | 0 | 6.8244E-166
lower | 0.744984849 | 0.761521493
upper | 0.838961742 | 0.818936333
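The null hypothesis is that the correlation between age and height is zero or negative (rho <= 0). The alternative is that it is positive (rho > 0).
The Pearson correlation coefficient is 0.79, with t = 33.10. The two-tailed p-value is reported as 0, so the one-tailed p-value is also far below alpha = 0.05.
We reject the null hypothesis and conclude that there is a strong positive correlation between age and height.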
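Because the question asks about a positive correlation, a one-sided rejection region applies. The Python sketch below compares the reported t statistic with an upper-tail critical value. Using the standard normal quantile in place of the exact t quantile is an assumption of this example, chosen because the degrees of freedom are in the hundreds and the two are then very close.

```python
from statistics import NormalDist

# Values copied from the RealStats t-test column.
r = 0.791973295
std_err = 0.023929566
alpha = 0.05

t_stat = r / std_err

# Upper-tail critical value; normal approximation to the t distribution (assumption).
critical = NormalDist().inv_cdf(1 - alpha)

reject_h0 = t_stat > critical
print(f"t = {t_stat:.2f}, one-sided critical value = {critical:.3f}")
print("reject H0 (rho <= 0):", reject_h0)
```

The test statistic is so far beyond the critical value that the choice between the normal and t quantiles does not affect the decision.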