In two wards for elderly patients in a local hospital, the following levels of hemoglobin (grams per liter) were found for a simple random sample of patients from each ward.

Ward A:
```{r}
ward_a <- c(12.2, 11.1, 14.0, 11.3, 10.8, 12.5, 12.2, 11.9, 13.6, 12.7, 13.4, 13.7)
```
Ward B:
```{r}
ward_b <- c(11.9, 10.7, 12.3, 13.9, 11.1, 11.2, 13.3, 11.4, 12.0, 11.1)
```

2.a) [2 points] Make two box plots to compare the hemoglobin values for Ward A and Ward B. Overlay the boxplots with their raw data. Notice the similarities/differences portrayed by the plots, keeping in mind that the sample size is relatively small for these two wards.
```{r make-data-frame}
hemoglobin <- data.frame(hemo_level = c(ward_a, ward_b), ward = c(rep("Ward A", 12), rep("Ward B", 10)))
```
```{r make-box-plot}
# Your code here.
```

2.b) [2 points] What two assumptions do you need to make to use any of the t-procedures? Because each ward has a rather small sample size (n ≤ 12 for both), what two characteristics of the data would you need to check for to ensure that the t-procedures can be applied?
Assumption 1:
Assumption 2:

2.c) [4 points] Using only `dplyr` and `*t` functions, create a 95% confidence interval for the mean difference between Ward A and Ward B. You can do this by using `dplyr` to calculate the inputs required to calculate the 95% CI, and then plugging these values in on a separate line of code (or using your calculator). Use a degrees of freedom of 19.515 (you don't need to calculate the degrees of freedom; you can use this value directly). Show your work and interpret the mean difference and its 95% CI.
```{r}
# Your code here.
```
Write your 1-2 sentence answer here.

2.d) [4 points] Perform a two-sided t-test for the difference between the two samples. Start by writing down the null and alternate hypotheses, then calculate the test statistic by hand (showing your work) and the p-value. Continue to assume that the degrees of freedom is 19.515.
Verify the p-value by running the t-test using R's built-in function. Show the output from that test. Hint: to perform the t-test using R's built-in function, you need to pass the function an x and a y argument, where x includes the values for Ward A and y includes the values for Ward B. `dplyr`'s `filter()` and `pull()` functions will be your friends.
$H_{0}$
$H_{A}$
Test statistic =
P-value =
```{r}
# Your t-test code here.
```
R-commands and outputs:
ward_a <- c(12.2, 11.1, 14.0, 11.3, 10.8, 12.5, 12.2, 11.9,
13.6, 12.7, 13.4, 13.7)
ward_b <- c(11.9, 10.7, 12.3, 13.9, 11.1, 11.2, 13.3, 11.4,
12.0, 11.1)
### Box plots to compare the haemoglobin values for Ward A and Ward B.
> summary(ward_a)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.80 11.75 12.35 12.45 13.45 14.00
> summary(ward_b)
Min. 1st Qu. Median Mean 3rd Qu. Max.
10.70 11.12 11.65 11.89 12.22 13.90
par(mfrow=c(1,2))
boxplot(ward_a,main="Ward A")
boxplot(ward_b,main="Ward B")
### From the boxplots, we observe that the median haemoglobin of Ward A (12.35) is greater than that of Ward B (11.65).
### The two distributions differ in location and shape: Ward A's quartiles are all higher than Ward B's.
### The range (maximum value - minimum value) is the same for both wards: 14.0 - 10.8 = 3.2 for Ward A and 13.9 - 10.7 = 3.2 for Ward B.
### Moreover, the boxplot of Ward B has a longer upper whisker, indicating right skewness in the data.
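Part 2.a also asks for the raw data to be overlaid on the box plots, which the two separate `boxplot()` calls above do not show. A minimal sketch using base R's `boxplot()` and `stripchart()` on a shared axis; the jitter amount, point symbol, and axis label are illustrative choices rather than part of the original answer:

```r
ward_a <- c(12.2, 11.1, 14.0, 11.3, 10.8, 12.5, 12.2, 11.9,
            13.6, 12.7, 13.4, 13.7)
ward_b <- c(11.9, 10.7, 12.3, 13.9, 11.1, 11.2, 13.3, 11.4,
            12.0, 11.1)
hemoglobin <- data.frame(hemo_level = c(ward_a, ward_b),
                         ward = c(rep("Ward A", 12), rep("Ward B", 10)))

# Side-by-side box plots on a common y-axis, one box per ward.
# outline = FALSE avoids drawing any outlier twice once the raw points are added.
bp <- boxplot(hemo_level ~ ward, data = hemoglobin,
              ylab = "Hemoglobin level", outline = FALSE)

# Overlay the raw observations, jittered horizontally so ties stay visible
stripchart(hemo_level ~ ward, data = hemoglobin, vertical = TRUE,
           method = "jitter", jitter = 0.1, pch = 16, add = TRUE)
```

Plotting both wards on one axis makes the shift in medians easier to compare than two panels with independent scales.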
hemoglobin <- data.frame(hemo_level = c(ward_a, ward_b), ward
= c(rep("Ward A", 12), rep("Ward B", 10)))
hemoglobin
hemo_level ward
1 12.2 Ward A
2 11.1 Ward A
3 14.0 Ward A
4 11.3 Ward A
5 10.8 Ward A
6 12.5 Ward A
7 12.2 Ward A
8 11.9 Ward A
9 13.6 Ward A
10 12.7 Ward A
11 13.4 Ward A
12 13.7 Ward A
13 11.9 Ward B
14 10.7 Ward B
15 12.3 Ward B
16 13.9 Ward B
17 11.1 Ward B
18 11.2 Ward B
19 13.3 Ward B
20 11.4 Ward B
21 12.0 Ward B
22 11.1 Ward B
### Assumptions to use any of the t-procedures
Assumption 1: The data must be measured on a continuous or ordinal scale.
Assumption 2: The data are collected randomly from the population, generally as a simple random sample, and the two samples are independent of each other.
Assumption 3: There is homogeneity of variance. [Homogeneous, or equal, variance exists when the standard deviations of the samples are approximately equal.] This is required only for the pooled-variance t-test; the Welch t-test used below does not require it.
Assumption 4: For small n, the data must be approximately normally distributed, so we check that there is no strong skewness and that there are no outliers. For large n, the t-procedures are robust to departures from normality.
### We check homogeneity
sd(ward_a)
#[1] 1.068133
sd(ward_b)
#[1] 1.032204
### Standard deviations for both wards are similar.
### Normality test
### H0: data are normally distributed
> shapiro.test(ward_a)
Shapiro-Wilk normality test
data: ward_a
W = 0.94815, p-value = 0.6102
> shapiro.test(ward_b)
Shapiro-Wilk normality test
data: ward_b
W = 0.89842, p-value = 0.2105
### The p-value for both wards is greater than alpha=0.05, so we fail to reject H0.
### There is no evidence that either ward's data depart from normality.
### Thus, the assumptions of the t-test are reasonably satisfied.
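With samples this small, a normality test has little power, so it also helps to check directly for the two features that most affect t-procedures: outliers and strong skewness. A minimal sketch, assuming the usual 1.5 x IQR box plot rule (as implemented by `boxplot.stats()`) as the outlier criterion and the mean-minus-median gap as a rough skewness indicator; both are illustrative choices:

```r
ward_a <- c(12.2, 11.1, 14.0, 11.3, 10.8, 12.5, 12.2, 11.9,
            13.6, 12.7, 13.4, 13.7)
ward_b <- c(11.9, 10.7, 12.3, 13.9, 11.1, 11.2, 13.3, 11.4,
            12.0, 11.1)

# Values lying more than 1.5 x IQR beyond the hinges (box plot outlier rule)
out_a <- boxplot.stats(ward_a)$out
out_b <- boxplot.stats(ward_b)$out
out_a
out_b

# Rough skewness indicator: a positive gap suggests a longer right tail
skew_a <- mean(ward_a) - median(ward_a)
skew_b <- mean(ward_b) - median(ward_b)

# Normal Q-Q plots for a visual check of each ward
par(mfrow = c(1, 2))
qqnorm(ward_a, main = "Ward A"); qqline(ward_a)
qqnorm(ward_b, main = "Ward B"); qqline(ward_b)
```

For these data the rule flags no points in either ward, and the gap is larger for Ward B, consistent with the longer upper whisker seen in its boxplot.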
### t-test
## H0: mean of ward A = mean of ward B, i.e. d = (mean of ward A - mean of ward B) = 0, where d = difference in population means
## Null Hypothesis--H0: d=0 [There is no difference between the mean haemoglobin levels of the two wards.]
## Alternative Hypothesis--H1: d is not equal to zero.
> t.test(ward_a, ward_b, alternative = "two.sided")
Welch Two Sample t-test
data: ward_a and ward_b
t = 1.2472, df = 19.515, p-value = 0.2271
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.3781372 1.4981372
sample estimates:
mean of x mean of y
12.45 11.89
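## The 95% CI for the difference in means (Ward A - Ward B) is [-0.3781372, 1.4981372].
## We are 95% confident that the mean haemoglobin level in Ward A is between 0.38 lower and 1.50 higher than in Ward B. The interval contains 0.
## p-value = 0.2271, which is greater than alpha=0.05, so we fail to reject H0.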
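Part 2.c asks for the interval to be built from `dplyr` summaries and `qt()` rather than read off the `t.test()` output. A sketch of one way to do that, assuming the `dplyr` package is installed and using the stated degrees of freedom of 19.515; the names `ward_stats`, `se`, and `ci` are illustrative:

```r
library(dplyr)

ward_a <- c(12.2, 11.1, 14.0, 11.3, 10.8, 12.5, 12.2, 11.9,
            13.6, 12.7, 13.4, 13.7)
ward_b <- c(11.9, 10.7, 12.3, 13.9, 11.1, 11.2, 13.3, 11.4,
            12.0, 11.1)
hemoglobin <- data.frame(hemo_level = c(ward_a, ward_b),
                         ward = c(rep("Ward A", 12), rep("Ward B", 10)))

# Per-ward sample size, mean, and variance
ward_stats <- hemoglobin %>%
  group_by(ward) %>%
  summarise(n_obs = n(),
            mean_hemo = mean(hemo_level),
            var_hemo = var(hemo_level))

a <- filter(ward_stats, ward == "Ward A")
b <- filter(ward_stats, ward == "Ward B")

# Point estimate and its standard error (unpooled, as in the Welch test)
diff_means <- a$mean_hemo - b$mean_hemo
se <- sqrt(a$var_hemo / a$n_obs + b$var_hemo / b$n_obs)

# 95% CI: estimate +/- t critical value times standard error
t_crit <- qt(0.975, df = 19.515)
ci <- diff_means + c(-1, 1) * t_crit * se
ci
```

The result should agree with the interval reported by `t.test()` above up to rounding of the degrees of freedom.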
## Computation step-by-step
## The two samples are independent and of unequal size (12 and 10), so the
## observations cannot be paired; we use the two-sample (Welch) statistic.
nA=length(ward_a)
nB=length(ward_b)
dbar=mean(ward_a)-mean(ward_b) #difference in sample means
dbar
[1] 0.56
se=sqrt(var(ward_a)/nA+var(ward_b)/nB) #standard error of the difference
round(se,4)
[1] 0.449
test=dbar/se
round(test,4)
[1] 1.2472
alpha=0.05
qt(1-alpha/2,df=19.515)
[1] 2.089292
pval=2*pt(-abs(test),df=19.515) #two-sided p-value
round(pval,4)
[1] 0.2271
## Calculated value of statistic=test=1.2472
## Tabulated value of statistic=2.089292
## As the calculated value is less than the tabulated value, we fail to reject H0.
## The difference between the mean haemoglobin levels of the two wards is not significant.
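The question's hint suggests extracting each ward's values from the `hemoglobin` data frame with `dplyr`'s `filter()` and `pull()` before calling `t.test()`. A minimal sketch of that route, assuming `dplyr` is installed; the object names `x`, `y`, and `res` are illustrative:

```r
library(dplyr)

ward_a <- c(12.2, 11.1, 14.0, 11.3, 10.8, 12.5, 12.2, 11.9,
            13.6, 12.7, 13.4, 13.7)
ward_b <- c(11.9, 10.7, 12.3, 13.9, 11.1, 11.2, 13.3, 11.4,
            12.0, 11.1)
hemoglobin <- data.frame(hemo_level = c(ward_a, ward_b),
                         ward = c(rep("Ward A", 12), rep("Ward B", 10)))

# Pull each ward's measurements out of the data frame as plain numeric vectors
x <- hemoglobin %>% filter(ward == "Ward A") %>% pull(hemo_level)
y <- hemoglobin %>% filter(ward == "Ward B") %>% pull(hemo_level)

# Two-sided two-sample t-test; var.equal defaults to FALSE (Welch)
res <- t.test(x, y, alternative = "two.sided")
res$statistic   # test statistic
res$parameter   # degrees of freedom
res$p.value     # two-sided p-value
```

Because `t.test()` does not assume equal variances by default, this is the Welch test, and its statistic, degrees of freedom, and p-value can be compared directly with the values used in the hand computation.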