4) We're going to test the same hypothesis four ways. Assume the
people in the dataset in armspanSpring2020.csv are a random sample
of all adults. For each test, report the test statistic and the
p-value. With a 5% significance level, give the conclusion of each
test.
a) Test the hypothesis that the mean difference between armspan and
height is not equal to 0, using the data in armspanSpring2020.csv.
Do this by creating a new variable named diff = (armspan - height).
Perform a one-sample t-test.
b) Test the same hypothesis, but use a two-sample t-test with
paired=TRUE.
c) Test the same hypothesis, but use a two-sample t-test with
paired=FALSE and var.equal=FALSE.
d) Test the same hypothesis, but use a two-sample t-test with
paired=FALSE and var.equal=TRUE.
e) Which test(s) do you think are valid for this situation and
why?
hint: We almost never use the var.equal=TRUE test. Why? Because it
is only valid if the population standard deviations of both
populations are equal. You might be in a situation where you know
this to be true. If so, fine, use it. But usually we don't, in
which case (a) the var.equal=FALSE test will provide more accurate
p-values if the standard deviations are not equal and (b) will
provide pretty accurate p-values if they are. So you can't lose,
really, with the var.equal=FALSE test, but you can lose with it the
other way.
g) Data cleaning. Identify by row number which observations seem to be in need of cleaning and why you think so. Provide a table. (Hint: consider the "which()" and "identify()" functions.) Provide a graph to justify your identifications.
height | armspan | is.female |
67 | NA | 1 |
70 | 40 | 0 |
64 | 67 | 1 |
71 | 70 | 0 |
72 | 49 | 0 |
62 | 61 | 1 |
72 | 74 | 0 |
71 | 68 | 0 |
63 | 60 | 1 |
69 | 69 | 0 |
67 | 68 | 1 |
63 | 63 | 1 |
60 | 60 | 1 |
66 | 66 | 0 |
61 | 61 | 1 |
69 | 68 | 0 |
65 | 65 | 1 |
72 | 72 | 0 |
70 | 70 | 0 |
73 | 77 | 0 |
65 | 61 | 1 |
68 | 72 | 1 |
62 | 55 | NA |
71 | 74 | 0 |
72 | 70 | 0 |
66 | 22 | 1 |
65 | 67 | 1 |
64 | 62 | 0 |
65 | 62 | 1 |
73 | 69 | 0 |
67 | 77 | 0 |
60 | 62 | 1 |
70 | 59 | 0 |
68 | 66 | 1 |
65 | 65 | 1 |
72 | 69 | 0 |
62 | 52 | 1 |
69 | 66 | 0 |
68 | 67 | 0 |
65 | 66 | 1 |
65 | 64 | 0 |
66 | 65 | 1 |
62 | 52 | 1 |
64 | 62 | 1 |
66 | 65 | 1 |
69 | 69 | 0 |
64 | 65 | 1 |
70 | 74 | 0 |
65 | 69 | 0 |
70 | 80 | 0 |
63 | NA | 1 |
67 | 70 | 1 |
64 | 64 | 1 |
64 | 62 | 1 |
6 | 5.7 | 0 |
67 | 67 | 1 |
72 | 71 | 0 |
73 | 75 | 0 |
68 | 68 | 0 |
67 | 63 | 1 |
66 | 67 | 1 |
67 | 36 | 0 |
68 | 72 | 0 |
73 | 70 | 0 |
70 | 70 | 0 |
70 | 72 | 0 |
60 | 58 | 0 |
70 | 68 | 0 |
62 | 63 | 0 |
68 | 68 | 1 |
67 | 67 | NA |
68 | 71 | 0 |
65 | 48 | 1 |
70 | 76 | 0 |
69 | 70 | 0 |
69 | 66 | 0 |
58 | 55 | NA |
64 | 64 | 0 |
Please help with the R code. It is my first time using RStudio and I'm having a hard time. Thanks!
We are given a data set of heights and armspans.
The sample size is greater than 30,
so the t distribution is well approximated by the normal (z) distribution, and p-values from either will be close.
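a)
Let us define
diff = armspan - height
We perform a one-sample t-test on diff.
Therefore, we test
$H_0: \mu_d = 0$
against
$H_1: \mu_d \neq 0$,
where $\mu_d$ is the population mean of diff.
The test statistic is given by
$t = \dfrac{\bar{d}}{s_d / \sqrt{n}}$
where $\bar{d}$ is the mean of diff,
$s_d^2$ is the variance of diff,
and $n$ is the size of diff (the number of non-missing differences).
Under $H_0$, $t$ follows a t distribution with $n - 1$ degrees of freedom; the two-sided p-value is calculated from this $t$ using a table or software.
Reject $H_0$ if the p-value < 0.05.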
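The steps in (a) can be sketched in R as follows. This is a minimal sketch, not the full analysis: it uses ten consecutive rows copied from the table in the question (data rows 7 to 16) so that it runs without the file, and the column names height and armspan are assumed from the table header. For the real answer you would read armspanSpring2020.csv with read.csv() instead.

```r
# Excerpt: ten consecutive rows from the table in the question (data rows 7-16).
# For the full analysis, read the file instead (column names are assumed
# to match the table header):
#   dat <- read.csv("armspanSpring2020.csv")
dat <- data.frame(
  height  = c(62, 72, 71, 63, 69, 67, 63, 60, 66, 61),
  armspan = c(61, 74, 68, 60, 69, 68, 63, 60, 66, 61)
)

# New variable requested in part (a)
dat$diff <- dat$armspan - dat$height

# One-sample t-test of H0: mean(diff) = 0 against a two-sided alternative
res_a <- t.test(dat$diff, mu = 0)

# The same statistic by hand: mean / (sd / sqrt(n)), dropping missing values
d <- dat$diff[!is.na(dat$diff)]
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))
p_manual <- 2 * pt(-abs(t_manual), df = length(d) - 1)

print(res_a)  # reports t, df and the p-value
```

The printed output contains the test statistic (t), the degrees of freedom and the p-value; compare the p-value with 0.05 to reach the conclusion.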
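b)
The paired t-test follows the same procedure as in (a): it is computed on the within-person differences armspan - height, so it gives the same test statistic and p-value as the one-sample t-test on diff.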
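A short R sketch can confirm that the paired test and the one-sample test on the differences agree. As an assumption for illustration, it uses ten consecutive rows copied from the table in the question (data rows 7 to 16) rather than the full file, and the variable names res_paired and res_one are made up here.

```r
# Excerpt: ten consecutive rows from the table in the question (data rows 7-16).
# Replace with dat <- read.csv("armspanSpring2020.csv") for the full analysis.
dat <- data.frame(
  height  = c(62, 72, 71, 63, 69, 67, 63, 60, 66, 61),
  armspan = c(61, 74, 68, 60, 69, 68, 63, 60, 66, 61)
)

# Two-sample call with paired = TRUE, as requested in part (b)
res_paired <- t.test(dat$armspan, dat$height, paired = TRUE)

# One-sample test on the differences, as in part (a)
res_one <- t.test(dat$armspan - dat$height, mu = 0)

# Both calls report the same t statistic and p-value
print(c(paired = unname(res_paired$statistic),
        one_sample = unname(res_one$statistic)))
print(c(paired = res_paired$p.value, one_sample = res_one$p.value))
```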
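c)
In the two-sample t-test, the procedure is as follows.
Let $\bar{x}_1$ be the mean of heights,
$\bar{x}_2$ be the mean of armspans,
$s_1^2$ be the variance of heights,
$s_2^2$ be the variance of armspans.
Here $n_1 = n_2 = n$, i.e. the sample size is the same for both samples (provided rows with a missing value are dropped).
We test $H_0: \mu_1 = \mu_2$ against $H_1: \mu_1 \neq \mu_2$.
The test statistic is given by
$t = \dfrac{\bar{x}_2 - \bar{x}_1}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$
The p-value is determined by the same procedure as above, with the degrees of freedom given by the Welch-Satterthwaite approximation.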
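For part (c), a minimal R sketch of the unequal-variance (Welch) test, checked against the hand formula. It assumes ten consecutive rows copied from the table in the question (data rows 7 to 16) in place of the full file, and the names a, h, t_manual and df_welch are illustrative only.

```r
# Excerpt: ten consecutive rows from the table in the question (data rows 7-16).
# Replace with dat <- read.csv("armspanSpring2020.csv") for the full analysis.
dat <- data.frame(
  height  = c(62, 72, 71, 63, 69, 67, 63, 60, 66, 61),
  armspan = c(61, 74, 68, 60, 69, 68, 63, 60, 66, 61)
)

# Welch two-sample t-test: paired = FALSE, var.equal = FALSE
res_c <- t.test(dat$armspan, dat$height, paired = FALSE, var.equal = FALSE)

# By hand; each sample drops its own missing values, as t.test() does
a <- dat$armspan[!is.na(dat$armspan)]
h <- dat$height[!is.na(dat$height)]
se2_a <- var(a) / length(a)
se2_h <- var(h) / length(h)
t_manual <- (mean(a) - mean(h)) / sqrt(se2_a + se2_h)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_a + se2_h)^2 /
  (se2_a^2 / (length(a) - 1) + se2_h^2 / (length(h) - 1))

print(res_c)
```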
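d)
When the variances are assumed to be equal, the test statistic becomes (with subscript 1 for heights and 2 for armspans)
$t = \dfrac{\bar{x}_2 - \bar{x}_1}{s_p \sqrt{1/n_1 + 1/n_2}}$
Where,
$s_p^2 = \dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$
The rest of the procedure is the same as above, with $n_1 + n_2 - 2$ degrees of freedom.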
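For part (d), a minimal R sketch of the pooled-variance test together with the hand calculation. As before, it assumes ten consecutive rows copied from the table in the question (data rows 7 to 16) instead of the full file, and the names a, h, sp2 and t_manual are illustrative.

```r
# Excerpt: ten consecutive rows from the table in the question (data rows 7-16).
# Replace with dat <- read.csv("armspanSpring2020.csv") for the full analysis.
dat <- data.frame(
  height  = c(62, 72, 71, 63, 69, 67, 63, 60, 66, 61),
  armspan = c(61, 74, 68, 60, 69, 68, 63, 60, 66, 61)
)

# Pooled two-sample t-test: paired = FALSE, var.equal = TRUE
res_d <- t.test(dat$armspan, dat$height, paired = FALSE, var.equal = TRUE)

# By hand: pooled variance, then the t statistic
a <- dat$armspan[!is.na(dat$armspan)]
h <- dat$height[!is.na(dat$height)]
n_a <- length(a)
n_h <- length(h)
sp2 <- ((n_a - 1) * var(a) + (n_h - 1) * var(h)) / (n_a + n_h - 2)
t_manual <- (mean(a) - mean(h)) / sqrt(sp2 * (1 / n_a + 1 / n_h))

print(res_d)
```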
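e)
In my view, if the height and the armspan in each row belong to the same person, then the paired t-test is the best method for testing this hypothesis (the one-sample t-test on diff in (a) is the same test).
This is because, for a given person, height and armspan are related to each other, i.e. the correlation between height and armspan is strongly positive, so the two samples are not independent, which the two-sample tests in (c) and (d) require.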
Note:
In R, t.test() is part of the stats package that ships with base R, so no add-on package is needed to run these tests. You can also use arithmetic operations to calculate the test statistic by hand.
Add-on packages such as ggplot2 are only needed for extras like graphics.
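Part (g) of the question can be approached with which(), as its hint suggests. The sketch below is only one possible rule, not a definitive cleaning procedure: the 15-inch gap and the 48-inch minimum height are arbitrary thresholds assumed here for illustration, and the data are the first eight rows of the table in the question, so that row numbers match the file.

```r
# Excerpt: the first eight rows of the table in the question.
# Replace with dat <- read.csv("armspanSpring2020.csv") for the full analysis.
dat <- data.frame(
  height  = c(67, 70, 64, 71, 72, 62, 72, 71),
  armspan = c(NA, 40, 67, 70, 49, 61, 74, 68)
)

# Flag rows with a missing armspan, an implausibly small height, or a large
# gap between armspan and height (thresholds are assumptions, in inches)
flagged <- which(is.na(dat$armspan) |
                 dat$height < 48 |
                 abs(dat$armspan - dat$height) > 15)

# Table of the suspicious observations, labelled by row number
print(dat[flagged, ])

# Scatterplot with the line armspan = height; flagged rows are drawn in red
# (rows with a missing value cannot be drawn and are silently skipped)
plot(dat$height, dat$armspan, xlab = "height", ylab = "armspan")
abline(a = 0, b = 1, lty = 2)
points(dat$height[flagged], dat$armspan[flagged], col = "red", pch = 19)
text(dat$height[flagged], dat$armspan[flagged], labels = flagged, pos = 3)

# In an interactive session, identify(dat$height, dat$armspan) lets you
# click on points to obtain their row numbers instead.
```

For this excerpt the rule flags rows 1, 2 and 5: row 1 has a missing armspan, and rows 2 and 5 have an armspan far shorter than the height.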