Use the following data to create the contingency tables.
AGE
Male
16 17 17 19 19 19 18 17 18 17 16 19 19 19 17 16 17 16 19 19 24 31 23 44 21 42 23 43 43 33 30 41 35 40 24 43 22 30 25 32
43 51 55 80 61 58 65 52 67 75 90 63 71 74
Female
17 16 17 19 19 18 17 19 16 18 19 17 19 17 18 19 19 16 33 23 46 46 23 21 46 47 48 47 48 30 35 24 48 49 47 25 84 54 77 63 51 72 90 57 69 81
1. In the first table, use gender (male and female) as your row variable and age (<20, 20-50, and >50) for
your column variable. Run a Chi-square test of independence and find the test statistic, p-value, and
degrees of freedom.
2. In the second table, use gender (male and female) as your row variable and age (<18, 18-25, 26-45,
and >45) for your column variable. Run a Chi-square test of independence and find the test statistic,
p-value, and degrees of freedom.
3. Compare the results and comment on problems that may occur when categorizing continuous
variables.
H0: The variables gender and age are independent.
H1: The variables gender and age are not independent.
(H0: Null hypothesis; H1: Alternative hypothesis)
1.
Contingency table (Observed frequencies):
Gender | <20 | 20-50 | >50 | Total |
Male | 20 | 21 | 13 | 54 |
Female | 18 | 18 | 10 | 46 |
Total | 38 | 39 | 23 | N =100 |
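The observed counts above can be reproduced from the raw ages with a short tally. The following is a minimal Python sketch: the helper name `bin_counts` and the choice to describe each bin by its inclusive upper bound are illustrative assumptions, and ages are assumed to be whole numbers.

```python
from bisect import bisect_left

# Ages copied from the problem statement.
male = [16, 17, 17, 19, 19, 19, 18, 17, 18, 17, 16, 19, 19, 19, 17, 16, 17,
        16, 19, 19, 24, 31, 23, 44, 21, 42, 23, 43, 43, 33, 30, 41, 35, 40,
        24, 43, 22, 30, 25, 32, 43, 51, 55, 80, 61, 58, 65, 52, 67, 75, 90,
        63, 71, 74]
female = [17, 16, 17, 19, 19, 18, 17, 19, 16, 18, 19, 17, 19, 17, 18, 19, 19,
          16, 33, 23, 46, 46, 23, 21, 46, 47, 48, 47, 48, 30, 35, 24, 48, 49,
          47, 25, 84, 54, 77, 63, 51, 72, 90, 57, 69, 81]


def bin_counts(ages, upper_bounds):
    """Tally ages into bins, given the inclusive upper bound of every bin
    except the last one, which is open-ended (hypothetical helper)."""
    counts = [0] * (len(upper_bounds) + 1)
    for age in ages:
        # bisect_left returns the index of the first bound that is >= age.
        counts[bisect_left(upper_bounds, age)] += 1
    return counts


# Bins <20, 20-50, >50 correspond to inclusive upper bounds 19 and 50.
observed = [bin_counts(male, [19, 50]), bin_counts(female, [19, 50])]
for label, row in zip(("Male", "Female"), observed):
    print(label, row, "total:", sum(row))
```

Passing `[17, 25, 45]` as the bounds instead gives the counts for the groups <18, 18-25, 26-45, and >45.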
Expected frequency, E = (corresponding row total × corresponding column total) / N
Contingency table (Expected frequencies):
Gender | <20 | 20-50 | >50 | Total |
Male | 54*38/100 =20.52 | 54*39/100 =21.06 | 54*23/100 =12.42 | 54 |
Female | 46*38/100 =17.48 | 46*39/100 =17.94 | 46*23/100 =10.58 | 46 |
Total | 38 | 39 | 23 | N =100 |
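The expected frequencies follow mechanically from the row and column totals of the observed table. A minimal Python sketch of that calculation, with illustrative variable names:

```python
# Observed counts: rows are Male, Female; columns are <20, 20-50, >50.
observed = [[20, 21, 13],
            [18, 18, 10]]

row_totals = [sum(row) for row in observed]        # [54, 46]
col_totals = [sum(col) for col in zip(*observed)]  # [38, 39, 23]
n = sum(row_totals)                                # 100

# E = (row total * column total) / N for every cell.
expected = [[r * c / n for c in col_totals] for r in row_totals]

for label, row in zip(("Male", "Female"), expected):
    print(label, [round(e, 2) for e in row])
```

Each row of expected counts sums to the same row total as the observed table, which is a quick sanity check on the arithmetic.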
Test statistic ($\chi^2$):
Sl. No. | Observed frequency: O | Expected frequency: E | $(O-E)^2/E$ |
1. | 20 | 20.52 | $(20-20.52)^2/20.52 = 0.0132$ |
2. | 18 | 17.48 | $(18-17.48)^2/17.48 = 0.0155$ |
3. | 21 | 21.06 | 0.0002 |
4. | 18 | 17.94 | 0.0002 |
5. | 13 | 12.42 | 0.0271 |
6. | 10 | 10.58 | 0.0318 |
Total | 100 | 100 | 0.088 |
The test statistic, $\chi^2 = \sum (O-E)^2/E = 0.088$
Degrees of freedom, df = (r-1)(c-1) = (2-1)(3-1) = 1(2) = 2
For the test statistic $\chi^2 = 0.088$ at df = 2, the p-value = 0.956954
The p-value is far above the usual significance levels (0.01, 0.05, and 0.10), so we cannot reject the null hypothesis that "the variables gender and age are independent".
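For df = 2 the upper-tail probability of the chi-square distribution has the closed form exp(-χ²/2), so the p-value can be checked without tables or a statistics package. A minimal Python sketch using only the standard library; the variable names are illustrative, and the shortcut for the p-value applies only when df is exactly 2.

```python
import math

observed = [[20, 21, 13],
            [18, 18, 10]]
expected = [[20.52, 21.06, 12.42],
            [17.48, 17.94, 10.58]]

# Sum (O - E)^2 / E over all six cells.
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# df = (rows - 1) * (columns - 1)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form for the chi-square upper tail, valid only for df == 2.
p_value = math.exp(-chi2 / 2)

print(f"chi2 = {chi2:.4f}, df = {df}, p = {p_value:.4f}")
```

The unrounded statistic is slightly below 0.088, so the computed p-value differs from 0.956954 in the fifth decimal place; both round to about 0.957.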
2.
Contingency table for observed frequencies:
Gender | <18 | 18-25 | 26-45 | >45 | Total |
Male | 10 | 17 | 14 | 13 | 54 |
Female | 8 | 15 | 3 | 20 | 46 |
Total | 18 | 32 | 17 | 33 | N =100 |
Contingency table for expected frequencies:
Gender | <18 | 18-25 | 26-45 | >45 | Total |
Male | 9.72 | 17.28 | 9.18 | 17.82 | 54 |
Female | 8.28 | 14.72 | 7.82 | 15.18 | 46 |
Total | 18 | 32 | 17 | 33 | N =100 |
Test statistic ($\chi^2$):
Sl. No. | Observed frequency: O | Expected frequency: E | $(O-E)^2/E$ |
1. | 10 | 9.72 | 0.0081 |
2. | 8 | 8.28 | 0.0095 |
3. | 17 | 17.28 | 0.0045 |
4. | 15 | 14.72 | 0.0053 |
5. | 14 | 9.18 | 2.5308 |
6. | 3 | 7.82 | 2.9709 |
7. | 13 | 17.82 | 1.3037 |
8. | 20 | 15.18 | 1.5305 |
Total | 100 | 100 | 8.3633 |
The test statistic, $\chi^2 = \sum (O-E)^2/E = 8.3633$
Degrees of freedom, df = (r-1)(c-1) = (2-1)(4-1) = 1(3) = 3
For the test statistic $\chi^2 = 8.3633$ at df = 3, the p-value = 0.039071
The p-value is below 0.05 and 0.10, so at those significance levels we reject the null hypothesis that "the variables gender and age are independent".
(At the 0.01 significance level, however, the p-value of 0.039071 exceeds 0.01, so we cannot reject the null hypothesis.)
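For df = 3 the chi-square upper tail can also be written in closed form, p = erfc(√(χ²/2)) + √(2χ²/π)·exp(-χ²/2), which allows the result to be checked with the standard library alone. A minimal Python sketch; the variable names are illustrative, and this formula applies only when df is exactly 3.

```python
import math

# Observed counts: rows are Male, Female; columns are <18, 18-25, 26-45, >45.
observed = [[10, 17, 14, 13],
            [8, 15, 3, 20]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Closed form for the chi-square upper tail, valid only for df == 3.
p_value = (math.erfc(math.sqrt(chi2 / 2))
           + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

print(f"chi2 = {chi2:.4f}, df = {df}, p = {p_value:.4f}")
print("reject at 0.05:", p_value < 0.05, "| reject at 0.01:", p_value < 0.01)
```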
3.
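The two tables lead to opposite conclusions. With three age groups the p-value is 0.956954 and the null hypothesis is not rejected; with four age groups the p-value is 0.039071 and the null hypothesis is rejected at the 5% and 10% significance levels.
The only thing that changed is where the continuous variable (age in this case) was cut. The single 20-50 group in the first table hides the imbalance that the second table exposes in the 26-45 group (14 males against 3 females).
Choosing cut points is therefore a major problem when categorizing continuous variables. The choice depends on the researcher and on what they want to determine, there is rarely one obviously correct set of cut points, and different cut points may produce contradictory conclusions, as above.
Another problem is the loss of information: once ages are grouped, every age within a group is treated as identical, so a 20-year-old and a 50-year-old are indistinguishable in the first table.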
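The sensitivity to cut points can be checked by running the same test on the raw ages under both groupings. The following Python sketch is illustrative only: the helpers `bin_counts`, `chi2_sf`, and `independence_test` are hypothetical names, the upper-tail probability is computed from the standard closed forms for integer degrees of freedom, ages are assumed to be whole numbers, and every age group is assumed to contain at least one observation.

```python
import math
from bisect import bisect_left

# Ages copied from the problem statement.
male = [16, 17, 17, 19, 19, 19, 18, 17, 18, 17, 16, 19, 19, 19, 17, 16, 17,
        16, 19, 19, 24, 31, 23, 44, 21, 42, 23, 43, 43, 33, 30, 41, 35, 40,
        24, 43, 22, 30, 25, 32, 43, 51, 55, 80, 61, 58, 65, 52, 67, 75, 90,
        63, 71, 74]
female = [17, 16, 17, 19, 19, 18, 17, 19, 16, 18, 19, 17, 19, 17, 18, 19, 19,
          16, 33, 23, 46, 46, 23, 21, 46, 47, 48, 47, 48, 30, 35, 24, 48, 49,
          47, 25, 84, 54, 77, 63, 51, 72, 90, 57, 69, 81]


def bin_counts(ages, upper_bounds):
    """Tally ages using inclusive upper bounds; the last bin is open-ended."""
    counts = [0] * (len(upper_bounds) + 1)
    for age in ages:
        counts[bisect_left(upper_bounds, age)] += 1
    return counts


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution (integer df)."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    # Odd df: erfc(sqrt(x/2)) plus half-integer correction terms.
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


def independence_test(upper_bounds):
    """Return (chi2, df, p) for gender by age group under the given cuts."""
    observed = [bin_counts(group, upper_bounds) for group in (male, female)]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for r, obs_row in zip(row_totals, observed):
        for c, o in zip(col_totals, obs_row):
            e = r * c / n  # expected count; assumes no empty age group
            chi2 += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(col_totals) - 1)
    return chi2, df, chi2_sf(chi2, df)


for cuts in ([19, 50], [17, 25, 45]):
    chi2, df, p = independence_test(cuts)
    print(f"cuts {cuts}: chi2 = {chi2:.4f}, df = {df}, p = {p:.4f}")
```

Calling `independence_test` with other bounds shows how far the p-value can move on the same 100 observations, which is the practical risk of choosing cut points after looking at the data.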