Question 2: A study was undertaken to compare the waste-generating behaviour of residents in four remote, isolated communities: Pétaouchnok, Malakazoo, Erehwon, and Naschmere. 20 households were randomly selected from each of these communities, and the average daily garbage output measured over a specified period of time. The data obtained is shown in the table below (values in kg/day of waste per capita).
a) Calculate the grand mean, as well as the sample mean and variance for each town, for average per capita waste in kg/day.
b) At LOC = 95%, what would you conclude about whether or not there is any difference in garbage generation rates across these four towns? Use the critical-value method.
c) Using the p-value method, determine if your conclusion from Part (a) would be different for any common values of LOC.
S. No | Pétaouchnok | Malakazoo | Erehwon | Naschmere |
1 | 2.3 | 4.5 | 1.1 | 1.7 |
2 | 3.3 | 3.0 | 4.1 | 1.1 |
3 | 3.3 | 2.3 | 2.0 | 0.0 |
4 | 4.4 | 1.8 | 3.9 | 3.1 |
5 | 2.6 | 3.3 | 2.1 | 3.9 |
6 | 3.3 | 5.2 | 2.7 | 3.4 |
7 | 5.3 | 3.7 | 5.0 | 3.3 |
8 | 0.0 | 3.1 | 2.6 | 2.9 |
9 | 3.7 | 2.2 | 4.3 | 3.1 |
10 | 2.5 | 5.0 | 5.4 | 3.1 |
11 | 1.9 | 5.0 | 2.7 | 1.3 |
12 | 2.9 | 1.8 | 0.7 | 1.7 |
13 | 1.0 | 3.0 | 2.8 | 1.8 |
14 | 4.4 | 4.6 | 2.8 | 1.9 |
15 | 3.6 | 4.0 | 3.1 | 2.1 |
16 | 3.7 | 1.7 | 4.1 | 3.2 |
17 | 3.4 | 3.1 | 3.7 | 2.6 |
18 | 4.0 | 3.2 | 4.8 | 1.9 |
19 | 2.5 | 2.5 | 3.0 | 2.7 |
20 | 3.7 | 1.7 | 3.7 | 2.9 |
The table above lists the data given in the question.
A sample of 20 households was taken from each of the 4 remote, isolated communities. The average daily garbage output of each of these households was measured over a period of time.
a.
The grand mean is the sum of all 80 values divided by 80.
The grand mean = 238.8 / 80 = 2.985
Instead of the complicated town names, I will call the towns 1, 2, 3 and 4 (in the column order of the table) and use the same number as a subscript to denote the mean or variance of that town.
Sample mean of town 1 = sum of all values of town 1 divided by 20
= 61.8 / 20 = 3.09
Sample variance of town 1 = sum of squared distances between the values and the town mean, divided by n-1 = 19.
= 27.678 / 19 = 1.456737
Sample mean of town 2 = sum of all values of town 2 divided by 20
= 64.7 / 20 = 3.235
Sample variance of town 2 = sum of squared distances between the values and the town mean, divided by n-1 = 19.
= 25.8255 / 19 = 1.359237
Sample mean of town 3 = sum of all values of town 3 divided by 20
= 64.6 / 20 = 3.23
Sample variance of town 3 = sum of squared distances between the values and the town mean, divided by n-1 = 19.
= 28.982 / 19 = 1.525368
Sample mean of town 4 = sum of all values of town 4 divided by 20
= 47.7 / 20 = 2.385
Sample variance of town 4 = sum of squared distances between the values and the town mean, divided by n-1 = 19.
= 17.3855 / 19 = 0.9150263
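The Part (a) arithmetic can be checked with a short script. This is only a sketch using Python's standard `statistics` module; the dictionary name `towns` and the keys 1 to 4 are my own shorthand, following the town numbering used in this answer. Note that `statistics.variance` divides by n-1, which is the divisor that produces the variance values 1.456737, 1.359237, 1.525368 and 0.9150263.

```python
from statistics import mean, variance

# Data copied from the table (kg/day of waste per capita).
# Keys 1-4 are shorthand for Pétaouchnok, Malakazoo, Erehwon, Naschmere.
towns = {
    1: [2.3, 3.3, 3.3, 4.4, 2.6, 3.3, 5.3, 0.0, 3.7, 2.5,
        1.9, 2.9, 1.0, 4.4, 3.6, 3.7, 3.4, 4.0, 2.5, 3.7],
    2: [4.5, 3.0, 2.3, 1.8, 3.3, 5.2, 3.7, 3.1, 2.2, 5.0,
        5.0, 1.8, 3.0, 4.6, 4.0, 1.7, 3.1, 3.2, 2.5, 1.7],
    3: [1.1, 4.1, 2.0, 3.9, 2.1, 2.7, 5.0, 2.6, 4.3, 5.4,
        2.7, 0.7, 2.8, 2.8, 3.1, 4.1, 3.7, 4.8, 3.0, 3.7],
    4: [1.7, 1.1, 0.0, 3.1, 3.9, 3.4, 3.3, 2.9, 3.1, 3.1,
        1.3, 1.7, 1.8, 1.9, 2.1, 3.2, 2.6, 1.9, 2.7, 2.9],
}

# Grand mean: all 80 values pooled together.
all_values = [x for values in towns.values() for x in values]
grand_mean = mean(all_values)

# Per-town sample mean and sample variance (n - 1 divisor).
town_means = {k: mean(v) for k, v in towns.items()}
town_vars = {k: variance(v) for k, v in towns.items()}

print(f"grand mean = {grand_mean:.3f}")
for k in towns:
    print(f"town {k}: mean = {town_means[k]:.3f}, variance = {town_vars[k]:.6f}")
```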
b.
The level of confidence is 95%. We must find out if there is a significant difference between the means of the four towns.
We will find the SSB and the SSW and divide each by its degrees of freedom; the ratio of these two mean squares is the F statistic. We will compare this F statistic with the F critical value and thus decide if there is a significant difference or not.
Our Null Hypothesis = H0 : μ1 = μ2 = μ3 = μ4 (the mean daily garbage output per capita is the same in all four towns)
Alternate Hypothesis = Ha : at least one of the means is not equal to the others (that is, the mean daily garbage output of at least one town differs).
When testing hypotheses, it is important to understand that a difference must be statistically significant before we draw conclusions from it. Most analyses are performed on sample data, and a sample almost never represents the population (and its parameters) exactly. Hence, we must account for variability and ask whether the differences between (in our case) the sample means can be explained by the inherent noise within the groups (measured by SSW) or reflect a real difference across groups (measured by SSB).
SSW = sum of squares within groups
SSB = sum of squares between groups
SSW = sum of squared differences between each value and its own group mean
= 27.678 + 25.8255 + 28.982 + 17.3855 (group 1 SSW + group 2 SSW + group 3 SSW + group 4 SSW)
= 99.871
SSB = group size (20) times the sum of squared differences between each group mean and the grand mean
= 20 * (0.011025 + 0.0625 + 0.060025 + 0.36)
= 20 * 0.49355 = 9.871
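The two sums of squares can be reproduced directly from the raw data. The following is a minimal sketch in plain Python; the variable names (`towns`, `group_means`, `ssw`, `ssb`) are my own and the groups are listed in the column order of the table.

```python
# Data copied from the table (kg/day of waste per capita), one list per town.
towns = [
    [2.3, 3.3, 3.3, 4.4, 2.6, 3.3, 5.3, 0.0, 3.7, 2.5,
     1.9, 2.9, 1.0, 4.4, 3.6, 3.7, 3.4, 4.0, 2.5, 3.7],
    [4.5, 3.0, 2.3, 1.8, 3.3, 5.2, 3.7, 3.1, 2.2, 5.0,
     5.0, 1.8, 3.0, 4.6, 4.0, 1.7, 3.1, 3.2, 2.5, 1.7],
    [1.1, 4.1, 2.0, 3.9, 2.1, 2.7, 5.0, 2.6, 4.3, 5.4,
     2.7, 0.7, 2.8, 2.8, 3.1, 4.1, 3.7, 4.8, 3.0, 3.7],
    [1.7, 1.1, 0.0, 3.1, 3.9, 3.4, 3.3, 2.9, 3.1, 3.1,
     1.3, 1.7, 1.8, 1.9, 2.1, 3.2, 2.6, 1.9, 2.7, 2.9],
]

grand_mean = sum(sum(g) for g in towns) / sum(len(g) for g in towns)
group_means = [sum(g) / len(g) for g in towns]

# SSW: squared distance of every value from its own group mean.
ssw = sum((x - m) ** 2 for g, m in zip(towns, group_means) for x in g)

# SSB: squared distance of each group mean from the grand mean,
# weighted by the group size.
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(towns, group_means))

print(f"SSW = {ssw:.3f}, SSB = {ssb:.3f}")
```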
Say there are m groups (m = 4) and n members in each group (n = 20).
df (SSB) = m-1 = 4-1 = 3
df (SSW) = m(n-1) = 4(20-1) = 76
Thus, F statistic = (SSB / df SSB) / (SSW / df SSW)
= (9.871/ 3) / (99.871/76)
= 3.290333/ 1.314092
= 2.503883
Therefore, the F statistic is 2.503883
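As a quick check of the division above, here is a small sketch that turns the two sums of squares into the F statistic. The helper name `f_statistic` is hypothetical (it is not a library function), and it assumes equal group sizes, as in this problem.

```python
def f_statistic(ssb, ssw, m, n):
    """One-way ANOVA F statistic for m groups of n members each.

    Returns the F statistic together with both degrees of freedom.
    """
    df_between = m - 1          # numerator degrees of freedom
    df_within = m * (n - 1)     # denominator degrees of freedom
    msb = ssb / df_between      # mean square between groups
    msw = ssw / df_within       # mean square within groups
    return msb / msw, df_between, df_within


# Sums of squares taken from the calculation above.
f_stat, df1, df2 = f_statistic(ssb=9.871, ssw=99.871, m=4, n=20)
print(f"F = {f_stat:.6f} with df = ({df1}, {df2})")
```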
The F critical value at 0.05 significance level and df of 3 and 76 is 2.73 ( F table value)
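Since the F statistic (2.503883) is less than the F critical value (2.73), we fail to reject the null hypothesis. At LOC = 95%, there is not enough evidence to conclude that the average daily garbage output differs across the 4 towns. (Failing to reject the null hypothesis does not prove that the means are equal.)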
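If no F table is at hand, the critical value can be approximated numerically. The sketch below is one way to do it with only the standard library; `f_sf` and `f_critical` are hypothetical helper names of my own. It relies on the identity P(F > f) = I_x(d2/2, d1/2) with x = d2 / (d2 + d1*f), where I_x is the regularized incomplete beta function, and it assumes d1 >= 2, d2 > 2 and f > 0 so that the integrand stays finite.

```python
import math


def f_sf(f, d1, d2, steps=2000):
    """Upper-tail probability P(F > f) for an F(d1, d2) distribution.

    Evaluates I_x(d2/2, d1/2), x = d2 / (d2 + d1*f), by composite
    Simpson integration of the beta density on [0, x].
    Assumes d1 >= 2, d2 > 2, f > 0 and an even number of steps.
    """
    a, b = d2 / 2.0, d1 / 2.0
    x = d2 / (d2 + d1 * f)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        if t <= 0.0:
            return 0.0  # valid because a > 1
        return math.exp((a - 1) * math.log(t)
                        + (b - 1) * math.log(1.0 - t) - log_beta)

    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0


def f_critical(alpha, d1, d2, lo=1.0, hi=10.0):
    """Bisection for the f with P(F > f) = alpha (assumed to lie in [lo, hi])."""
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if f_sf(mid, d1, d2) > alpha:
            lo = mid  # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


f_crit = f_critical(alpha=0.05, d1=3, d2=76)
f_stat = 2.503883  # F statistic computed above
print(f"critical value = {f_crit:.3f}, reject H0: {f_stat > f_crit}")
```

With statistical software installed, the same number is usually a one-liner; for example, SciPy exposes it as `scipy.stats.f.ppf(0.95, 3, 76)`.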
c.
The p-value of our F statistic can be bracketed by locating 2.503883 in the F table (df of 3 and 76).
The F table does not list a denominator df of 76, so we use numerator df = 3 and the closest tabulated denominator df, which is 60. In that row, the F value for an upper-tail probability of 0.05 is about 2.76, and the F value for an upper-tail probability of 0.10 is about 2.18. Our F statistic of 2.503883 lies between these two values.
Thus, we can say that the p-value of the F statistic is above 0.05, but less than 0.10.
(You can confirm this using statistical software. The p-value is about 0.065.)
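Thus, at the 5% level of significance (LOC = 95%) we fail to reject the null hypothesis, and the same holds at the 1% level (LOC = 99%), since the p-value is larger than both. At the 10% level of significance (LOC = 90%), the p-value is smaller than 0.10, so we reject the null hypothesis. The conclusion from Part (b) therefore changes only at LOC = 90%.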
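The p-value itself can also be approximated without a table. This is a rough sketch using only the standard library; `f_sf` is a hypothetical helper of my own that evaluates the upper tail of the F distribution through the identity P(F > f) = I_x(d2/2, d1/2) with x = d2 / (d2 + d1*f), integrated numerically. It assumes d1 >= 2, d2 > 2 and f > 0.

```python
import math


def f_sf(f, d1, d2, steps=2000):
    """Upper-tail probability P(F > f) for an F(d1, d2) distribution.

    Composite Simpson integration of the beta(d2/2, d1/2) density on
    [0, x] with x = d2 / (d2 + d1*f). Assumes d1 >= 2, d2 > 2, f > 0
    and an even number of steps.
    """
    a, b = d2 / 2.0, d1 / 2.0
    x = d2 / (d2 + d1 * f)
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def density(t):
        if t <= 0.0:
            return 0.0  # valid because a > 1
        return math.exp((a - 1) * math.log(t)
                        + (b - 1) * math.log(1.0 - t) - log_beta)

    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return total * h / 3.0


# F statistic and degrees of freedom from Part (b).
p_value = f_sf(2.503883, 3, 76)

# Decision at the common levels of confidence 90%, 95% and 99%:
# reject H0 when the p-value is below the significance level.
decisions = {alpha: p_value < alpha for alpha in (0.10, 0.05, 0.01)}

print(f"p-value = {p_value:.4f}")
for alpha, reject in decisions.items():
    verdict = "reject H0" if reject else "fail to reject H0"
    print(f"LOC = {100 * (1 - alpha):.0f}%: {verdict}")
```

With SciPy available, `scipy.stats.f.sf(2.503883, 3, 76)` should give the same tail probability, and `scipy.stats.f_oneway` runs the whole test from the four raw samples.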