Dear student, please comment in case of any doubt and I would be happy to clarify it.
Analysis of variance (ANOVA) is a statistical technique that is used to check if the means of two or more groups are significantly different from each other by analyzing comparisons of variance estimates. ANOVA checks the impact of one or more factors by comparing the means of different samples.
When we have only two samples, the t-test and ANOVA give the same result. However, a t-test is not reliable when there are more than two samples: conducting multiple t-tests to compare them has a compounding effect on the type 1 error.
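To see how quickly this compounding grows, here is a minimal Python sketch (assuming the pairwise comparisons are independent and each is tested at a 0.05 significance level; real t-tests on shared data are not fully independent, so this is only an approximation):

```python
# Family-wise type 1 error when every pair of groups gets its own t-test.
# Assumes independent comparisons, each tested at alpha = 0.05 (an approximation).
alpha = 0.05
for groups in (2, 3, 4, 5):
    m = groups * (groups - 1) // 2           # number of pairwise t-tests
    familywise = 1 - (1 - alpha) ** m        # P(at least one false positive)
    print(f"{groups} groups -> {m} t-tests, family-wise error ~ {familywise:.3f}")
```

With 3 groups the family-wise error is already about 14%, and with 5 groups it is about 40%, which is why ANOVA is preferred when comparing more than two samples.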
Assumptions in ANOVA
1) Assumption of Randomness: The samples should be selected in a random way such that there is no dependence among the samples.
2) The experimental errors of the data are normally distributed.
3) Assumption of equality of variance (homoscedasticity) and zero correlation: The variance should be the same in all groups and the covariances among them should be zero, even though the means may vary from group to group. (A quick sketch of how these assumptions are commonly checked is given below.)
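In practice, normality and equality of variance are commonly checked with the Shapiro-Wilk and Levene tests, while randomness has to come from the study design itself. A minimal sketch using scipy, with made-up illustration data for three groups:

```python
# Quick checks of the ANOVA assumptions using scipy.
# The three groups below are made-up illustration data, not taken from this article.
from scipy import stats

group1 = [3, 5, 4, 6]
group2 = [11, 10, 12, 9]
group3 = [16, 21, 17, 19]

# Normality within each group (Shapiro-Wilk; a large p-value gives no evidence against normality)
for i, g in enumerate((group1, group2, group3), start=1):
    stat, p = stats.shapiro(g)
    print(f"group {i}: Shapiro-Wilk p = {p:.3f}")

# Equal variances across groups (Levene's test; a large p-value gives no evidence against homoscedasticity)
stat, p = stats.levene(group1, group2, group3)
print(f"Levene's test p = {p:.3f}")
```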
One Way ANOVA
When we compare groups based on only one factor variable, it is said to be a one-way analysis of variance (ANOVA).
For example, we may want to test whether or not the mean output of three workers is the same, based on the working hours of the three workers.
The ANOVA model:
Mathematically, ANOVA can be written as:
xij = μi + εij
where xij is an individual data point (i and j denote the group and the individual observation), ε is the unexplained variation, and the parameters of the model (μ) are the population means of each group. Thus, each data point (xij) is its group mean plus error.
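To make the model concrete, here is a small Python sketch that simulates data exactly the way the one-way model describes it: each observation is its group's population mean plus random error (the group means and error spread below are arbitrary illustration values, not from this example):

```python
# Simulate x_ij = mu_i + eps_ij: each observation is its group mean plus random noise.
# The group means and the noise scale are arbitrary illustration values.
import numpy as np

rng = np.random.default_rng(0)
group_means = [4.0, 11.0, 18.0]                               # mu_i for three hypothetical groups
data = {f"group {i + 1}": mu + rng.normal(0.0, 1.5, size=5)   # eps_ij ~ N(0, 1.5^2)
        for i, mu in enumerate(group_means)}

for name, observations in data.items():
    print(name, np.round(observations, 2))
```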
Let’s understand the working procedure of one-way ANOVA with an example:
Sample (k) | 1   | 2   | 3   | Mean
1          | x11 | x12 | x13 | Xm1
2          | x21 | x22 | x23 | Xm2
3          | x31 | x32 | x33 | Xm3
4          | x41 | x42 | x43 | Xm4
Suppose we are given the above data set: there are 4 samples of the variable x, each with 3 observations, and each sample has its respective mean (Xm1, ..., Xm4) as shown in the last column.
Grand Mean
Mean is a simple or arithmetic average of a range of values. There are two kinds of means that we use in ANOVA calculations, which are separate sample means and the grand mean.
The grand mean (Xgm) is the mean of sample means or the mean of all observations combined, irrespective of the sample.
Xgm = (Xm1 + Xm2 + Xm3 + ... + Xmk)/k, where k is the number of samples
For our dataset, k = 4
Xgm = (Xm1 + Xm2 + Xm3 + Xm4)/4
Between Group Variability (SST)
It refers to the variation between the distributions of the individual groups (or levels), i.e., how far the group means lie from one another.
To measure this variability, we take each sample mean and compute its difference from the grand mean. If the distributions overlap or are close together, the grand mean will be similar to the individual sample means, whereas if the distributions are far apart, the differences between the sample means and the grand mean will be large.
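The sketch below (in Python, with made-up numbers standing in for the symbolic xij values of the table above) shows how the sample means, the grand mean, and the between-group sum of squares are computed for a layout of 4 samples with 3 observations each:

```python
# Grand mean and between-group sum of squares (SST) for 4 samples x 3 observations.
# The numbers are placeholders for the symbolic x_ij values in the table above.
import numpy as np

samples = np.array([
    [2.0, 3.0, 4.0],     # sample 1
    [5.0, 6.0, 7.0],     # sample 2
    [6.0, 8.0, 7.0],     # sample 3
    [9.0, 10.0, 11.0],   # sample 4
])

sample_means = samples.mean(axis=1)      # Xm1 ... Xm4
grand_mean = sample_means.mean()         # Xgm (equal sample sizes, so mean of means = overall mean)
n_per_sample = samples.shape[1]

# Between-group variability: each sample mean's squared distance from the
# grand mean, weighted by the number of observations in that sample.
sst_between = (n_per_sample * (sample_means - grand_mean) ** 2).sum()

print("sample means:", sample_means)
print("grand mean:", grand_mean)
print("between-group SS:", sst_between)
```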
Two Way ANOVA
Two-way ANOVA allows comparing population means when the populations are classified according to two independent factors.
Example: We might like to look at SAT scores of students who are male or female (first factor) and either have or have not had a preparatory course (second factor).
The Two-way ANOVA model:
Mathematically, ANOVA can be written as:
xij = μij + εij
where xij is the observation in the cell defined by level i of the first factor and level j of the second factor, μij is the population mean of that cell, and εij is the unexplained variation. Thus, each data point (xij) is its cell mean plus error.
Just like the one-way model, we calculate the between-group sums of squares; in this case there are two SSTs, one for each factor, plus the sum of squares of errors (within).
We then calculate an F-statistic for each factor's MSST and compare it with the F-critical value to see whether that factor has a significant effect.
Example:
Given below is data on crop yield by temperature and salinity. Carry out the ANOVA for this table.
Temperature (in F) | Salinity 700 | Salinity 1400 | Salinity 2100 | Total | Mean (temp)
60                 | 3            | 5             | 4             | 12    | 4
70                 | 11           | 10            | 12            | 33    | 11
80                 | 16           | 21            | 17            | 54    | 18
Total              | 30           | 36            | 33            | 99    | 11
Mean (salinity)    | 10           | 12            | 11            |       | 11
Ans:
Hypothesis for Temperature:
H0: Yield is the same for all temperatures
H1: Yield varies significantly with temperature
Hypothesis for Salinity:
H0: Yield is the same for all salinity levels
H1: Yield varies significantly with salinity
Grand mean = 11
N = 9, k = 3, nt = 3, ns = 3
SSbetween_temp = 3*(4-11)^2 + 3*(11-11)^2 + 3*(18-11)^2 = 294
MSSTtemp = 294/2 = 147
SSbetween_salinity = 3*(10-11)^2 + 3*(12-11)^2 + 3*(11-11)^2 = 6
MSSTsalinity = 6/2 = 3
In such questions, calculating SSE directly can be tricky, so instead let's calculate TSS and then subtract the SST values from it to get SSE.
To calculate the total sum of squares (TSS), we sum the squared differences of each value from the grand mean.
TSS = (3-11)^2 + (5-11)^2 + (4-11)^2 + (11-11)^2 + (10-11)^2 + (12-11)^2 + (16-11)^2 + (21-11)^2 + (17-11)^2
TSS = 312
SSE = TSS - SSbetween_temp - SSbetween_salinity = 312 - 294 - 6 = 12
Degrees of freedom for SSE = (nt-1)(ns-1) = (3-1)(3-1) = 4
MSSE = SSE/4 = 12/4 = 3
F-Test For temperature
Ftemp = MSSTtemp/MSSE = 147/3 = 49
F-Test For Salinity
Fsalinity = MSSTsalinity/MSSE = 3/3 = 1
F-critical for 5% significance and degrees of freedom (k-1, (p-1)(q-1)), i.e. (2, 4):
F-critical ≈ 6.94
Clearly, Ftemp is greater than F-critical, so we reject the null hypothesis and conclude that temperature has a significant effect on yield.
On the other hand, Fsalinity is less than F-critical, so we fail to reject the null hypothesis and conclude that salinity does not have a significant effect on yield.
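As a quick numerical check, the arithmetic of this worked example can be reproduced with a short Python sketch (the yield table is the one given above; scipy is used only to look up the F-critical value):

```python
# Reproduce the two-way ANOVA arithmetic for the yield data above.
# Rows: temperature 60/70/80 F; columns: salinity 700/1400/2100.
import numpy as np
from scipy import stats

yields = np.array([
    [3, 5, 4],       # temperature 60
    [11, 10, 12],    # temperature 70
    [16, 21, 17],    # temperature 80
], dtype=float)

grand_mean = yields.mean()                    # 11
temp_means = yields.mean(axis=1)              # 4, 11, 18
sal_means = yields.mean(axis=0)               # 10, 12, 11

ss_temp = (3 * (temp_means - grand_mean) ** 2).sum()    # 294
ss_sal = (3 * (sal_means - grand_mean) ** 2).sum()      # 6
tss = ((yields - grand_mean) ** 2).sum()                # 312
sse = tss - ss_temp - ss_sal                            # 12

ms_temp, ms_sal, mse = ss_temp / 2, ss_sal / 2, sse / 4   # mean squares: 147, 3, 3
f_temp, f_sal = ms_temp / mse, ms_sal / mse               # 49 and 1
f_crit = stats.f.ppf(0.95, 2, 4)                          # ~6.94 at 5% significance

print(f"F(temp) = {f_temp:.1f}, F(salinity) = {f_sal:.1f}, F-critical = {f_crit:.2f}")
```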