Describe how to obtain a p-value for a chi-squared test for goodness of fit. Then describe how to obtain a p-value for a chi-squared test for independence. Make sure to point out the differences from your answer to the question above.
Please use simple terms! I have no idea what's going on.
In a chi-squared test for goodness of fit, the given data set is compared to a hypothesized distribution which it is expected to follow. Generally, we are given a frequency distribution of the values, and the expected frequencies are then computed by assuming that the data follows that hypothesized distribution. The hypothesized distribution may be a standard distribution with defined parameters, e.g. a Normal distribution with a certain mean and variance, a Poisson distribution with a certain rate, a Uniform distribution etc., or it may be an explicitly defined discrete distribution (a list of categories with their probabilities).
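To find the p-value, we first need the test statistic. This is obtained as $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed frequency of category $i$, $E_i$ is its expected frequency, and $k$ is the number of categories.
Next, we need the degrees of freedom of the statistic. This is one less than the number of categories or data values, i.e. $k - 1$.
The p-value is finally obtained using a Chi-square calculator, for this statistic value and these degrees of freedom. It is the probability of seeing a statistic at least this large if the data really did follow the hypothesized distribution.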
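The three steps above (statistic, degrees of freedom, p-value) can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `chi2_p_value` and `goodness_of_fit` are made up for this sketch, the coin-toss data is hypothetical, and the series used for the tail probability is a simple textbook formula that works for moderate statistic values, not a replacement for a proper statistics package.

```python
import math

def chi2_p_value(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-squared variable."""
    if statistic <= 0:
        return 1.0
    a, z = df / 2.0, statistic / 2.0
    # Power series for the regularized lower incomplete gamma function P(a, z).
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    lower_tail = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower_tail)

def goodness_of_fit(observed, probabilities):
    """Return (statistic, degrees of freedom, p-value) for a goodness-of-fit test."""
    n = sum(observed)
    # Expected frequency = hypothesized probability * total number of observations.
    expected = [p * n for p in probabilities]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # one less than the number of categories
    return statistic, df, chi2_p_value(statistic, df)

# Hypothetical data: 100 coin tosses with 60 heads and 40 tails, tested against a fair coin.
statistic, df, p_value = goodness_of_fit([60, 40], [0.5, 0.5])
print(statistic, df, round(p_value, 4))
```

For the hypothetical coin data the statistic is 4 with 1 degree of freedom, and the p-value comes out at roughly 0.046, which matches what a Chi-square calculator reports for those inputs.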
This might not be easy to understand just theoretically. So here is an illustration on this type of Chi-square problem.
We need to conduct the Chi-square Goodness of Fit test, based on a given discrete probability distribution.
Null Hypothesis, H0: New machines follow the same probability distribution to perform the job as the old machines
Alternate hypothesis, Ha: New machines do not follow the same probability distribution to perform the job as the old machines
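We can compute the expected frequencies as per the table below, by multiplying each probability by the total frequency (400). The contribution of each category to the Chi-square statistic has been computed using $\frac{(O_i - E_i)^2}{E_i}$, where $O_i$ is the observed and $E_i$ the expected frequency, and the contributions are then added up.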
Type | Old Machine Probability | Observed Frequency | Expected Frequency | Chi-Square Contribution |
Top Grade | 0.4 | 174 | 160 | 1.225 |
High Grade | 0.3 | 115 | 120 | 0.208333333 |
Medium Grade | 0.2 | 71 | 80 | 1.0125 |
Low Grade | 0.1 | 40 | 40 | 0 |
SUM | 1 | 400 | 400 | 2.445833333 |
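p-value: The test has 3 degrees of freedom, viz. one less than the number of categories (4 - 1 = 3). Hence, for the statistic 2.4458, the p-value is approximately 0.485. This is well above the usual cut-offs such as 0.05, so there is no significant evidence that the new machines follow a different distribution from the old machines.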
The calculator used is
https://www.danielsoper.com/statcalc/calculator.aspx?id=11
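As a cross-check on the table and the calculator, the same numbers can be reproduced with a short Python sketch. The variable names are illustrative, and the p-value line uses a closed-form expression that holds only for exactly 3 degrees of freedom (it is an assumption of this sketch, not the method of the linked calculator).

```python
import math

# Data from the worked example: old-machine probabilities and new-machine counts.
probabilities = [0.4, 0.3, 0.2, 0.1]
observed = [174, 115, 71, 40]

total = sum(observed)                          # 400 items in all
expected = [p * total for p in probabilities]  # 160, 120, 80, 40

# Per-category contribution (O - E)^2 / E, then the sum is the test statistic.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
statistic = sum(contributions)

df = len(observed) - 1  # 4 categories, so 3 degrees of freedom

# Upper-tail probability of a chi-squared variable; this formula is valid for df == 3 only.
p_value = (math.erfc(math.sqrt(statistic / 2))
           + math.sqrt(2 * statistic / math.pi) * math.exp(-statistic / 2))

print(round(statistic, 4), df, round(p_value, 3))
```

The statistic matches the SUM row of the table (about 2.4458), and the p-value lands just under 0.5, far from any common significance level.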
In a chi-squared test for independence, the data is given in the form of a table across two categorical variables, one along the rows and the other along the columns, each of which can have several levels or sub-categories. The objective is to determine whether or not there is an association between the two variables. That is, does the distribution of counts across the levels of one variable change with the level of the other, or is it the same at every level?
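The formula for the Chi-square statistic looks similar, $\chi^2 = \sum_{i} \sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, but the way to compute it is different, since the data is in the form of a two-way table instead of a single frequency distribution, and no hypothesized distribution is given to us.
The computation of expected values is therefore also different. First, find the sum of the given values along each row and each column. To find the expected value at the (i, j) cell, multiply the row sum and the column sum for that cell, and divide by the sum total of all observations. Symbolically, $E_{ij} = \frac{R_i \times C_j}{N}$, where $R_i$ is the total of row $i$, $C_j$ is the total of column $j$, and $N$ is the grand total.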
To find the p-value, we need to define the degrees of freedom of the statistic. Here this is the product of one less than the number of rows and one less than the number of columns, i.e. $(r - 1)(c - 1)$, rather than one less than the number of categories.
The p-value is finally obtained using a Chi-square calculator, for this statistic value and these degrees of freedom.
This might not be easy to understand just theoretically. So here is an illustration on this type of Chi-square problem.
(1) We need to perform the Chi-square test for independence between the Verdict and the categories of seasons. That is, we check whether the proportions of the different verdicts change from one group of seasons to another. If there is an association, we shall have a significant result.
(2) The hypothesis statements are
H0, Null Hypothesis: The distribution (proportion) of different types of verdicts among Guilty, Not Guilty, Plea Bargain and Other does not change across the different 5-season populations of Suits. That is, the two variables are independent of each other.
H1, Alternate hypothesis: The distribution (proportion) of at least one of the verdicts among Guilty, Not Guilty, Plea Bargain and Other changes across the different 5-season populations of Suits. That is, the two variables depend on each other.
Such problems are best done on Excel, since the Chi-square test involves a lot of cross-referencing across the table. The process is to first find the expected value for each cell. This is computed as $E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$.
The observed values and the resulting expected values are tabulated below.
Observed Values
Seasons | Guilty | Not Guilty | Plea Bargain | Other | Total |
Season 1-5 | 31 | 4 | 26 | 20 | 81 |
Season 6-10 | 28 | 7 | 33 | 20 | 88 |
Season 11-15 | 33 | 8 | 30 | 19 | 90 |
Season 16-20 | 26 | 4 | 24 | 23 | 77 |
Total | 118 | 23 | 113 | 82 | 336 |
Expected Values
Seasons | Guilty | Not Guilty | Plea Bargain | Other | Total |
Season 1-5 | 28.44642857 | 5.544642857 | 27.24107143 | 19.76785714 | 81 |
Season 6-10 | 30.9047619 | 6.023809524 | 29.5952381 | 21.47619048 | 88 |
Season 11-15 | 31.60714286 | 6.160714286 | 30.26785714 | 21.96428571 | 90 |
Season 16-20 | 27.04166667 | 5.270833333 | 25.89583333 | 18.79166667 | 77 |
Total | 118 | 23 | 113 | 82 | 336 |
For example, the expected value of Season 1-5, Guilty is obtained as the product of 81 and 118 (the row sum and the column sum), divided by 336, the total sum: 81 × 118 / 336 = 28.446.
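For readers who prefer code to a spreadsheet, the expected-value table can be rebuilt from the observed counts with a minimal Python sketch. The variable names are illustrative; only the observed counts come from the table above.

```python
# Observed verdict counts: one row per group of seasons,
# columns are Guilty, Not Guilty, Plea Bargain, Other.
row_labels = ["Season 1-5", "Season 6-10", "Season 11-15", "Season 16-20"]
observed = [
    [31, 4, 26, 20],
    [28, 7, 33, 20],
    [33, 8, 30, 19],
    [26, 4, 24, 23],
]

row_totals = [sum(row) for row in observed]        # 81, 88, 90, 77
col_totals = [sum(col) for col in zip(*observed)]  # 118, 23, 113, 82
grand_total = sum(row_totals)                      # 336

# Expected count for each cell = row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for label, row in zip(row_labels, expected):
    print(label, [round(value, 3) for value in row])
```

Each row and column of the expected table adds up to the same totals as the observed table, which is a quick way to check the arithmetic.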
The next step is to compute the Chi-square statistic. For each cell in the table, take the squared difference between the observed and expected value and divide it by the expected value; the statistic is the sum of these quantities over all cells.
This is again computed by using the above tables as
Chi-Square Statistic
Seasons | Guilty | Not Guilty | Plea Bargain | Other |
Season 1-5 | 0.22922832 | 0.430311134 | 0.056541766 | 0.002726158 |
Season 6-10 | 0.273020765 | 0.158196876 | 0.391698272 | 0.101467638 |
Season 11-15 | 0.061380145 | 0.549120083 | 0.002370417 | 0.400058072 |
Season 16-20 | 0.040125835 | 0.306406456 | 0.138793913 | 0.94244272 |
Total of all cells: 4.083888569
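The cell-by-cell computation can also be sketched in Python. The helper name `chi_square_statistic` is made up for this illustration; it recomputes the expected counts from the row and column totals and then adds up the contribution of every cell.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for a two-way table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence for cell (i, j).
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic

# Observed verdict counts from the table (Guilty, Not Guilty, Plea Bargain, Other).
observed = [
    [31, 4, 26, 20],
    [28, 7, 33, 20],
    [33, 8, 30, 19],
    [26, 4, 24, 23],
]

statistic = chi_square_statistic(observed)
print(round(statistic, 6))
```

The result agrees with the total of the Chi-Square Statistic table, about 4.0839.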
To make a conclusion, we either need to find the p-value of this observed statistic, or find a critical value from the Chi-square table for the given degrees of freedom and level of significance. Since no level of significance is specified, let us find the p-value.
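The degrees of freedom of the test is $(4 - 1) \times (4 - 1) = 9$, since the table has 4 rows and 4 columns.
Hence, the p-value is obtained from the Chi-square probability calculator for 9 degrees of freedom and a statistic of 4.0839, and is approximately 0.906.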
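If a Chi-square calculator is not at hand, the p-value for this case can be approximated with the standard library alone. The sketch below uses a closed-form expression that is valid only when the degrees of freedom are odd (as they are here, with 9); the function name `chi2_p_value_odd_df` is hypothetical and introduced purely for illustration.

```python
import math

def chi2_p_value_odd_df(statistic, df):
    """Upper-tail chi-squared probability; this closed form needs an odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this sketch only handles odd degrees of freedom")
    root = math.sqrt(statistic)
    # Finite sum x^(1/2)/1 + x^(3/2)/3 + x^(5/2)/15 + ... with (df - 1) / 2 terms.
    term = root
    series = 0.0
    for k in range(1, (df - 1) // 2 + 1):
        series += term
        term *= statistic / (2 * k + 1)
    return (math.erfc(root / math.sqrt(2))
            + math.sqrt(2 / math.pi) * math.exp(-statistic / 2) * series)

rows, columns = 4, 4
df = (rows - 1) * (columns - 1)  # 9 degrees of freedom
p_value = chi2_p_value_odd_df(4.083888569, df)
print(df, round(p_value, 3))
```

The value comes out at about 0.91, in line with what a Chi-square calculator gives for 9 degrees of freedom.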
Conclusion: The p-value is very large, since the general criterion for a significant result is that the p-value should be less than 0.05, 0.01 etc. Thus, the result is not significant and we fail to reject the null hypothesis. The data give no evidence that the distribution (proportion) of different types of verdicts among Guilty, Not Guilty, Plea Bargain and Other changes across the different 5-season populations of Suits. Note that this does not prove the two variables are independent; it only means the data are consistent with independence.
Excel Link: https://drive.google.com/file/d/1doePWbdFk51yp1HQBs-KrPbWFbWUQtuk/view?usp=sharing