Question

In this question, we will formulate a measure to quantify the level of association between two categorical variables. Such a measure is often used in a statistical test called the chi-square test, which assesses whether there is an association between two categorical variables. This question is also used to motivate the learning of independence and to connect the concept back to what we have learnt in the course.

Let's revisit the example we have looked at in the course. How is diet type (high cholesterol diet versus low cholesterol diet) related to the risk of coronary heart disease? The data for 23 individuals are:


                    High cholesterol diet   Low cholesterol diet   Total
Heart disease       (i) 11                  (ii) 2                 13
No heart disease    (iii) 4                 (iv) 6                 10
Total               15                      8                      23



From the table we find that the probability of having heart disease is 13/23 and the probability of having a high cholesterol diet is 15/23. Similarly, we can find the probability of not having heart disease (10/23) and the probability of having a low cholesterol diet (8/23).
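The marginal probabilities read off the table can be reproduced with a short script. This is only a sketch: the dictionary layout, the labels, and the helper name `marginal` are illustrative choices, not part of the original question.

```python
from fractions import Fraction

# Observed counts from the table, keyed by (diet, disease status).
# The key labels are illustrative names chosen for this sketch.
observed = {
    ("high", "disease"): 11,     # cell (i)
    ("low", "disease"): 2,       # cell (ii)
    ("high", "no_disease"): 4,   # cell (iii)
    ("low", "no_disease"): 6,    # cell (iv)
}
n = sum(observed.values())  # 23 individuals in total


def marginal(position, label):
    """Add up the counts whose key has `label` at `position`, then divide by n."""
    total = sum(count for key, count in observed.items() if key[position] == label)
    return Fraction(total, n)


p_disease = marginal(1, "disease")
p_no_disease = marginal(1, "no_disease")
p_high = marginal(0, "high")
p_low = marginal(0, "low")

print(p_disease, p_no_disease, p_high, p_low)  # 13/23 10/23 15/23 8/23
```

Exact fractions are used so the values match the table totals without rounding.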

Part a
If there is no association between the two variables (i.e., the two are independent), the probability of having both heart disease and a high cholesterol diet is: [Round to four decimal places].


Part b
If the two variables are independent, we should expect the number of individuals with heart disease and a high cholesterol diet to be the probability in Part a multiplied by 23 individuals, which is: [Round to two decimal places].


Part c
Repeating Part b, we find that the expected number of individuals for the cells (ii), (iii), (iv) respectively on the table are: 4.52, 6.52, 3.48.

The following measure (called Chi-square test statistic):

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

quantifies the level of association between two categorical variables. The symbol $\sum$ means a sum. "Observed" here refers to the observed counts on the table, while "Expected" refers to the counts we would expect if the two variables were independent. The sum is taken across all the cells (i) to (iv) on the table.
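Written out for this table, the sum has four terms, one per cell. The shorthand $O$ and $E$ for the observed and expected counts is introduced here only for illustration; the expected-count formula restates the reasoning of Parts a and b, where the product of the two marginal probabilities is multiplied by 23.

```latex
\chi^2 = \frac{(O_{\mathrm{i}} - E_{\mathrm{i}})^2}{E_{\mathrm{i}}}
       + \frac{(O_{\mathrm{ii}} - E_{\mathrm{ii}})^2}{E_{\mathrm{ii}}}
       + \frac{(O_{\mathrm{iii}} - E_{\mathrm{iii}})^2}{E_{\mathrm{iii}}}
       + \frac{(O_{\mathrm{iv}} - E_{\mathrm{iv}})^2}{E_{\mathrm{iv}}},
\qquad
E_{\text{cell}} = 23 \times \frac{\text{row total}}{23} \times \frac{\text{column total}}{23}
                = \frac{\text{row total} \times \text{column total}}{23}
```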

If there is no association, the observed counts should not differ very much from the expected counts, which results in a relatively small value of $\chi^2$. A large $\chi^2$ value indicates disagreement between the expected and observed counts, which suggests the assumption of independence does not hold and the two variables are likely to be associated.

Compute $\chi^2$. [Round to two decimal places].


Of course, how large a value must be to count as "large" is a separate question, and it is beyond the scope of this course.

Solutions

Expert Solution


Part a

(13/23)*(15/23) = 0.3686
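As a quick check of the arithmetic, the product rule for independent events can be evaluated exactly and then rounded. The variable names below are illustrative only.

```python
from fractions import Fraction

# Marginal probabilities taken from the table totals.
p_disease = Fraction(13, 23)  # P(heart disease)
p_high = Fraction(15, 23)     # P(high cholesterol diet)

# Under independence, the joint probability is the product of the marginals.
p_joint = p_disease * p_high

print(p_joint)                    # 195/529
print(round(float(p_joint), 4))   # 0.3686
```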

Part b

13*15/23 = 8.48
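The same calculation applies to every cell: expected count = (row total × column total) / 23. The sketch below, with illustrative labels, computes all four expected counts at once and reproduces the values quoted for cells (ii), (iii) and (iv) in Part c.

```python
# Marginal totals from the table; the labels are illustrative names.
disease_totals = {"heart disease": 13, "no heart disease": 10}
diet_totals = {"high cholesterol diet": 15, "low cholesterol diet": 8}
n = 23

# Expected count under independence: n * (row total / n) * (column total / n).
expected = {
    (disease, diet): d_total * t_total / n
    for disease, d_total in disease_totals.items()
    for diet, t_total in diet_totals.items()
}

for cell, value in expected.items():
    print(cell, f"{value:.2f}")
# Cells (i), (ii), (iii), (iv) come out as 8.48, 4.52, 6.52, 3.48.
```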

Part c

Rows are diet type and columns are heart disease status. O = observed count, E = expected count under independence.

                          Heart disease   No heart disease   Total
High cholesterol diet
  Observed                11              4                  15
  Expected                8.48            6.52               15.00
  O - E                   2.52            -2.52              0.00
  (O - E)² / E            0.75            0.98               1.73
Low cholesterol diet
  Observed                2               6                  8
  Expected                4.52            3.48               8.00
  O - E                   -2.52           2.52               0.00
  (O - E)² / E            1.41            1.83               3.23
Total
  Observed                13              10                 23
  Expected                13.00           10.00              23.00
  (O - E)² / E            2.16            2.80               4.96

Summing (O - E)² / E over the four cells (i) to (iv), using unrounded values, gives

χ2 = 4.96
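The full Part c calculation can be sketched in a few lines of plain Python. The list layout and variable names are assumptions made for this example; the arithmetic follows the formula given in the question.

```python
# Observed counts: rows are diet (high, low cholesterol),
# columns are status (heart disease, no heart disease).
observed = [
    [11, 4],  # cells (i) and (iii)
    [2, 6],   # cells (ii) and (iv)
]

row_totals = [sum(row) for row in observed]        # [15, 8]
col_totals = [sum(col) for col in zip(*observed)]  # [13, 10]
n = sum(row_totals)                                # 23

chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Expected count if diet and heart disease were independent.
        exp = row_totals[i] * col_totals[j] / n
        chi_square += (obs - exp) ** 2 / exp

print(f"{chi_square:.2f}")  # 4.96
```

If SciPy is available, `scipy.stats.chi2_contingency(observed, correction=False)` should report the same statistic; with the default `correction=True`, a continuity correction is applied to 2×2 tables and the value differs.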

