Question

Why are we justified in pooling the population proportion estimates and the standard error of the differences between these estimates when we conduct significance tests about the difference between population proportions?

Solutions

Expert Solution

Answer:

"Pooling" is the name given to a technique used to obtain a more precise estimate of the standard deviation of a sample statistic by combining the estimates given by two (or more) independent samples. When performing tests (or calculating confidence intervals) for a difference of two means, we do not pool. In other statistical situations we may or may not pool, depending on the situation and the populations being compared. For example, the theory behind analysis of variance and the inferences for simple regression are based on pooled estimates of variance. The rules for inference about two proportions firmly go both(!) ways. We always use a pooled estimate of the standard deviation (based on a pooled estimate of the proportion) when carrying out a hypothesis test whose null hypothesis is p1 = p2 -- but not when constructing a confidence interval for the difference in proportions. Why?

In any hypothesis test, we are calculating conditional probabilities based on the assumption that the null hypothesis is true. For example, in calculating the sample z with proportions or t with means, we use the values derived from the null hypothesis as the mean of our sampling distribution; if the null hypothesis determines a value for the standard deviation of the sample statistic, we use that value in our calculations. If the null hypothesis fails to give us a value for the standard deviation of our statistic, as is the case with means, we estimate the standard deviation of the statistic using sample data.

The special feature of proportions important for this discussion is that the value of p determines the value of $\sigma_{\hat{p}}$ (the standard deviation of $\hat{p}$): $\sigma_{\hat{p}} = \sqrt{p(1-p)/n}$. This is very different from the situation for means, where two populations can have identical means but wildly different standard deviations -- and thus different standard deviations of the sample means. We can't estimate $\sigma_{\bar{x}}$ from a value of $\mu$; we need to go back to the data and look at deviations. In the one-population case, this special feature means that our test statistic follows a z, rather than t, distribution when we work with one proportion. In this case, we actually do know the variance based on the null hypothesis.
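A minimal Python sketch of this point, assuming the usual formula sqrt(p(1-p)/n) for the standard deviation of a sample proportion. The function name and the values p = 0.5, n = 100 are illustrative and not from the original answer.

```python
import math

def sd_of_sample_proportion(p: float, n: int) -> float:
    """Standard deviation of the sample proportion when the true proportion is p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)

# Under a null hypothesis such as p = 0.5 with n = 100 (illustrative values),
# the standard deviation is fixed before any data are examined.
print(sd_of_sample_proportion(0.5, 100))
```

No such function can exist for means: a hypothesized mean alone says nothing about the spread of the data, so the standard deviation has to be estimated from the sample.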

When we move to considering two populations and the difference between proportions of "successes," our null hypothesis for a test is generally p1 = p2 (or equivalently, p1 - p2 = 0). This null hypothesis implies that the estimates of p1 and p2 -- that is, $\hat{p}_1$ and $\hat{p}_2$ -- are both estimates for the assumed common proportion of "successes" in the population (call it p). If the null hypothesis is true -- and all our calculations are based on this assumed truth -- we are looking at two independent samples from populations with the same proportion of successes. So with independent random samples, the variance of the difference in sample proportions ($\hat{p}_1 - \hat{p}_2$) is given by the sum of the variances, according to the familiar rules of random variables:
$\sigma^2_{\hat{p}_1 - \hat{p}_2} = \frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2} = p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$.

When we are carrying out a test, we don't know the value of p -- in fact, we are asking if there is any such single value -- so we don't claim to know the value of $\sigma_{\hat{p}_1 - \hat{p}_2}$. We calculate our best estimate of $\sigma_{\hat{p}_1 - \hat{p}_2}$ from our best estimate of p, which is "total number of successes/total number of trials" (in our usual notation, $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$). Substituting this value of $\hat{p}$ for both p1 and p2 gives our estimate of $\sigma_{\hat{p}_1 - \hat{p}_2}$; we have merged the data from the two samples to obtain what is called the "pooled" estimate of the standard deviation. We have done this not because it is more convenient (it isn't -- there's more calculation involved) nor because it reduces the measurement of variability (it doesn't always -- often the pooled estimate is larger*) but because it gives us the best estimate of the variability of the difference under our null hypothesis that the two sample proportions came from populations with the same proportion. Using the inappropriate formula will either increase the β-risk beyond what is claimed or increase the α-risk beyond what is intended; neither is considered a good result.

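Thus for a hypothesis test with null hypothesis p1 = p2, our test statistic (used to find the p-value or to compare to the critical value in a table) is $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$ with $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$.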
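The pooled test can be sketched in Python as follows. This is an illustration rather than a definitive implementation: the function name is hypothetical, the counts 45/100 and 30/100 are made-up data, and the p-value relies on the normal approximation, so it is only reasonable when both samples are fairly large.

```python
import math
from statistics import NormalDist

def pooled_two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Return (z, two_sided_p_value, pooled_se) for the null hypothesis p1 = p2.

    x1, x2 are success counts; n1, n2 are sample sizes.
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled estimate: total successes over total trials.
    p_pool = (x1 + x2) / (n1 + n2)
    # Standard error of the difference, assuming a common proportion.
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled standard error is zero (all successes or all failures)")
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, se

# Hypothetical data: 45 successes in 100 trials versus 30 in 100.
z, p_value, se = pooled_two_proportion_z(45, 100, 30, 100)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.4f}, pooled SE = {se:.4f}")
```

Both sample proportions enter the standard error only through the pooled proportion, which is exactly what the null hypothesis p1 = p2 licenses.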

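Of course, the above discussion applies only to hypothesis tests in which the null hypothesis is p1 = p2. For estimating the difference p1 - p2, we are not working under the assumption of equal proportions; there would be nothing to estimate if we believe the proportions are equal. So our estimate of p1 - p2 is $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$. Likewise, if we have a null hypothesis of the form p1 = p2 + k (with k not equal to 0), our assumption is that the proportions are different, so there is no common p to estimate by pooling, and our test statistic is $z = \frac{(\hat{p}_1 - \hat{p}_2) - k}{\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$.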
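For the unpooled cases, a hedged Python sketch might look like this. The function names, the default 95% level, and the counts in the usage lines are assumptions made for illustration; the interval is the plain normal-approximation (Wald) interval, not one of the adjusted intervals some texts recommend for small samples.

```python
import math
from statistics import NormalDist

def unpooled_se(x1: int, n1: int, x2: int, n2: int) -> float:
    """Standard error of p1_hat - p2_hat without assuming p1 = p2."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    return math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)

def diff_confidence_interval(x1: int, n1: int, x2: int, n2: int, level: float = 0.95):
    """Normal-approximation confidence interval for p1 - p2."""
    z_star = NormalDist().inv_cdf(0.5 + level / 2)
    diff = x1 / n1 - x2 / n2
    margin = z_star * unpooled_se(x1, n1, x2, n2)
    return diff - margin, diff + margin

def shifted_null_z(x1: int, n1: int, x2: int, n2: int, k: float) -> float:
    """z statistic for the null hypothesis p1 = p2 + k, with k not equal to 0."""
    return ((x1 / n1 - x2 / n2) - k) / unpooled_se(x1, n1, x2, n2)

# Hypothetical data: 45 successes in 100 trials versus 30 in 100.
low, high = diff_confidence_interval(45, 100, 30, 100)
print(f"95% interval for p1 - p2: ({low:.4f}, {high:.4f})")
print(f"z for H0: p1 = p2 + 0.05: {shifted_null_z(45, 100, 30, 100, 0.05):.3f}")
```

Each sample proportion keeps its own variance term here because nothing in these procedures asserts that the two populations share a proportion.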

So we have the answer to the original question. When we carry out a test with null hypothesis p1 = p2, all our calculations are based on the assumption that this null is true -- so our best estimate for the variance (and thus the standard deviation) of the difference between sample proportions ($\hat{p}_1 - \hat{p}_2$) is given by the "pooled" formula. In all other inferences on two proportions (estimation of a difference, a test with null p1 = p2 + k), we do not have any such assumption -- so our best estimate for the variance of the difference between sample proportions is given by the "unpooled" formula. We pool for the one case, and do not pool for the others, because in the one case we must treat the two sample proportions as estimates of the same value and in the other cases we have no justification for doing so.

*A technical footnote: Here are some cases in which we can readily compare the relative sizes of pooled and unpooled estimates.

1. If $\hat{p}_1 = \hat{p}_2$, the two (pooled and unpooled) estimates of $\sigma_{\hat{p}_1 - \hat{p}_2}$ will be exactly the same, since we obtain $\hat{p} = \hat{p}_1 = \hat{p}_2$.

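2. If the sample sizes are equal (n1 = n2 = n), then $\hat{p} = \frac{\hat{p}_1 + \hat{p}_2}{2}$. In this case, the unpooled estimate of the variance of the difference is $\frac{\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)}{n}$, and the pooled estimate of variance of the difference is $\frac{2\hat{p}(1-\hat{p})}{n}$, which can (with heroic algebra!) be rewritten as $\frac{\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)}{n} + \frac{(\hat{p}_1 - \hat{p}_2)^2}{2n}$, so the pooled estimate is actually larger unless the sample proportions are equal.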

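3. If the sample proportions are unequal but equally extreme (equally far from .5), then we have $\hat{p}_1 = .5 + e$ and $\hat{p}_2 = .5 - e$ with 0 < e < .5. In this case, $\hat{p}_1(1-\hat{p}_1) = \hat{p}_2(1-\hat{p}_2) = .25 - e^2$, the pooled estimate of variance can be written $\left(.25 - e^2\left(\frac{n_1 - n_2}{n_1 + n_2}\right)^2\right)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$, and the unpooled estimate can be written $(.25 - e^2)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$, and the difference is $\frac{4e^2}{n_1 + n_2}$, so the pooled estimate is always larger than the unpooled estimate.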

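For example, with $\hat{p}_1 = .8$ and $\hat{p}_2 = .2$ (so that e = .3), with n1 = 10 and n2 = 15, the unpooled estimate of variance is .02667 and the pooled estimate is .04107, and $\frac{4e^2}{n_1 + n_2} = \frac{.36}{25} = .0144 = .04107 - .02667$.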

4. If the sample sizes are different enough (precise cutoffs are difficult to state), and the more extreme (further from .5) sample proportion comes from the larger sample, the pooled estimate of the variance will be smaller than the unpooled estimate, but if the more extreme proportion is from the smaller sample, the pooled estimate of variance will be larger than the unpooled estimate. For example, consider the following table showing the effects of sample size when $\hat{p}_1 = .4$ and $\hat{p}_2 = .1$:

n1    n2    Pooled Estimate    Unpooled Estimate    Comparison
15    10    .0336              .025                 Pooled is larger
10    15    .0286              .03                  Pooled is smaller



For $\hat{p}_1 = .6$ and $\hat{p}_2 = .1$ (same degree of "extremeness" as in the table, but on opposite sides of .5), a greater difference in sample sizes is required to show the same effect -- but sample sizes of 15 and 35 suffice, as shown here:

n1    n2    Pooled Estimate    Unpooled Estimate    Comparison
35    15    .0236              .0129                Pooled is larger
15    35    .0179              .0186                Pooled is smaller
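The footnote's comparisons can be checked numerically with a short Python sketch. The helper names are hypothetical. The sketch assumes the gap "pooled minus unpooled" equals (p1_hat - p2_hat)^2 / (2n) for equal sample sizes and 4e^2 / (n1 + n2) for equally extreme proportions, and it assumes sample proportions of .4 and .1 for the first table and .6 and .1 for the second; those values reproduce the tabulated estimates, as do their mirror images about .5.

```python
def pooled_var(p1_hat: float, n1: int, p2_hat: float, n2: int) -> float:
    """Pooled estimate of the variance of p1_hat - p2_hat."""
    p_pool = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)
    return p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)

def unpooled_var(p1_hat: float, n1: int, p2_hat: float, n2: int) -> float:
    """Unpooled estimate of the variance of p1_hat - p2_hat."""
    return p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2

# Case 2 (equal sample sizes): the gap should match (p1_hat - p2_hat)^2 / (2n).
n, a, b = 20, 0.35, 0.60  # illustrative values
print(pooled_var(a, n, b, n) - unpooled_var(a, n, b, n), (a - b) ** 2 / (2 * n))

# Case 3 (equally extreme proportions): the gap should match 4e^2 / (n1 + n2).
e, n1, n2 = 0.3, 10, 15
gap = pooled_var(0.5 + e, n1, 0.5 - e, n2) - unpooled_var(0.5 + e, n1, 0.5 - e, n2)
print(gap, 4 * e ** 2 / (n1 + n2))

# Case 4 (unequal sample sizes): recompute both tables.
cases = [
    (0.4, 0.1, [(15, 10), (10, 15)]),
    (0.6, 0.1, [(35, 15), (15, 35)]),
]
for p1_hat, p2_hat, sizes in cases:
    for m1, m2 in sizes:
        pooled = pooled_var(p1_hat, m1, p2_hat, m2)
        unpooled = unpooled_var(p1_hat, m1, p2_hat, m2)
        verdict = "larger" if pooled > unpooled else "smaller"
        print(f"p1_hat={p1_hat} p2_hat={p2_hat} n1={m1} n2={m2} "
              f"pooled={pooled:.4f} unpooled={unpooled:.4f} pooled is {verdict}")
```

Working with proportions rather than counts lets the same two helpers cover every case in the footnote, including sample sizes for which the implied success counts are not whole numbers.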
