Part 3: The Central Limit Theorem for a Sample Proportion
Four conditions must hold in order to use the Central Limit Theorem (CLT): 1) the data must come from a simple random sample (SRS); 2) the sample size must be less than 10% of the population; 3) the observations must be independent; and 4) the sample size must be large enough that both np > 10 and n(1 - p) > 10, where p is the true proportion (or probability) of the attribute of interest. When these hold, the CLT predicts three things about the sampling distribution of the sample proportion:
Shape: The distribution is approximately normal.
Center: The mean will equal p.
Spread: The standard deviation will equal sqrt{p(1-p)/n}.
Note: This normal approximation becomes more accurate as the sample size increases and is generally considered valid as long as np > 10 and n(1 - p) > 10.
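The three predictions and the size condition can be sketched in R. This is only an illustrative helper: the name clt_prediction and the threshold argument are assumptions, not part of the original exercise, and p = 0.05 is used as an example value.

```r
# Hypothetical helper (not from the original exercise): CLT predictions
# for the sampling distribution of a sample proportion.
clt_prediction <- function(n, p, threshold = 10) {
  stopifnot(n > 0, p >= 0, p <= 1)
  data.frame(
    n = n,
    mean = p,                          # Center: equals p
    sd = sqrt(p * (1 - p) / n),        # Spread: sqrt{p(1-p)/n}
    np = n * p,
    nq = n * (1 - p),
    # TRUE only when both np and n(1 - p) exceed the threshold
    normal_ok = (n * p > threshold) & (n * (1 - p) > threshold)
  )
}

# Example with an assumed p = 0.05 and three sample sizes
print(clt_prediction(c(20, 60, 180), p = 0.05))
```

The normal_ok column flags whether the usual rule of thumb is met, which is a quick way to judge how much to trust the normal shape for a given n.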
If you continue to assume that the population proportion for the coin flips is p (the value you entered in Part 2), what does the CLT predict for the shape, mean, and standard deviation of the sampling distribution of sample proportions when each sample consists of the given number of flips? Using the formulas above, complete the table below.
n | 20 | 60 | 180 |
Shape | | | |
Center (mean) | | | |
Spread (SD) | | | |
How do the actual values compare with the simulated values in Part 2?
Did the sampling distribution change as the sample sizes increased? Explain.
The table can be completed as follows:
n | 20 | 60 | 180 |
Shape | Since np = 20*0.05 = 1 is well below 10, the distribution will be strongly skewed | Since np = 60*0.05 = 3 is still below 10, the distribution will be noticeably skewed | Since np = 180*0.05 = 9 is close to (though just below) 10, the distribution will be approximately normal |
Center = p | 0.05 | 0.05 | 0.05 |
Spread = sqrt{p(1-p)/n} | 0.0487 | 0.0281 | 0.0162 |
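As a worked check of the Spread row, assuming p = 0.05 (the value used in the Shape row), the formula gives approximately:

```latex
\[
\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}, \qquad p = 0.05, \quad p(1-p) = 0.0475
\]
\begin{align*}
n = 20:  &\quad \sqrt{0.0475 / 20}  = \sqrt{0.002375} \approx 0.0487 \\
n = 60:  &\quad \sqrt{0.0475 / 60}  \approx \sqrt{0.000792} \approx 0.0281 \\
n = 180: &\quad \sqrt{0.0475 / 180} \approx \sqrt{0.000264} \approx 0.0162
\end{align*}
```

Each tripling of n divides the standard deviation by sqrt{3}, roughly 1.73, while the center stays at p.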
I will simulate the problem in the question using R software:
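    # Simulation function for the sample mean
    # (the mean of 0/1 outcomes is the sample proportion)
    sim.func <- function(n, p, n.sims = 1e4) {
      # Vector to store the sample mean from each simulated sample
      res <- numeric(n.sims)
      for (i in 1:n.sims) {
        # Draw n flips: 1 with probability p, 0 with probability 1 - p
        obs.samp <- sample(c(1, 0), n, prob = c(p, 1 - p), replace = TRUE)
        res[i] <- mean(obs.samp)
      }
      # Mean and standard deviation predicted by the CLT
      exp.mu <- p
      exp.sd <- sqrt(p * (1 - p) / n)
      # Density histogram; hist() has no `bins` argument, so default breaks are used
      hist(res, freq = FALSE, xlab = "Sample means", main = paste("n =", n))
      # Overlay the normal curve predicted by the CLT
      curve(dnorm(x, exp.mu, exp.sd), col = "red", add = TRUE)
    }

    # One panel per sample size, stacked vertically
    par(mfrow = c(3, 1))
    for (n in c(20, 60, 180)) {
      sim.func(n, 0.05)
    }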
The plot generated is:
It can be seen that the sampling distribution changes as the sample size increases. At n = 20, the histogram shows considerable right skew, but by n = 180 the distribution is more symmetric and matches the expected normal curve (in red) quite well.
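The visual comparison can be backed up with numbers. The following is a minimal sketch, assuming p = 0.05 as in the simulation above; the helper name compare_clt, the seed, and the use of rbinom (in place of the explicit sampling loop) are illustrative choices rather than part of the original answer.

```r
# Sketch: numeric check of simulated values against the CLT predictions.
# Assumes p = 0.05; the seed and helper name are illustrative.
set.seed(1)

compare_clt <- function(n, p, n.sims = 1e4) {
  # Each simulated sample proportion is (number of successes) / n
  p.hat <- rbinom(n.sims, size = n, prob = p) / n
  # Sample skewness: third central moment over the cubed standard deviation
  skew <- mean((p.hat - mean(p.hat))^3) / sd(p.hat)^3
  c(n = n,
    sim.mean = mean(p.hat), exp.mean = p,
    sim.sd = sd(p.hat), exp.sd = sqrt(p * (1 - p) / n),
    skewness = skew)
}

# One row per sample size
results <- t(sapply(c(20, 60, 180), compare_clt, p = 0.05))
print(round(results, 4))
```

The simulated mean and standard deviation should sit close to the predicted ones at every n, while the skewness column should shrink toward zero as n grows, which is the numeric counterpart of the histograms becoming more symmetric.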