In this topic, you'll do a calculation in RStudio to demonstrate the difference between using the paired and unpaired methods to analyze a difference of means. As an example data set, we'll use the county data set from the openintro package. This contains observations of several variables for each county in the USA, but we'll focus on two: pop2000, the county's population in the 2000 census, and pop2010, the population in the 2010 census.
Our goal is to calculate a 95% confidence interval for the
difference between log10(pop2000) and log10(pop2010).
In R, log10 is the base-10 logarithm (I'll explain at the end of
the prompt why we look at logs instead of raw numbers).
Use the following procedure:
Use sample() to select 250 rows at random from the county data set. Then, using your sample of 250 rows, calculate a 95% confidence interval for the difference in two ways: once with the paired method and once with the unpaired (independent samples) method.
Note: you may use the same degrees of freedom, 249, for each procedure, instead of using the complicated formula from section 5.3 for the independent sample. At this sample size, the difference will be negligible.
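One way to carry out this procedure is sketched below. The helper names paired_ci and independent_ci are hypothetical, and the seed is arbitrary. The code assumes the openintro package is installed and that its county data frame has pop2000 and pop2010 columns. It also drops rows with a missing population before sampling, which is an assumption the prompt does not state. Both helpers use df = n - 1, as the note above permits.

```r
# Paired analysis: one-sample t interval on the row-by-row differences.
paired_ci <- function(x, y, conf = 0.95) {
  d  <- y - x
  n  <- length(d)
  se <- sd(d) / sqrt(n)
  tq <- qt(1 - (1 - conf) / 2, df = n - 1)
  list(estimate = mean(d), se = se, ci = mean(d) + c(-1, 1) * tq * se)
}

# Independent analysis: treats x and y as two unrelated samples.
# Uses df = n - 1 (the simplification allowed above), not the Welch formula.
independent_ci <- function(x, y, conf = 0.95) {
  n   <- length(x)
  est <- mean(y) - mean(x)
  se  <- sqrt(var(x) / length(x) + var(y) / length(y))
  tq  <- qt(1 - (1 - conf) / 2, df = n - 1)
  list(estimate = est, se = se, ci = est + c(-1, 1) * tq * se)
}

# Only runs if openintro is available (assumed, not guaranteed).
if (requireNamespace("openintro", quietly = TRUE)) {
  county <- openintro::county
  # Assumption: drop counties with a missing population in either census.
  county <- county[!is.na(county$pop2000) & !is.na(county$pop2010), ]
  set.seed(1)  # arbitrary seed, only for reproducibility
  rows <- sample(nrow(county), 250)
  x <- log10(county$pop2000[rows])
  y <- log10(county$pop2010[rows])
  print(paired_ci(x, y))
  print(independent_ci(x, y))
}
```

Both helpers return the same point estimate, because the mean of the differences equals the difference of the means. Only the standard error, and therefore the width of the interval, changes. Your numbers will differ from anyone else's, since each random sample of 250 rows is different.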
Include in your post:
1. the point estimate you got in each procedure
2. the standard error you got in each procedure
3. the confidence interval you got in each procedure
Comment briefly on the difference you see. In each case, are you able to conclude with 95% confidence that the population increased from 2000 to 2010 (i.e. that the difference is greater than zero)?
Why we use logs here: Populations tend to grow or shrink by percentages, so looking at absolute differences would not give a very good picture of the changes. In particular, the same percentage growth produces a very different absolute change in a county of 30,000 than in a county of 3 million. Taking logs converts multiplication to addition, so a given percentage change becomes the same log difference regardless of county size, which puts large and small counties on the same scale.
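A small illustration of this point, using two made-up counties (the figures are hypothetical, not taken from the county data set) that both grow by 10%:

```r
# Hypothetical counties: both grow by 10%, from very different starting sizes.
small <- c(before = 30000,   after = 33000)
large <- c(before = 3000000, after = 3300000)

# Raw differences differ by a factor of 100.
raw_small <- unname(small["after"] - small["before"])
raw_large <- unname(large["after"] - large["before"])

# Log differences agree: both equal log10(1.1).
log_small <- unname(log10(small["after"]) - log10(small["before"]))
log_large <- unname(log10(large["after"]) - log10(large["before"]))

print(c(raw_small, raw_large))
print(c(log_small, log_large, log10(1.1)))
```

On the raw scale the large county would dominate any average of differences. On the log scale both counties contribute the same amount.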
Hypothesis of interest: H0: population mean of log10(pop2010) - population mean of log10(pop2000) = 0
vs. H1: population mean of log10(pop2010) - population mean of log10(pop2000) > 0
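1. For the paired case, the sample estimate = mean of the log differences = 0.01877433
For the independent case, the sample estimates are
mean of log10(pop2010) = 4.450429 and mean of log10(pop2000) = 4.431655, so the difference of the means is 0.018774, the same point estimate as in the paired case.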
2. For the paired case, the standard error for the mean of the log difference is approximately 0.0032.
For the independent case, the standard error for the difference of the mean logs is approximately 0.0518.
3. For the paired case, the 95% confidence interval is [0.01241094 0.02513773].
For the independent case, the 95% confidence interval is [-0.08292507 0.12047374].
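4. For the paired case, we can conclude with 95% confidence that the population increased from 2000 to 2010, because the entire interval lies above zero (equivalently, the p-value is less than 0.05).
For the independent case, we cannot conclude with 95% confidence that the population increased from 2000 to 2010, because the interval contains zero (the p-value is greater than 0.05). This is a failure to detect an increase, not evidence that the population did not increase. The paired method is the appropriate one here, since both measurements come from the same counties.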
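As a rough check, the reported intervals can be rebuilt from the point estimate and the rounded standard errors above, using the t quantile with 249 degrees of freedom. Because the standard errors are rounded to four decimals, the endpoints should match the reported ones only approximately.

```r
est    <- 0.01877433                                 # reported point estimate (same in both cases)
se     <- c(paired = 0.0032, independent = 0.0518)   # reported standard errors, rounded
t_crit <- qt(0.975, df = 249)                        # two-sided 95% critical value

# One row per method, with lower and upper endpoints.
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(ci)

# An interval supports "population increased" only if it lies entirely above 0.
print(ci[, "lower"] > 0)
```

The independent standard error is roughly 16 times the paired one, so the independent interval is roughly 16 times as wide, even though both are centered on the same estimate.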