Purpose:
• To create and interpret confidence intervals for the population proportion or population mean.
• To do hypothesis testing on a population proportion or population mean.
Due Date: Nov 27, 2018 at the beginning of class.
What you must deliver:
1. Formulate a statistical hypothesis.
2. Develop a data production strategy.
3. Collect sample data.
4. Solutions to the questions (See below).
5. Reflection.
Suggested ideas to consider:
• Proportion of students at Cañada College who can raise one eyebrow without raising the other eyebrow.
• Mean age of cars driven by (statistics) students and/or mean age of cars driven by faculty.
• Proportion of students at Cañada College who can correctly identify the President, the Vice President, and the Secretary of State.
• Proportion of students at Cañada College who are over the age of 18 and are registered to vote.
• Mean age of evening class students at Cañada College.
• Proportion of student cars that are (white).
• Mean number of hours that students work at Cañada College each week.
• Mean age of books (based on copyright dates) from the library.
• Proportion of books that are over years from the library.
• Proportion of pages of a sample of different issues that contain advertising.
GRADING RUBRIC: Total Score (50)
1. Collect sample data. (5 points)
2. Solutions to the questions (40 points Total)
- Summary of data (5)
- Compute margin of error correctly (5)
- Compute confidence interval correctly (10)
- Perform the hypothesis test correctly (15)
- Interpret the result of the test correctly (5)
3. Reflection. (5 points)
Explore your own Data Set.
1. Select a research question from the given list, or make up your own question. Write down the question selected.
2. Decide whether you would use the point estimate for population mean or population proportion. Describe the population you are targeting.
3. Collect the data. Collect a minimum of 31 data values. Proper data collection methods (e.g., randomization) should be used if possible. If proper methods cannot be used, then this must be acknowledged and the reasoning for using the less than proper methods explained. Describe how you obtained your data in 3-5 sentences.
4. Summarize the data. Use additional pages if necessary.
a. You must provide ALL of your sample data based on the topic you choose.
b. Identify p, p̂, μ, and/or x̄ where appropriate.
c. List the sample size and determine the necessary data values to do the calculation. Use the correct variables.
d. Find the 75%, 95%, and 99% confidence intervals. (Do all three)
e. Determine the Margin of Error for the 75%, 95%, and 99% confidence intervals.
5. Interpret the results of the confidence interval.
6. Hypothesis Testing.
a. Formulate your statistical claim against a population proportion or a population mean. (e.g., Less than 30% of the students at Cañada College...)
b. Show the seven steps to your hypothesis testing and its result.
c. Identify which test (left-tail, right-tail, two-tail), which distribution (z-Test statistics or t-Test statistics), and which method (Critical Value Method or P-Value Method) you used.
d. Supply all necessary work with diagrams.
7. Interpret the results of the hypothesis testing.
STEPS 1-7 can be hand-written, in a legible manner.
8. Reflection: Each student must write up a half-page to one-page reflection, typed, choosing three of the following questions.
a. What were your overall thoughts about this project? Explain any surprises.
b. How did this project help you understand statistics better?
c. Do you feel you worked as efficiently as possible? What can you do to improve your efficiency?
d. Explain how this project is relevant to something you have experienced or seen in the real world.
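For steps 4 and 6, the proportion case can be sketched in Python as follows. This is only an illustration under stated assumptions: the sample counts (9 of 40) are hypothetical, the function names are made up for this example, and the normal approximation is assumed to be reasonable for the sample. It computes the 75%, 95%, and 99% intervals with their margins of error, then runs a left-tail z-test on a claim of the form "less than 30%" using the P-Value Method.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(successes, n, confidence):
    """Return (p_hat, margin_of_error, lower, upper) for a one-proportion z-interval."""
    p_hat = successes / n
    # Two-sided critical value: leave (1 - confidence) / 2 in each tail.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, p_hat - margin, p_hat + margin


def left_tail_proportion_test(successes, n, p0, alpha=0.05):
    """One-proportion z-test of H0: p = p0 against H1: p < p0 (P-value method)."""
    p_hat = successes / n
    # The standard error uses p0 because the test assumes H0 is true.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    p_value = NormalDist().cdf(z)  # area in the left tail
    return z, p_value, p_value < alpha


# Hypothetical sample (not real data): 9 of 40 surveyed students show the trait.
successes, n = 9, 40
for level in (0.75, 0.95, 0.99):
    p_hat, margin, low, high = proportion_interval(successes, n, level)
    print(f"{level:.0%} CI: {low:.3f} to {high:.3f} (margin of error {margin:.3f})")

z, p_value, reject = left_tail_proportion_test(successes, n, p0=0.30)
print(f"z = {z:.3f}, P-value = {p_value:.3f}, reject H0: {reject}")
```

The margin of error grows as the confidence level rises, because a wider interval is needed to capture the population proportion more often. A project based on a sample mean with an unknown population standard deviation would use the t-distribution in place of the z critical values shown here.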
Let’s move on to see how confidence intervals account for that margin of error. To do this, we’ll use the same tools that we’ve been using to understand hypothesis tests. I’ll create a sampling distribution using probability distribution plots, the t-distribution, and the variability in our data. We'll base our confidence interval on the energy cost data set that we've been using.
When we looked at significance levels, the graphs displayed a sampling distribution centered on the null hypothesis value, and the outer 5% of the distribution was shaded. For confidence intervals, we need to shift the sampling distribution so that it is centered on the sample mean and shade the middle 95%.
The shaded area shows the range of sample means that you’d obtain 95% of the time using our sample mean as the point estimate of the population mean. This range [267, 394] is our 95% confidence interval.
Using the graph, it’s easier to understand how a specific confidence interval represents the margin of error, or the amount of uncertainty, around the point estimate. The sample mean is the most likely value for the population mean given the information that we have. However, the graph shows it would not be unusual at all for other random samples drawn from the same population to obtain different sample means within the shaded area. These other likely sample means all suggest different values for the population mean. Hence, the interval represents the inherent uncertainty that comes with using sample data.
You can use these graphs to calculate probabilities for specific values. However, notice that you can’t place the population mean on the graph because that value is unknown. Consequently, you can’t calculate probabilities for the population mean, just as Neyman said!
Why P Values and Confidence Intervals Always Agree About Statistical Significance
You can use either P values or confidence intervals to determine whether your results are statistically significant. If a hypothesis test produces both, these results will agree.
The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the corresponding confidence level is 95%.
For our example, the P value (0.031) is less than the significance level (0.05), which indicates that our results are statistically significant. Similarly, our 95% confidence interval [267, 394] does not include the null hypothesis mean of 260 and we draw the same conclusion.
To understand why the results always agree, let’s recall how both the significance level and confidence level work.
Both the significance level and the confidence level define a distance from a limit to a mean. Guess what? The distances in both cases are exactly the same!
The distance equals the critical t-value * standard error of the mean. For our energy cost example data, the distance works out to be $63.57.
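Here's a minimal Python sketch of why the two views can't disagree, using only the numbers quoted for the energy cost example. One assumption to flag: the sample mean isn't restated in this section, so the code recovers it as the midpoint of the rounded interval endpoints. The function names are made up for illustration.

```python
# Values reported in the post for the energy cost example.
NULL_MEAN = 260.0
MARGIN = 63.57  # critical t-value * standard error of the mean
CI_LOW, CI_HIGH = 267.0, 394.0

# Assumption: the sample mean is taken as the midpoint of the rounded
# interval endpoints, because it is not quoted directly in this section.
sample_mean = (CI_LOW + CI_HIGH) / 2


def significant_by_test(sample_mean, null_mean, margin):
    """Hypothesis-test view: the sample mean lies beyond a critical limit
    located `margin` away from the null hypothesis mean."""
    return abs(sample_mean - null_mean) > margin


def significant_by_interval(sample_mean, null_mean, margin):
    """Confidence-interval view: the null hypothesis mean lies outside
    the interval sample_mean +/- margin."""
    return not (sample_mean - margin <= null_mean <= sample_mean + margin)


print(significant_by_test(sample_mean, NULL_MEAN, MARGIN))
print(significant_by_interval(sample_mean, NULL_MEAN, MARGIN))

# The two views agree for any candidate null value, not just 260.
for candidate in range(200, 461, 10):
    assert significant_by_test(sample_mean, candidate, MARGIN) == \
        significant_by_interval(sample_mean, candidate, MARGIN)
```

Both functions measure the same gap between the two means against the same distance, just from opposite ends, so they return the same answer as long as the margin comes from a matching significance level and confidence level.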
Imagine this discussion between the null hypothesis mean and the sample mean:
Null hypothesis mean, hypothesis test representative: Hey buddy! I’ve found that you’re statistically significant because you’re more than $63.57 away from me!
Sample mean, confidence interval representative: Actually, I’m significant because you’re more than $63.57 away from me!
Very agreeable, aren’t they? And they will always agree as long as you compare the correct pairs of P values and confidence intervals. If you compare the incorrect pair, you can get conflicting results, as shown by common mistake #1 in this post.
Closing Thoughts
In statistical analyses, there tends to be a greater focus on P values and simply detecting a significant effect or difference. However, a statistically significant effect is not necessarily meaningful in the real world. For instance, the effect might be too small to be of any practical value.
It’s important to pay attention to both the magnitude and the precision of the estimated effect. That’s why I'm rather fond of confidence intervals. They allow you to assess these important characteristics along with the statistical significance. You'd like to see a narrow confidence interval where the entire range represents an effect that is meaningful in the real world.