Design a nonexperimental study, and then answer the following
questions:
A. State your hypotheses and null hypotheses.
B. Identify and provide operational definitions of your
variables.
C. Identify your population and describe your sample selection
method and group assignment process.
D. Describe the nonexperimental design you will use to test your
hypotheses. This is the most important part of the midterm.
Please be specific and clear.
E. What are some confounding variables, and how will you control for
them?
It has to be a quasi-experimental design; use your own words.
Research hypotheses versus statistical hypotheses
The first distinction that you need to keep clear in your mind is between research hypotheses and statistical hypotheses. In my ESP study, my overall scientific goal is to demonstrate that clairvoyance exists. In this situation, I have a clear research goal: I am hoping to discover evidence for ESP. In other situations I might actually be a lot more neutral than that, so I might say that my research goal is to determine whether or not clairvoyance exists. Regardless of how I want to portray myself, the basic point that I’m trying to convey here is that a research hypothesis involves making a substantive, testable scientific claim… if you are a psychologist, then your research hypotheses are fundamentally about psychological constructs, and claims like “clairvoyance exists” or “some people can see objects in a clairvoyant fashion” would count as research hypotheses.
Notice that in practice, my research hypotheses could overlap a lot. My ultimate goal in the ESP experiment might be to test an ontological claim like “ESP exists”, but I might operationally restrict myself to a narrower hypothesis like “Some people can ‘see’ objects in a clairvoyant fashion”. That said, there are some things that really don’t count as proper research hypotheses in any meaningful sense.
As you can see, research hypotheses can be somewhat messy at times; and ultimately they are scientific claims. Statistical hypotheses are neither of these two things. Statistical hypotheses must be mathematically precise, and they must correspond to specific claims about the characteristics of the data generating mechanism (i.e., the “population”). Even so, the intent is that statistical hypotheses bear a clear relationship to the substantive research hypotheses that you care about! For instance, in my ESP study my research hypothesis is that some people are able to see through walls or whatever. What I want to do is to “map” this onto a statement about how the data were generated. So let’s think about what that statement would be. The quantity that I’m interested in within the experiment is P("correct"), the true-but-unknown probability with which the participants in my experiment answer the question correctly. Let’s use the Greek letter θ (theta) to refer to this probability. Here are four different statistical hypotheses: for example, θ=.5, θ<.5, θ>.5, or θ≠.5.
All of these are legitimate examples of a statistical hypothesis because they are statements about a population parameter and are meaningfully related to my experiment.
What this discussion makes clear, I hope, is that when attempting to construct a statistical hypothesis test the researcher actually has two quite distinct hypotheses to consider. First, he or she has a research hypothesis (a claim about psychology), and this corresponds to a statistical hypothesis (a claim about the data generating population). In my ESP example, these might be
Dan’s research hypothesis | Dan’s statistical hypothesis |
---|---|
ESP exists | θ≠0.5 |
And the key thing to recognise is this: a statistical hypothesis test is a test of the statistical hypothesis, not the research hypothesis. If your study is badly designed, then the link between your research hypothesis and your statistical hypothesis is broken. To give a silly example, suppose that my ESP study was conducted in a situation where the participant can actually see the card reflected in a window; if that happens, I would be able to find very strong evidence that θ≠0.5, but this would tell us nothing about whether “ESP exists”.
Null hypotheses and alternative hypotheses
So far, so good. I have a research hypothesis that corresponds to what I want to believe about the world, and I can map it onto a statistical hypothesis that corresponds to what I want to believe about how the data were generated. It’s at this point that things get somewhat counterintuitive for a lot of people. Because what I’m about to do is invent a new statistical hypothesis (the “null” hypothesis, H₀) that corresponds to the exact opposite of what I want to believe, and then focus exclusively on that, almost to the neglect of the thing I’m actually interested in (which is now called the “alternative” hypothesis, H₁). In our ESP example, the null hypothesis is that θ=0.5, since that’s what we’d expect if ESP didn’t exist. My hope, of course, is that ESP is totally real, and so the alternative to this null hypothesis is θ≠0.5. In essence, what we’re doing here is dividing up the possible values of θ into two groups: those values that I really hope aren’t true (the null), and those values that I’d be happy with if they turn out to be right (the alternative). Having done so, the important thing to recognise is that the goal of a hypothesis test is not to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Most people find this pretty weird.
The best way to think about it, in my experience, is to imagine that a hypothesis test is a criminal trial… the trial of the null hypothesis. The null hypothesis is the defendant, the researcher is the prosecutor, and the statistical test itself is the judge. Just like a criminal trial, there is a presumption of innocence: the null hypothesis is deemed to be true unless you, the researcher, can prove beyond a reasonable doubt that it is false. You are free to design your experiment however you like (within reason, obviously!), and your goal when doing so is to maximise the chance that the data will yield a conviction… for the crime of being false. The catch is that the statistical test sets the rules of the trial, and those rules are designed to protect the null hypothesis – specifically to ensure that if the null hypothesis is actually true, the chances of a false conviction are guaranteed to be low. This is pretty important: after all, the null hypothesis doesn’t get a lawyer. And given that the researcher is trying desperately to prove it to be false, someone has to protect it.
Two types of errors
Before going into details about how a statistical test is constructed, it’s useful to understand the philosophy behind it. I hinted at it when pointing out the similarity between a null hypothesis test and a criminal trial, but I should now be explicit. Ideally, we would like to construct our test so that we never make any errors. Unfortunately, since the world is messy, this is never possible. Sometimes you’re just really unlucky: for instance, suppose you flip a coin 10 times in a row and it comes up heads all 10 times. That feels like very strong evidence that the coin is biased (and it is!), but of course there’s a 1 in 1024 chance that this would happen even if the coin was totally fair. In other words, in real life we always have to accept that there’s a chance that we did the wrong thing. As a consequence, the goal behind statistical hypothesis testing is not to eliminate errors, but to minimise them.
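To make that arithmetic explicit: each flip of a fair coin lands heads with probability 1/2 and the flips are independent, so

P(10 heads in a row) = (1/2)¹⁰ = 1/1024 ≈ 0.001

In other words, roughly one run in a thousand will look this “biased” even when the coin is perfectly fair.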
At this point, we need to be a bit more precise about what we mean by “errors”. Firstly, let’s state the obvious: it is either the case that the null hypothesis is true, or it is false; and our test will either reject the null hypothesis or retain it. So, as the table below illustrates, after we run the test and make our choice, one of four things might have happened:
 | retain H₀ | reject H₀ |
---|---|---|
H₀ is true | correct decision | error (type I) |
H₀ is false | error (type II) | correct decision |
As a consequence there are actually two different types of error here. If we reject a null hypothesis that is actually true, then we have made a type I error. On the other hand, if we retain the null hypothesis when it is in fact false, then we have made a type II error.
Remember how I said that statistical testing was kind of like a criminal trial? Well, I meant it. A criminal trial requires that you establish “beyond a reasonable doubt” that the defendant did it. All of the evidentiary rules are (in theory, at least) designed to ensure that there’s (almost) no chance of wrongfully convicting an innocent defendant. The trial is designed to protect the rights of a defendant: as the English jurist William Blackstone famously said, it is “better that ten guilty persons escape than that one innocent suffer.” In other words, a criminal trial doesn’t treat the two types of error in the same way… punishing the innocent is deemed to be much worse than letting the guilty go free. A statistical test is pretty much the same: the single most important design principle of the test is to control the probability of a type I error, to keep it below some fixed probability. This probability, which is denoted α, is called the significance level of the test (or sometimes, the size of the test). And I’ll say it again, because it is so central to the whole set-up… a hypothesis test is said to have significance level α if the type I error rate is no larger than α.
So, what about the type II error rate? Well, we’d like to keep those under control too, and we denote this probability by β. However, it’s much more common to refer to the power of the test, which is the probability with which we reject a null hypothesis when it really is false, which is 1−β. To help keep this straight, here’s the same table again, but with the relevant numbers added:
 | retain H₀ | reject H₀ |
---|---|---|
H₀ is true | 1−α (probability of correct retention) | α (type I error rate) |
H₀ is false | β (type II error rate) | 1−β (power of the test) |
A “powerful” hypothesis test is one that has a small value of β, while still keeping α fixed at some (small) desired level. By convention, scientists make use of three different α levels: .05, .01 and .001. Notice the asymmetry here… the tests are designed to ensure that the α level is kept small, but there’s no corresponding guarantee regarding β. We’d certainly like the type II error rate to be small, and we try to design tests that keep it small, but this is very much secondary to the overwhelming need to control the type I error rate. As Blackstone might have said if he were a statistician, it is “better to retain 10 false null hypotheses than to reject a single true one”. To be honest, I don’t know that I agree with this philosophy – there are situations where I think it makes sense, and situations where I think it doesn’t – but that’s neither here nor there. It’s how the tests are built.
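To make the α/β asymmetry concrete, here is a minimal simulation sketch (Python with numpy; not part of the original text). It runs the ESP experiment many times under the null (θ=0.5) and under one illustrative alternative (θ=0.6 is an assumed effect size, chosen purely for this example), applying the two-sided decision rule that gets constructed later in this section (reject when X≤40 or X≥60):

```python
import numpy as np

rng = np.random.default_rng(42)
n_reps, N = 100_000, 100  # simulated experiments, participants per experiment

def reject(x):
    # Two-sided decision rule from the text: reject H0 when X <= 40 or X >= 60.
    return (x <= 40) | (x >= 60)

# Type I error rate: how often we reject when H0 (theta = 0.5) is true.
# Expect a value near .05 (slightly above, because X is discrete).
x_null = rng.binomial(N, 0.5, size=n_reps)
print("estimated alpha:", reject(x_null).mean())

# Power: how often we reject when H0 is false. theta = 0.6 is an arbitrary
# illustrative effect size, not a value taken from the text.
x_alt = rng.binomial(N, 0.6, size=n_reps)
print("estimated power (1 - beta):", reject(x_alt).mean())
```

Nothing guarantees that the second number is large: power depends on the true effect size and the sample size, which is exactly the asymmetry described above.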
Test statistics and sampling distributions
At this point we need to start talking specifics about how a hypothesis test is constructed. To that end, let’s return to the ESP example. Let’s ignore the actual data that we obtained, for the moment, and think about the structure of the experiment. Regardless of what the actual numbers are, the form of the data is that X out of N people correctly identified the colour of the hidden card. Moreover, let’s suppose for the moment that the null hypothesis really is true: ESP doesn’t exist, and the true probability that anyone picks the correct colour is exactly θ=0.5. What would we expect the data to look like? Well, obviously, we’d expect the proportion of people who make the correct response to be pretty close to 50%. Or, to phrase this in more mathematical terms, we’d say that X/N is approximately 0.5. Of course, we wouldn’t expect this fraction to be exactly 0.5: if, for example, we tested N=100 people, and X=53 of them got the question right, we’d probably be forced to concede that the data are quite consistent with the null hypothesis. On the other hand, if X=99 of our participants got the question right, then we’d feel pretty confident that the null hypothesis is wrong. Similarly, if only X=3 people got the answer right, we’d be similarly confident that the null was wrong. Let’s be a little more technical about this: we have a quantity X that we can calculate by looking at our data; after looking at the value of X, we make a decision about whether to believe that the null hypothesis is correct, or to reject the null hypothesis in favour of the alternative. The name for this thing that we calculate to guide our choices is a test statistic.
Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause us to reject the null hypothesis, and which values would cause us to keep it. In order to do so, we need to determine what the sampling distribution of the test statistic would be if the null hypothesis were actually true (we talked about sampling distributions earlier in Section 10.3.1). Why do we need this? Because this distribution tells us exactly what values of X our null hypothesis would lead us to expect. And therefore, we can use this distribution as a tool for assessing how closely the null hypothesis agrees with our data.
Figure 11.1: The sampling distribution for our test statistic X when the null hypothesis is true. For our ESP scenario, this is a binomial distribution. Not surprisingly, since the null hypothesis says that the probability of a correct response is θ=.5, the sampling distribution says that the most likely value is 50 (out of 100) correct responses. Most of the probability mass lies between 40 and 60.
How do we actually determine the sampling distribution of the test statistic? For a lot of hypothesis tests this step is actually quite complicated, and later on in the book you’ll see me being slightly evasive about it for some of the tests (some of them I don’t even understand myself). However, sometimes it’s very easy. And, fortunately for us, our ESP example provides us with one of the easiest cases. Our population parameter θ is just the overall probability that people respond correctly when asked the question, and our test statistic X is the count of the number of people who did so, out of a sample size of N. We’ve seen a distribution like this before, in Section 9.4: that’s exactly what the binomial distribution describes! So, to use the notation and terminology that I introduced in that section, we would say that the null hypothesis predicts that X is binomially distributed, which is written

X ∼ Binomial(θ,N)

Since the null hypothesis states that θ=0.5 and our experiment has N=100 people, we have the sampling distribution we need. This sampling distribution is plotted in Figure 11.1. No surprises really: the null hypothesis says that X=50 is the most likely outcome, and it says that we’re almost certain to see somewhere between 40 and 60 correct responses.
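If you want to inspect this sampling distribution yourself, a short sketch using scipy (an assumption on my part; any binomial pmf/cdf routine would do) reproduces the numbers quoted above:

```python
from scipy.stats import binom

N, theta = 100, 0.5            # sample size, and the value of theta under H0
dist = binom(N, theta)         # X ~ Binomial(theta = 0.5, N = 100)

print(dist.pmf(50))            # X = 50 is the single most likely outcome
print(dist.cdf(60) - dist.cdf(39))  # P(40 <= X <= 60): almost all of the mass
```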
Making decisions
Okay, we’re very close to being finished. We’ve constructed a test statistic (X), and we chose this test statistic in such a way that we’re pretty confident that if X is close to N/2 then we should retain the null, and if not we should reject it. The question that remains is this: exactly which values of the test statistic should we associate with the null hypothesis, and exactly which values go with the alternative hypothesis? In my ESP study, for example, I’ve observed a value of X=62. What decision should I make? Should I choose to believe the null hypothesis, or the alternative hypothesis?
Critical regions and critical values
To answer this question, we need to introduce the concept of a critical region for the test statistic X. The critical region of the test corresponds to those values of X that would lead us to reject the null hypothesis (which is why the critical region is also sometimes called the rejection region). How do we find this critical region? Well, let’s consider what we know:

- X should be either very big or very small in order to reject the null hypothesis.
- If the null hypothesis is true, the sampling distribution of X is Binomial(0.5,N).
- If α=.05, the critical region must cover 5% of this sampling distribution.
It’s important to make sure you understand this last point: the critical region corresponds to those values of X for which we would reject the null hypothesis, and the sampling distribution in question describes the probability that we would obtain a particular value of X if the null hypothesis were actually true. Now, let’s suppose that we chose a critical region that covers 20% of the sampling distribution, and suppose that the null hypothesis is actually true. What would be the probability of incorrectly rejecting the null? The answer is of course 20%. And therefore, we would have built a test that had an α level of 0.2. If we want α=.05, the critical region is only allowed to cover 5% of the sampling distribution of our test statistic.
Figure 11.2: The critical region associated with the hypothesis test for the ESP study, for a hypothesis test with a significance level of α=.05. The plot itself shows the sampling distribution of X under the null hypothesis: the grey bars correspond to those values of X for which we would retain the null hypothesis. The black bars show the critical region: those values of X for which we would reject the null. Because the alternative hypothesis is two-sided (i.e., allows both θ<.5 and θ>.5), the critical region covers both tails of the distribution. To ensure an α level of .05, we need to ensure that each of the two regions encompasses 2.5% of the sampling distribution.
As it turns out, those three things uniquely solve the problem: our critical region consists of the most extreme values, known as the tails of the distribution. This is illustrated in Figure 11.2. As it turns out, if we want α=.05, then our critical regions correspond to X≤40 and X≥60. That is, if the number of correct responses is between 41 and 59, then we should retain the null hypothesis. If the number is between 0 and 40 or between 60 and 100, then we should reject the null hypothesis. The numbers 40 and 60 are often referred to as the critical values, since they define the edges of the critical region.
At this point, our hypothesis test is essentially complete: (1) we choose an α level (e.g., α=.05); (2) we come up with some test statistic (e.g., X) that does a good job (in some meaningful sense) of comparing H₀ to H₁; (3) we figure out the sampling distribution of the test statistic on the assumption that the null hypothesis is true (in this case, binomial); and then (4) we calculate the critical region that produces an appropriate α level (0–40 and 60–100). All that we have to do now is calculate the value of the test statistic for the real data (e.g., X=62) and then compare it to the critical values to make our decision. Since 62 is greater than the critical value of 60, we would reject the null hypothesis. Or, to phrase it slightly differently, we say that the test has produced a significant result.
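Putting those four steps together in code: the sketch below (again Python with scipy, assumed here rather than taken from the text) recovers the critical values and applies the decision rule to the observed data. Note that because X is discrete, the achieved type I error rate is only approximately .05:

```python
from scipy.stats import binom

N, alpha = 100, 0.05
dist = binom(N, 0.5)              # step 3: sampling distribution under H0

# Step 4: two-sided critical region, with (roughly) alpha/2 in each tail.
lower = int(dist.ppf(alpha / 2))  # -> 40: reject when X <= 40
upper = N - lower                 # -> 60 by symmetry: reject when X >= 60
print(lower, upper)

X = 62                            # the observed ESP data
if X <= lower or X >= upper:
    print("significant: reject H0")   # this branch fires, since 62 >= 60
else:
    print("not significant: retain H0")
```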
A note on statistical “significance”
Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners.
– Attributed to G. O. Ashley
A very brief digression is in order at this point, regarding the word “significant”. The concept of statistical significance is actually a very simple one, but has a very unfortunate name. If the data allow us to reject the null hypothesis, we say that “the result is statistically significant”, which is often shortened to “the result is significant”. This terminology is rather old, and dates back to a time when “significant” just meant something like “indicated”, rather than its modern meaning, which is much closer to “important”. As a result, a lot of modern readers get very confused when they start learning statistics, because they think that a “significant result” must be an important one. It doesn’t mean that at all. All that “statistically significant” means is that the data allowed us to reject a null hypothesis. Whether or not the result is actually important in the real world is a very different question, and depends on all sorts of other things.
The difference between one sided and two sided tests
There’s one more thing I want to point out about the hypothesis test that I’ve just constructed. If we take a moment to think about the statistical hypotheses I’ve been using,

H₀: θ=.5
H₁: θ≠.5

we notice that the alternative hypothesis covers both the possibility that θ<.5 and the possibility that θ>.5. This makes sense if I really think that ESP could produce better-than-chance performance or worse-than-chance performance (and there are some people who think that). In statistical language, this is an example of a two-sided test. It’s called this because the alternative hypothesis covers the area on both “sides” of the null hypothesis, and as a consequence the critical region of the test covers both tails of the sampling distribution (2.5% on either side if α=.05), as illustrated earlier in Figure 11.2.
However, that’s not the only possibility. It might be the case, for example, that I’m only willing to believe in ESP if it produces better-than-chance performance. If so, then my alternative hypothesis would only cover the possibility that θ>.5, and as a consequence the null hypothesis now becomes θ≤.5:

H₀: θ≤.5
H₁: θ>.5

When this happens, we have what’s called a one-sided test, and the critical region only covers one tail of the sampling distribution. This is illustrated in Figure 11.3.
Figure 11.3: The critical region for a one-sided test. In this case, the alternative hypothesis is that θ>.5, so we would only reject the null hypothesis for large values of X. As a consequence, the critical region only covers the upper tail of the sampling distribution; specifically the upper 5% of the distribution. Contrast this with the two-sided version in Figure 11.2.
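For completeness, a one-sided critical value can be found the same way. This sketch (same assumed scipy setup as above) looks for the smallest count c such that P(X≥c) ≤ .05 under the null:

```python
from scipy.stats import binom

N, alpha = 100, 0.05
dist = binom(N, 0.5)

# Smallest c with P(X >= c) <= alpha under the null; reject only when X >= c.
c = int(dist.ppf(1 - alpha)) + 1
print(c)                 # the one-sided critical value (59 for N = 100)
print(dist.sf(c - 1))    # achieved type I error rate, at most .05
```

Because all of α goes into one tail, the one-sided critical value (59) is less extreme than its two-sided counterpart (60), which is why one-sided tests are more powerful against effects in the predicted direction.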
The p value of a test
In one sense, our hypothesis test is complete; we’ve constructed a test statistic, figured out its sampling distribution if the null hypothesis is true, and then constructed the critical region for the test. Nevertheless, I’ve actually omitted the most important number of all: the p value. It is to this topic that we now turn. There are two somewhat different ways of interpreting a p value, one proposed by Sir Ronald Fisher and the other by Jerzy Neyman. Both versions are legitimate, though they reflect very different ways of thinking about hypothesis tests. Most introductory textbooks tend to give Fisher’s version only, but I think that’s a bit of a shame. To my mind, Neyman’s version is cleaner, and actually better reflects the logic of the null hypothesis test. You might disagree though, so I’ve included both. I’ll start with Neyman’s version…