Consider the example:
It's Friday night and you want to watch a movie. There are three movies that caught your eye, but you're not really sure whether they're good or not. The ratings are given in the following figure.
The Emoji Movie might not be that appealing, and there's a clear competition between Interstellar and Star Wars …
To settle which movie is rated best, you decide to run some statistical tests and compare the three rating distributions.
Hypothesis Testing
1. Defining Your Hypothesis
For your Friday movie night, what you really want to know is whether one movie is significantly better than the others. In this case, you can build your hypothesis on the difference between the average ratings your friends gave to each movie.
The hypotheses read as follows. Null Hypothesis (H0): the mean rating of movie A is equal to the mean rating of movie B. Alternative Hypothesis (H1): the mean rating of movie A is not equal to the mean rating of movie B.
2. Set the Significance Level of the Statistical Test
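The goal of the statistical test is to try to reject the Null Hypothesis, which states that there's no observable change or behaviour. The significance level is the probability you're willing to accept of rejecting the Null Hypothesis when it is actually true.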
In your Friday night movie quest, not identifying a good movie to watch has very minimal consequences: some potentially wasted time, and a bit of frustration. But you can see the importance of setting the appropriate significance level in scenarios like clinical trials, where you're testing a new drug or treatment.
The significance levels that are normally used are 1% and 5%.
For this movie night pick we can settle at 5%, i.e., alpha = 0.05.
What Statistical Test To Use?
Welch’s t-Test
This is also called the unequal variances t-test. It’s an adaptation of Student’s t-Test and still requires the data to be normally distributed. However, it takes into account both variances when computing the test.
In the Friday night movie example, the size of the dataset is going to be the same for both movies. But with Welch's t-test, we make sure that the variance of each rating distribution is factored in when checking whether there is a significant difference between ratings.
With Welch's t-Test, you calculate the test statistic for each pair of distributions. Any statistical software reports it once you run the test.
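As a rough illustration of what the software computes, here is a minimal sketch of Welch's test statistic in plain Python. The two rating lists are made up for the example (they are not the data behind the figure), and the helper name `welch_t` is hypothetical. The sketch also returns the Welch-Satterthwaite degrees of freedom, which the software needs later to turn the statistic into a p-value.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances (n - 1 denominator)
    # Squared standard error of the difference in means; each sample
    # contributes its own variance, which is the point of Welch's test.
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


# Hypothetical ratings on a 1-5 scale, for illustration only.
movie_a = [5, 4, 5, 4, 5, 4, 4, 5]
movie_b = [2, 3, 1, 2, 3, 2, 2, 3]

t_stat, dof = welch_t(movie_a, movie_b)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

In practice you would likely call a library routine instead, for example SciPy's `scipy.stats.ttest_ind` with `equal_var=False`, which also returns the p-value.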
Now the significance level comes back into play, because you're ready to draw a conclusion about the data.
Alongside the test statistic, your software of choice will also provide you with the p-value. The p-value is the probability of observing a test statistic at least as extreme as the one you got, given that the Null Hypothesis is true.
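To make that definition concrete, the sketch below computes a two-sided p-value from a test statistic and its degrees of freedom using only the Python standard library. It integrates the Student's t density numerically with Simpson's rule, which is an assumption made to keep the example dependency-free and is not necessarily how your software does it. The function names `t_pdf` and `two_sided_p` and the example inputs are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|) under the Null Hypothesis.

    Uses Simpson's rule on [0, |t_stat|]; steps must be even.
    """
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3  # P(0 <= T <= |t_stat|)
    # Both tails count as "extreme", so subtract the central mass twice.
    return max(0.0, 1.0 - 2.0 * central_mass)


# Hypothetical test statistic and degrees of freedom, for illustration only.
print(two_sided_p(2.5, 12.0))
```

The larger the absolute value of the test statistic, the less probability is left in the tails, and the smaller the p-value.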
In this Friday movie night scenario, the p-value is the probability of seeing a difference between the two mean ratings at least as large as the one you observed, if the two movies really had the same mean rating.
You ran the test and got the test statistic and the p-value. Now you can compare the p-value with the significance level to determine whether there's a statistically significant difference between the datasets.
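The comparison itself is a one-line rule, sketched here under the assumption that the significance level is the 0.05 chosen earlier. The function name `decide` and the p-values are hypothetical and only illustrate the two possible outcomes.

```python
ALPHA = 0.05  # significance level, fixed before running the test


def decide(p_value, alpha=ALPHA):
    """Reject the Null Hypothesis only when the p-value is below alpha."""
    if p_value < alpha:
        return "reject H0"
    return "fail to reject H0"


# Hypothetical p-values, for illustration only.
for label, p in [("pair with a large gap", 0.0004),
                 ("pair with a small gap", 0.41)]:
    print(f"{label}: p = {p} -> {decide(p)}")
```

Note that the second outcome is worded as "fail to reject" and not "accept": a large p-value means the data don't give enough evidence against the Null Hypothesis, not that the Null Hypothesis is true.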
Crunching all the data with the statistical software of your choice, you get the following results:
Interstellar vs The Emoji Movie
The Emoji Movie vs Star Wars: The Last Jedi
The absolute values of the test statistics above are large, which already suggests a significant difference between the movies in each of the two pairs.
Comparing the significance level with each p-value, you can safely reject the Null Hypothesis, which states that there's no difference between the mean rating of these movies.
This applies to both Interstellar vs The Emoji Movie and The Emoji Movie vs Star Wars: The Last Jedi, because in both cases the p-value is much smaller than the significance level of 0.05 we set before running the test.
You just concluded that there’s actually a significant difference between the average rating of The Emoji Movie (2.2 units) compared with both Interstellar (4.35 units) and Star Wars (4.5 units).
Given that the average rating of the latter movies is significantly higher, you can safely exclude The Emoji Movie from your candidate list.
Now there are only two contestants left …
Interstellar vs Star Wars: The Last Jedi
From these results you can't conclude that there is a statistically significant difference between these two movies. If you recall, their average ratings are very close: 4.35 compared to 4.5 units.
Even though it's tempting to say the Null Hypothesis is true, and that there is no difference between the two means, you can't.
What you can say is that you don't have enough empirical evidence to reject the Null Hypothesis.
If you want to abide by the rules of statistics, you have a technical tie.
As a tie-breaker, you could ask the opinion of an unbiased third party or just watch the one that has the highest average rating.