Describe and explain the correct use of each of the following four devices for determining whether a variable is normally distributed: frequency histogram, normal probability (Q-Q) plot, Shapiro-Wilk (W) test, skewness test.
1) Frequency Histogram:
A histogram is a data visualization that shows the distribution of a variable. It groups the values into bins (intervals) and shows how many observations fall into each bin, which is what distributions are about.
A classical bell-shaped, symmetric histogram, with most of the frequency counts bunched in the middle and the counts dying off in the tails, suggests that the data are approximately normally distributed. The apparent shape depends on the number of bins and on the sample size, so the histogram is an informal visual check rather than proof of normality.
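As a rough illustration of the binning described above, here is a minimal sketch in Python using only the standard library. The function name `text_histogram`, the choice of 10 equal-width bins, and the simulated sample (mean 50, standard deviation 10) are assumptions made for this example, not part of the question.

```python
import random

def text_histogram(values, bins=10, width=40):
    """Return (counts, lines): per-bin counts and a text rendering of the histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; cannot form bins")
    bin_width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        idx = min(int((v - lo) / bin_width), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        left_edge = lo + i * bin_width
        bar = "#" * round(width * c / peak)
        lines.append(f"{left_edge:8.2f} | {bar} {c}")
    return counts, lines

# Simulated, illustrative data only (fixed seed for repeatability).
rng = random.Random(42)
sample = [rng.gauss(50, 10) for _ in range(1000)]

counts, lines = text_histogram(sample)
print("\n".join(lines))
```

For a sample like this one, the bars should be longest near the middle bins and shrink toward both ends. Changing `bins` shows how much the visual impression depends on the binning.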
2) Normal probability (Q-Q) Plot:
A normal probability plot, or more specifically a normal quantile-quantile (Q-Q) plot, compares the distribution of the data with the expected normal distribution. It plots the sorted observed values against the corresponding theoretical quantiles of a standard normal distribution, one point per observation.
For normally distributed data, the points should lie approximately on a straight line in the Q-Q plot. If the data are non-normal, the points form a curve that deviates markedly from a straight line. Possible outliers appear as points at the ends of the line that lie far from the bulk of the observations.
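The construction of the Q-Q points can be sketched as follows, again with the Python standard library only. The plotting positions `(i - 0.5) / n` are one common convention among several, and the helper names `normal_qq_points` and `pearson_r` are made up for this example. The correlation of the points is used here only as a rough numeric summary of how straight the plot is; it is not a substitute for looking at the plot.

```python
import math
import random
from statistics import NormalDist, mean

def normal_qq_points(values):
    """Pair each theoretical standard-normal quantile with a sorted observation."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

def pearson_r(pairs):
    """Pearson correlation of the (x, y) pairs; near 1 means nearly a straight line."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Simulated, illustrative data only (fixed seed for repeatability).
rng = random.Random(0)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
skewed_sample = [rng.expovariate(1.0) for _ in range(500)]

qq_normal = normal_qq_points(normal_sample)
qq_skewed = normal_qq_points(skewed_sample)
r_normal = pearson_r(qq_normal)
r_skewed = pearson_r(qq_skewed)
print(f"normal sample: r = {r_normal:.4f}")
print(f"skewed sample: r = {r_skewed:.4f}")
```

Plotting the pairs in `qq_normal` should give a nearly straight line, while the pairs in `qq_skewed` bend away from it, which is the curvature described above.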
3) Shapiro-Wilk test:
The Shapiro-Wilk test is used to test whether the data come from a normal distribution.
Null hypothesis: the data are normal.
Alternative hypothesis: the data are non-normal.
The test rejects the hypothesis of normality when the p-value is less than or equal to 0.05, the conventional 5% significance level. Failing the normality test means the data show a statistically significant departure from the normal distribution: if the data really were normal, a result this extreme would occur no more than 5% of the time. It does not mean there is a 95% probability that the data are non-normal. Passing the normality test only allows you to state that no significant departure from normality was found.
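The decision rule above might look like this in practice. This sketch assumes SciPy is installed and uses its `scipy.stats.shapiro` function, which returns the W statistic and the p-value; the wrapper `shapiro_decision`, the 0.05 threshold as a default argument, and the simulated samples are assumptions for illustration.

```python
import random
from scipy import stats

ALPHA = 0.05  # conventional significance level, as in the text

def shapiro_decision(values, alpha=ALPHA):
    """Run the Shapiro-Wilk test and report whether normality is rejected."""
    w_stat, p_value = stats.shapiro(values)
    reject = bool(p_value <= alpha)
    return float(w_stat), float(p_value), reject

# Simulated, illustrative data only (fixed seed for repeatability).
rng = random.Random(1)
normal_sample = [rng.gauss(100, 15) for _ in range(200)]
skewed_sample = [rng.expovariate(1.0) for _ in range(200)]

w_normal, p_normal, reject_normal = shapiro_decision(normal_sample)
w_skewed, p_skewed, reject_skewed = shapiro_decision(skewed_sample)

for label, w, p, reject in [
    ("normal sample", w_normal, p_normal, reject_normal),
    ("skewed sample", w_skewed, p_skewed, reject_skewed),
]:
    verdict = "reject normality" if reject else "no significant departure found"
    print(f"{label}: W = {w:.4f}, p = {p:.4g} -> {verdict}")
```

Note that the wording of the second verdict is deliberate: a large p-value is reported as "no significant departure found", not as confirmation that the data are normal.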
4) Skewness test:
Skewness is a measure of the asymmetry of the probability distribution of a random variable about its mean. In other words, skewness tells you the amount and direction of skew (departure from horizontal symmetry).
The skewness value can be positive or negative, or even undefined. Perfectly symmetrical data have a skewness of 0, although this is quite unlikely for real-world data.
General rule of thumb:
a) If skewness is less than -1 or greater than 1, the distribution is highly skewed.
b) If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
c) If skewness is between -0.5 and 0.5, the distribution is approximately symmetric.
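If skewness is not close to zero, then your data set is unlikely to be normally distributed. A skewness close to zero does not by itself establish normality, however, because other symmetric distributions (for example, the uniform distribution) also have zero skewness.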
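The rule of thumb above can be sketched in a few lines of Python. The skewness computed here is the simple moment-based form (third central moment divided by the 1.5 power of the second central moment); statistical packages often apply a small-sample bias correction, so their values may differ slightly. The function names and the handling of values exactly at ±0.5 and ±1 are assumptions of this example, since the rule of thumb does not say which category the boundaries belong to.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are identical")
    return m3 / m2 ** 1.5

def classify_skewness(g1):
    """Apply the rule of thumb; boundary values are assigned by assumption."""
    if abs(g1) > 1:
        return "highly skewed"
    if abs(g1) >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

# Small made-up data sets for illustration.
symmetric_data = [1, 2, 3, 4, 5]
skewed_data = [1, 1, 1, 1, 10]

g_symmetric = sample_skewness(symmetric_data)
g_skewed = sample_skewness(skewed_data)
print(f"{symmetric_data}: skewness = {g_symmetric:.3f} ({classify_skewness(g_symmetric)})")
print(f"{skewed_data}: skewness = {g_skewed:.3f} ({classify_skewness(g_skewed)})")
```

The second data set has one large value on the right, so its skewness is positive and falls in the "highly skewed" category.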