Using NumPy (or similar tools), how do you calculate the population mean and population standard deviation of a column of values? Also, how do you calculate the range of values around the mean that corresponds to a 95% confidence interval for the values in the column?
numpy.mean()
The arithmetic mean is the sum of the elements along an axis divided by the number of elements. The numpy.mean() function returns the arithmetic mean of the elements in the array. If an axis is specified, the mean is calculated along that axis; otherwise it is taken over the whole array.
Example:-
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5], [4, 5, 6]])

print('Our array is:')
print(a)
print()
# No axis given: mean over all nine elements
print('Applying mean() function:')
print(np.mean(a))
print()
# axis=0: one mean per column
print('Applying mean() function along axis 0:')
print(np.mean(a, axis=0))
print()
# axis=1: one mean per row
print('Applying mean() function along axis 1:')
print(np.mean(a, axis=1))
With Python 3 and a recent NumPy, it will produce output like the following −
Our array is:
[[1 2 3]
 [3 4 5]
 [4 5 6]]

Applying mean() function:
3.6666666666666665

Applying mean() function along axis 0:
[2.66666667 3.66666667 4.66666667]

Applying mean() function along axis 1:
[2. 4. 5.]
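Since the question is about a single column, here is a minimal sketch of selecting one column from a 2-D array and taking its mean. The array `data` and the choice of column index 1 are illustrative assumptions, not part of the original question.

```python
import numpy as np

# Hypothetical data: each row is an observation, each column a variable.
data = np.array([[1, 2, 3],
                 [3, 4, 5],
                 [4, 5, 6]], dtype=float)

# Select every row of the second column (index 1).
column = data[:, 1]

# Mean of that column only: (2 + 4 + 5) / 3
column_mean = column.mean()
print(column_mean)
```

The same value appears as the second entry of np.mean(data, axis=0), which computes all column means at once.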
Standard Deviation
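Standard deviation is the square root of the average of the squared deviations from the mean. Dividing by the number of elements N, as here, gives the population standard deviation, which is what numpy.std() computes by default. The formula for standard deviation is as follows −
std = sqrt(mean(abs(x - x.mean())**2))
If the array is [1, 2, 3, 4], then its mean is 2.5. Hence the squared deviations are [2.25, 0.25, 0.25, 2.25]. Their sum is 5, their mean is 5/4 = 1.25, and the standard deviation is the square root of that mean, i.e., sqrt(5/4), which is approximately 1.118.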
Example:-
import numpy as np

# np.std() uses ddof=0 by default, i.e. the population standard deviation
print(np.std([1, 2, 3, 4]))
It will produce the following output −
1.118033988749895
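To make the link between the formula and numpy.std() explicit, the sketch below computes the population standard deviation by hand and compares it with NumPy's result. It also shows the `ddof=1` variant, which divides by N - 1 and gives the sample standard deviation. The variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=float)

# Population standard deviation straight from the formula:
# square root of the mean of squared deviations from the mean.
manual_std = np.sqrt(np.mean((x - x.mean()) ** 2))

# np.std defaults to ddof=0, so it divides by N (population form).
population_std = np.std(x)

# ddof=1 divides by N - 1 instead (sample standard deviation).
sample_std = np.std(x, ddof=1)

print(manual_std, population_std, sample_std)
```

Use the default when the column is the entire population of interest, and `ddof=1` when the column is a sample used to estimate a larger population's spread.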
95% confidence interval
A 95% confidence interval means that if we were to take 100 different samples and compute a 95% confidence interval from each sample, then approximately 95 of the 100 confidence intervals would contain the true mean value (μ). In practice, however, we select one random sample and generate one confidence interval, which may or may not contain the true mean. The observed interval may over- or underestimate μ. Consequently, the 95% CI is read as a plausible range for the true, unknown parameter.
The confidence interval does not reflect variability in the unknown parameter, which is fixed. Rather, it reflects the amount of random error in the sample and provides a range of values that is likely to include the unknown parameter. Another way of thinking about a confidence interval is that it is the range of likely values of the parameter (defined as the point estimate ± margin of error) with a specified level of confidence (which is similar to a probability).
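The repeated-sampling reading can be checked with a small simulation. This is only a sketch: the population parameters, sample size, number of repetitions, and the normal critical value 1.96 are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Hypothetical population and simulation settings.
true_mean, true_std = 50.0, 10.0
n_samples, sample_size = 1000, 40
z = 1.96  # normal critical value for a 95% interval

# Each row is one independent random sample from the population.
samples = rng.normal(true_mean, true_std, size=(n_samples, sample_size))

# Point estimate and standard error for every sample.
means = samples.mean(axis=1)
std_errors = samples.std(axis=1, ddof=1) / np.sqrt(sample_size)

# One confidence interval per sample.
lower = means - z * std_errors
upper = means + z * std_errors

# Fraction of intervals that contain the true mean.
covered = (lower < true_mean) & (true_mean < upper)
coverage = covered.mean()
print(coverage)
```

The printed fraction should land close to 0.95. It is usually slightly lower here, because using 1.96 with an estimated standard deviation is an approximation for a sample size of 40.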
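Suppose we want to generate a 95% confidence interval estimate for an unknown population mean. Before the sample is drawn, there is a 95% probability that the interval we are about to compute will contain the true population mean. Thus, P([sample mean] - margin of error < μ < [sample mean] + margin of error) = 0.95, where the sample mean and the margin of error are the random quantities and μ is fixed. For a reasonably large sample, the margin of error is commonly taken as 1.96 × s / sqrt(n), where s is the standard deviation of the values and n is the number of values.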
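Putting the pieces together for one column, here is a minimal sketch. The data values are made up, and 1.96 assumes the values are roughly normally distributed; for a column this short, a t-distribution critical value would be the more careful choice. Two different ranges are shown because the question can be read either way: the range expected to hold about 95% of the individual values, and the 95% confidence interval for the mean.

```python
import numpy as np

# Hypothetical column of values.
column = np.array([12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0])

mean = column.mean()
pop_std = column.std()  # ddof=0: population standard deviation
z = 1.96                # assumed normal critical value for 95%

# Range expected to hold about 95% of individual values,
# assuming the column is roughly normally distributed.
value_range = (mean - z * pop_std, mean + z * pop_std)

# 95% confidence interval for the mean: the margin of error uses
# the sample standard deviation (ddof=1) divided by sqrt(n).
margin = z * column.std(ddof=1) / np.sqrt(column.size)
ci = (mean - margin, mean + margin)

print("mean:", mean)
print("population std:", pop_std)
print("95% range of values:", value_range)
print("95% CI for the mean:", ci)
```

The confidence interval for the mean is narrower than the range of values, because the standard error shrinks with sqrt(n) while the spread of individual values does not.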