Discuss the three properties (characteristics) of data and explain some of the descriptive measures associated with each property.
The three properties of data are
a) Central Tendency of data
b) Dispersion of data
c) Correlation of data
a) Central Tendency of data :
A measure of central tendency (also referred to as a measure of
center or central location) is a summary measure that attempts to
describe a whole set of data with a single value that represents
the middle or center of its distribution.
There are three main measures of central tendency: the mean, the
median and the mode. Each of these measures gives a
different indication of the typical or central value in the
distribution.
Mean : The mean is the sum of the values of all observations in a dataset divided by the number of observations. This is also known as the arithmetic average.
Looking at the retirement age distribution :
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number
of observations (11) which equals 56.6 years.
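The same arithmetic can be checked with a short Python sketch. The variable names here are illustrative and not part of the original answer; the data is the retirement age list above.

```python
from statistics import mean

# Retirement ages from the example above.
retirement_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Mean = sum of the values divided by the number of observations.
total = sum(retirement_ages)
count = len(retirement_ages)
manual_mean = total / count

print(total, count)            # 623 11
print(round(manual_mean, 1))   # 56.6

# The standard library computes the same quantity.
library_mean = mean(retirement_ages)
print(round(library_mean, 1))  # 56.6
```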
Median : The median is the middle value in a
distribution when the values are arranged in ascending or
descending order.
The median divides the distribution in half (50% of the
observations lie on either side of the median value). In a distribution
with an odd number of observations, the median value is the middle
value.
Looking at the retirement age distribution (which has 11
observations), the median is the middle value, which is 57
years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the
median value is the mean of the two middle values. In the following
distribution, the two middle values are 56 and 57, therefore the
median equals 56.5 years:
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
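Both cases (odd and even number of observations) can be sketched in Python as follows. The helper name `middle_value` is hypothetical and used only for illustration; `statistics.median` from the standard library applies the same rule.

```python
from statistics import median

odd_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]        # 11 observations
even_ages = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]   # 12 observations

def middle_value(values):
    """Return the median: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value exists.
        return ordered[mid]
    # Even count: average the two values around the center.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(middle_value(odd_ages))    # 57
print(middle_value(even_ages))   # 56.5

# Library equivalent.
print(median(odd_ages), median(even_ages))  # 57 56.5
```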
Mode : The mode is the most commonly occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people,
in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement
age data.
Age | Frequency
54 | 3
55 | 1
56 | 1
57 | 2
58 | 2
60 | 2
The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
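As a rough sketch, the frequency distribution and the mode can be built in Python with `collections.Counter`. The variable names are illustrative only.

```python
from collections import Counter

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

# Count how often each age occurs (the frequency distribution).
frequency = Counter(ages)
for age in sorted(frequency):
    print(age, "|", frequency[age])

# The mode is the value with the highest frequency.
mode_age, mode_count = frequency.most_common(1)[0]
print(mode_age, mode_count)  # 54 3
```

Note that `most_common(1)` reports only one value even when several values tie for the highest frequency, so a dataset with more than one mode would need a check on the counts.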
b) Dispersion of data :
The measures of central tendency alone are not adequate to describe data. Two data sets can have the same mean and yet be entirely different. Thus, to describe data, one also needs to know the extent of its variability, which is given by the measures of dispersion. The range and the standard deviation are two commonly used measures of dispersion.
Range : The difference between the lowest and highest values.
In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9, so the range is 9 − 3 = 6.
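A minimal Python sketch of the range follows. The `narrow` and `wide` lists are made-up values, chosen only to illustrate two data sets that share a mean but differ in spread; `value_range` is a hypothetical helper name.

```python
from statistics import mean

def value_range(values):
    """Range = highest value minus lowest value."""
    return max(values) - min(values)

print(value_range([4, 6, 9, 3, 7]))  # 6

# Two illustrative data sets with the same mean but different spread.
narrow = [9, 10, 11]
wide = [0, 10, 20]
print(mean(narrow), mean(wide))                # 10 10
print(value_range(narrow), value_range(wide))  # 2 20
```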
Standard deviation : The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean. It is calculated as the square root of the variance, where the variance is the average of the squared differences between each data point and the mean. The further the data points are from the mean, the more spread out the data and the higher the standard deviation.
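The Formula for Standard Deviation :
$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
where x_i is the value of the ith observation, \bar{x} is the mean and n is the number of observations. For a sample rather than a whole population, the divisor n is replaced by n − 1.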
Example : The standard deviation can be thought of as a typical distance between the data points and their mean. The mean of 8, 9, 10, 11 and 12 is 10 (8+9+10+11+12 = 50; 50/5 = 10). The distances of these numbers from 10 are 2, 1, 0, 1 and 2. Their simple average, (2+1+0+1+2)/5 = 6/5 = 1.2, is the mean absolute deviation, not the standard deviation. For the standard deviation, the distances are squared before averaging: (4+1+0+1+4)/5 = 10/5 = 2, which is the variance, and its square root is about 1.41. In our case, the Mean = 10 and the Standard Deviation ≈ 1.41 (treating the five values as the whole population; dividing by n − 1 = 4 instead gives a sample standard deviation of about 1.58).
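The following Python sketch works through the values 8, 9, 10, 11 and 12 and compares two quantities that are easy to confuse: the average of the plain distances from the mean (the mean absolute deviation, 1.2) and the standard deviation, which squares the distances first. Variable names are illustrative; `pstdev` and `stdev` are the standard library's population and sample versions.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

data = [8, 9, 10, 11, 12]
center = mean(data)  # 10

# Mean absolute deviation: average of the plain distances from the mean.
mean_abs_dev = sum(abs(x - center) for x in data) / len(data)

# Population standard deviation: square the distances, average them,
# then take the square root.
variance = sum((x - center) ** 2 for x in data) / len(data)
population_sd = sqrt(variance)

print(mean_abs_dev)             # 1.2
print(variance)                 # 2.0
print(round(population_sd, 2))  # 1.41
print(round(pstdev(data), 2))   # 1.41 (library, population version)
print(round(stdev(data), 2))    # 1.58 (library, sample version, divides by n - 1)
```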
c) Correlation of data:
Correlation is a bivariate analysis that measures the strength of association between two variables and the direction of the relationship. In terms of the strength of the relationship, the value of the correlation coefficient varies between +1 and -1. A value of ± 1 indicates a perfect degree of association between the two variables. As the correlation coefficient approaches 0, the relationship between the two variables becomes weaker. The direction of the relationship is indicated by the sign of the coefficient; a + sign indicates a positive relationship and a – sign indicates a negative relationship. Usually, in statistics, four types of correlation are measured; of these, the Pearson r correlation is the most commonly used.
Pearson r correlation: Pearson r correlation is the most widely used correlation statistic to measure the degree of the relationship between linearly related variables. For example, in the stock market, if we want to measure how two stocks are related to each other, Pearson r correlation is used to measure the degree of relationship between the two. The following formula is used to calculate the Pearson r correlation:
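$$r_{xy} = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$
where
r_xy = Pearson r correlation coefficient between x and y
n = number of observations
x_i = value of x (for ith observation)
y_i = value of y (for ith observation)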
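As a minimal sketch, the Pearson r coefficient can be computed in Python from the sums over x_i and y_i, using the common computational form of the formula in terms of n, x_i and y_i. The function name `pearson_r` and the sample lists are assumptions made for illustration; they are not data from the original answer.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson r from sums of x_i, y_i, x_i*y_i, x_i^2 and y_i^2."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of observations")
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

# Hypothetical observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))  # 0.7746, a fairly strong positive relationship

# A perfectly linear relationship gives a coefficient of (approximately) +1 or -1.
r_positive = pearson_r(x, [2 * xi for xi in x])
r_negative = pearson_r(x, [-xi for xi in x])
print(round(r_positive, 6), round(r_negative, 6))
```

The function raises a `ZeroDivisionError` when either variable is constant, since the denominator is then zero and the correlation is undefined.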