I have to clean up a data spreadsheet to make a histogram, bar chart, etc. My question is: if an entry on my spreadsheet says 10-15 or 20+, shouldn't I just change those to 15 and 20? Most numbers in my column are whole numbers.
No, you don't need to change 10-15 to just 15. In fact, doing so would be wrong.
A histogram tells you how many items lie in each range. You can create ranges like:
0.5-5.5
5.5-10.5
10.5-15.5
15.5-20.5
20.5-25.5
Remember the upper limit is not included in the interval.
Here is an example of creating a histogram.
Here is the data on starting salaries of a group of 54 people. When constructing a histogram it is helpful to sort the observations.
8870 10800 12000 12500 13000 14000 15000 16000 16500 16600 16700 16900 16900 17000 17000 17600 17880 18000 18000 18000 18000 18000 18000 18000 18000 18000 18000 18500 18680 19100 20000 20000 20000 20000 20000 20300 20900 22000 23000 23000 23000 23000 23400 24000 25000 25000 26000 26000 27000 30000 30000 32500 37000 48785
Minimum = 8870 Maximum = 48785 Range = 39915.
First, decide how many intervals you would like. A rule of thumb is to take the square root of the number of observations and round it up. Here, the square root of 54 is about 7.35; round up and use 8.
The interval width should then be approximately equal to the range divided by the number of intervals. Range/number of Intervals = 39915/8 = 4989.375; I'll round up to the conveniently even figure of 5000.
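The arithmetic in these two steps can be sketched in a few lines of Python. This is only an illustration: the variable names are made up here, and rounding the width up to the next multiple of 1000 is an assumption chosen to reproduce the "convenient" figure of 5000, not a fixed rule.

```python
import math

n = 54                          # number of observations
minimum, maximum = 8870, 48785  # smallest and largest salary

data_range = maximum - minimum                 # 39915
num_intervals = math.ceil(math.sqrt(n))        # sqrt(54) is about 7.35, rounded up to 8
raw_width = data_range / num_intervals         # 4989.375

# Assumption: "convenient" means the next multiple of 1000.
width = math.ceil(raw_width / 1000) * 1000     # 5000

print(num_intervals, raw_width, width)
```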
Start the first interval at a convenient value below the minimum. Here the minimum is 8870, so we begin at 7500.
The intervals then begin at 7500 and have a width of 5000. So, the first interval runs from 7500 to 12500, the second from 12500 to 17500 and so on. By convention, we agree that an interval includes the lower boundary point, but does not include the upper boundary point. So, for instance, a value of 7500 falls in the [7500, 12500) interval, but a value of 12500 does not. A value of 12500 falls instead in the [12500, 17500) interval.
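To make the boundary convention concrete, here is a small sketch that finds the interval for a given value. The function name `interval_for` is hypothetical, and it assumes whole-number values with the start of 7500 and width of 5000 chosen above.

```python
def interval_for(value, start=7500, width=5000):
    """Return (lower, upper) for the interval containing value.

    The lower bound is included and the upper bound is excluded.
    """
    index = (value - start) // width   # integer division picks the interval number
    lower = start + index * width
    return (lower, lower + width)

print(interval_for(7500))    # lower boundary belongs to the first interval
print(interval_for(12499))   # still in the first interval
print(interval_for(12500))   # boundary value moves to the second interval
```

Because integer division rounds down, a value sitting exactly on a boundary is always assigned to the interval that starts there.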
Construct a simple table including each interval, the count of observations in that interval and the relative frequency or percentage of observations in the interval.
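As a rough sketch, the table can be built from the salary data above with the same start (7500) and width (5000). The variable names and the printed layout are assumptions made for illustration. Note that the number of rows is taken from the data rather than fixed at 8: with these choices the eighth interval ends at 47500, so the maximum of 48785 lands in a ninth interval.

```python
from collections import Counter

salaries = [
    8870, 10800, 12000, 12500, 13000, 14000, 15000, 16000, 16500, 16600,
    16700, 16900, 16900, 17000, 17000, 17600, 17880, 18000, 18000, 18000,
    18000, 18000, 18000, 18000, 18000, 18000, 18000, 18500, 18680, 19100,
    20000, 20000, 20000, 20000, 20000, 20300, 20900, 22000, 23000, 23000,
    23000, 23000, 23400, 24000, 25000, 25000, 26000, 26000, 27000, 30000,
    30000, 32500, 37000, 48785,
]

start, width = 7500, 5000
total = len(salaries)

# Interval number for each salary: lower bound included, upper bound excluded.
index_counts = Counter((s - start) // width for s in salaries)

# Enough rows to reach the largest observation, including empty intervals.
num_rows = max(index_counts) + 1
counts = [index_counts[i] for i in range(num_rows)]

print(f"{'Interval':<16}{'Count':>6}{'Percent':>9}")
for i, count in enumerate(counts):
    lower = start + i * width
    label = f"[{lower}, {lower + width})"
    print(f"{label:<16}{count:>6}{100 * count / total:>8.1f}%")
```

The counts in this table are the bar heights of the histogram; the percentages give the relative-frequency version of the same chart.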