Explain how a data analyst would interpret the boxplot. What does it mean when you look at it?
A box plot is a very powerful tool for understanding our data. Using box plots we can better understand our data by seeing its distribution, median, quartiles, spread and outliers. Note that a standard box plot does not show the mean or the variance directly; the spread is read from the inter-quartile range and the whiskers. A box plot packs all of this information about our data into a single concise diagram. It allows us to understand the nature of our data at a single glance.
Consider the diagram below:
Every box plot has two parts, a box and whiskers, as you can see in the figure. That's why it is also sometimes called the box-and-whiskers plot. The start of the box, i.e. the lower quartile, marks the 25th percentile of our data set. So by looking at the diagram we can instantly conclude that 25% of our data has a value less than 6.2. Similarly, the end of the box, i.e. the upper quartile, marks the 75th percentile of our data. So again from the diagram we can conclude that 75% of our data is less than 8.8. The bold black line in the box represents the median value of our data, the point that half of the data lies below. In our example the median lies at about 7.8. The difference between the upper quartile and the lower quartile is called the inter-quartile range, which here is 8.8 - 6.2 = 2.6. So basically the entire red box spans the inter-quartile range and contains the middle 50% of our data.
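To make the quartiles concrete, here is a minimal sketch that computes the numbers behind the box using only Python's standard library. The sample values are made up for illustration (they are not the data behind the figure); they were chosen so that the quartiles come out at 6.2, 7.8 and 8.8, matching the values read off the diagram. The function name `box_summary` is also just an illustrative choice.

```python
import statistics

def box_summary(values):
    """Return the three quartiles and the inter-quartile range of the data."""
    data = sorted(values)
    # method="inclusive" treats the data as the full population and
    # interpolates linearly between neighbouring sorted values.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}

# Hypothetical sample, not the data set shown in the figure.
sample = [5.0, 5.8, 6.2, 7.1, 7.8, 8.3, 8.8, 9.4, 10.1]
summary = box_summary(sample)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Different tools use slightly different interpolation rules for quartiles, so the box edges can shift a little from one plotting library to another, especially on small data sets.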
The following diagram will explain the quartiles even further:
Now for outliers
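Now let's talk about the whiskers of the box plot and how we visualize outliers in it. The whiskers are not a fixed length. In the most common convention each whisker extends from the edge of the box to the most extreme data point that still lies within 1.5 times the inter-quartile range of that edge, that is, no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR. Anything outside the whiskers is considered an outlier and is usually drawn as an individual point.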
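The 1.5 × IQR rule can be sketched in a few lines of Python. This assumes the common convention described above (some tools use a different multiplier or draw whiskers to the minimum and maximum instead). The sample data and the function name `whisker_ends_and_outliers` are hypothetical and only serve the illustration.

```python
import statistics

def whisker_ends_and_outliers(values, k=1.5):
    """Apply the k * IQR rule: return (low whisker, high whisker, outliers)."""
    data = sorted(values)
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Points beyond the fences are flagged as outliers.
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    # Each whisker ends at the most extreme point still inside the fences.
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return min(inside), max(inside), outliers

# Hypothetical sample with one unusually large value.
sample = [5.0, 5.8, 6.2, 7.1, 7.8, 8.3, 8.8, 9.4, 15.0]
low_whisker, high_whisker, outliers = whisker_ends_and_outliers(sample)

print("whiskers:", low_whisker, high_whisker)
print("outliers:", outliers)
```

Here the upper fence sits at about 8.8 + 1.5 × 2.6 = 12.7, so the value 15.0 is flagged as an outlier and the upper whisker stops at 9.4, the largest value inside the fence.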
Identify Skewness
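We can also identify the skewness of our data by observing the shape of the box plot. If the box plot is symmetric, with the median in the middle of the box and whiskers of roughly equal length, our data is roughly symmetric. A normal distribution looks like this, but symmetry alone does not prove that the data is normal. If our box plot is not symmetric, it shows that our data is skewed. A median close to the lower quartile with a longer upper whisker indicates right (positive) skew, while a median close to the upper quartile with a longer lower whisker indicates left (negative) skew. You can get a better understanding by looking at the diagrams below: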
Here is a box plot with respect to the distribution curve:
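As a rough numerical companion to reading skewness off the box, one option is the quartile (Bowley) skewness coefficient, which compares how far the median sits from each edge of the box. This is an added illustration rather than something the box plot itself displays, and the function name `quartile_skewness` is an illustrative choice. The inputs below are the quartile values read off the example diagram earlier (6.2, 7.8 and 8.8).

```python
def quartile_skewness(q1, median, q3):
    """Bowley skewness: 0 for a centred median, positive for right skew,
    negative for left skew. The result always lies between -1 and 1."""
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("quartile skewness is undefined when the IQR is zero")
    return (q3 + q1 - 2 * median) / iqr

# Quartiles read off the example box plot discussed earlier.
skew = quartile_skewness(6.2, 7.8, 8.8)

if skew > 0:
    print(f"right (positive) skew inside the box: {skew:.2f}")
elif skew < 0:
    print(f"left (negative) skew inside the box: {skew:.2f}")
else:
    print("median is centred in the box")
```

For the example the coefficient is negative, because the median (7.8) lies closer to the upper quartile (8.8) than to the lower quartile (6.2). This only describes the middle 50% of the data, so the whisker lengths and outliers should still be checked before drawing a conclusion about the whole distribution.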