In an effort to characterize the New Guinea crocodile (Crocodylus novaeguineae), measurements were taken of the dorsal cranial length (mm) (the length of the skull from the tip of the nose to the back of the cranial cap, denoted DCL) and the total length (cm) (denoted TL) of 50 harvested adult males. (Data on next page.)
1. Construct a histogram and a boxplot for each of the variables DCL and TL. Comment on the symmetry of the distribution of each variable.
2. Construct a scatterplot with DCL on the vertical axis and TL on the horizontal axis. Based on this plot, what can be said about the relationship between DCL and TL (i.e. if you vary DCL, what happens to TL)?
Data
TL DCL Observation
130 169 1
102 154 2
126 160 3
230 290 4
115 151 5
150 209 6
259 344 7
130 183 8
110 153 9
130 183 10
185 237 11
215 288 12
129 187 13
149 189 14
156 203 15
100 143 16
224 294 17
234 318 18
162 229 19
217 299 20
206 283 21
144 198 22
146 203 23
166 229 24
203 275 25
205 266 26
252 350 27
238 318 28
250 330 29
255 351 30
120 169 31
250 332 32
238 307 33
157 205 34
159 216 35
202 261 36
177 237 37
221 288 38
224 294 39
167 232 40
240 316 41
207 268 42
192 242 43
180 248 44
165 226 45
197 267 46
113 162 47
131 183 48
162 234 49
246 310 50
The raw data for our analysis, in tabular form:
TL (cm) | DCL (mm) |
130 | 169 |
102 | 154 |
126 | 160 |
230 | 290 |
115 | 151 |
150 | 209 |
259 | 344 |
130 | 183 |
110 | 153 |
130 | 183 |
185 | 237 |
215 | 288 |
129 | 187 |
149 | 189 |
156 | 203 |
100 | 143 |
224 | 294 |
234 | 318 |
162 | 229 |
217 | 299 |
206 | 283 |
144 | 198 |
146 | 203 |
166 | 229 |
203 | 275 |
205 | 266 |
252 | 350 |
238 | 318 |
250 | 330 |
255 | 351 |
120 | 169 |
250 | 332 |
238 | 307 |
157 | 205 |
159 | 216 |
202 | 261 |
177 | 237 |
221 | 288 |
224 | 294 |
167 | 232 |
240 | 316 |
207 | 268 |
192 | 242 |
180 | 248 |
165 | 226 |
197 | 267 |
113 | 162 |
131 | 183 |
162 | 234 |
246 | 310 |
Note: All the required plots (the histograms, the boxplots and the scatterplot) were constructed using Python's Seaborn library in order to complete the project within the time available. The steps for plotting them manually are given here for reference:
A. The following steps construct a histogram for a given data attribute:
1- Find the smallest and the largest number in the given data.
2- Find the range of the data by subtracting the smallest number from the largest number.
3- Now find the class width from the range. There is no hard-and-fast rule for this, so we do it intuitively by dividing the range by 5 for small ranges and by 10 for large ranges. E.g. in our case, for the attribute TL, we divide the range (259 - 100 = 159) by 10 and round to obtain a class width of 16 units.
4- Make a frequency table having these two columns:
a) Intervals of one class width each, starting at the minimum and continuing until the maximum is covered, and
b) The number of data points falling in each interval.
5- Draw perpendicular lines, the x- and y-axes. Put the frequencies on the y-axis and the lower bound of each interval on the x-axis, and label the x-axis with the attribute name.
6- Draw one bar per interval. Each bar spans from the lower bound of its interval to the lower bound of the next interval, and its height is the frequency of that interval.
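The frequency table from steps 1 to 4 can be sketched in plain Python as follows. This is only an illustration for the TL attribute, not the code used for the plots; the function name frequency_table and the divisor parameter are made up for this sketch, and it assumes integer measurements.

```python
# TL values (cm) in observation order, copied from the data table.
TL = [130, 102, 126, 230, 115, 150, 259, 130, 110, 130,
      185, 215, 129, 149, 156, 100, 224, 234, 162, 217,
      206, 144, 146, 166, 203, 205, 252, 238, 250, 255,
      120, 250, 238, 157, 159, 202, 177, 221, 224, 167,
      240, 207, 192, 180, 165, 197, 113, 131, 162, 246]


def frequency_table(values, divisor=10):
    """Return (class_width, rows); each row is (lower, upper, count).

    Hypothetical helper: divisor=10 follows the "large range" choice
    in step 3, and values are assumed to be integers.
    """
    smallest, largest = min(values), max(values)        # step 1
    data_range = largest - smallest                     # step 2
    width = max(1, round(data_range / divisor))         # step 3
    n_classes = data_range // width + 1                 # enough classes to cover the max
    counts = [0] * n_classes
    for v in values:                                    # step 4: count per interval
        counts[(v - smallest) // width] += 1
    rows = [(smallest + i * width, smallest + (i + 1) * width, c)
            for i, c in enumerate(counts)]
    return width, rows


width, rows = frequency_table(TL)
for lower, upper, count in rows:
    print(f"[{lower}, {upper}): {count}")
```

Each printed row is one bar of the histogram: the interval gives the bar's position on the x-axis and the count gives its height.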
B. The following steps construct a boxplot for a given data attribute:
1- Find the min and max values of the attribute.
2- Calculate the following 3 descriptive quantities of the attribute:
a) median (Q2)
Formula:
- Sort the data in ascending order.
- If the total number of data points n is odd, the median is the {(n+1)/2}th observation.
- If the total number of data points n is even, the median is the average of the (n/2)th and the {(n/2)+1}th observations.
b) 25th percentile (Q1)
c) 75th percentile (Q3)
3- Draw a horizontal rectangular box whose left edge is at Q1 and whose right edge is at Q3.
4- Divide the rectangle by drawing a vertical line inside it at the median, positioned between Q1 and Q3 according to its value on the scale. Call it Q2.
5- Extend horizontal lines from the Q1 edge out to the min value and from the Q3 edge out to the max value. Call their ends min and max.
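The five quantities a boxplot needs can be sketched in plain Python as below. The median follows the rule in step 2a. The steps above do not say how Q1 and Q3 are computed, so this sketch assumes one common convention: Q1 and Q3 are the medians of the lower and upper halves of the sorted data. Other conventions (including the one a plotting library uses) can give slightly different quartiles. The function names are made up for this illustration.

```python
# TL values (cm) in observation order, copied from the data table.
TL = [130, 102, 126, 230, 115, 150, 259, 130, 110, 130,
      185, 215, 129, 149, 156, 100, 224, 234, 162, 217,
      206, 144, 146, 166, 203, 205, 252, 238, 250, 255,
      120, 250, 238, 157, 159, 202, 177, 221, 224, 167,
      240, 207, 192, 180, 165, 197, 113, 131, 162, 246]


def median(sorted_values):
    """Median of an already sorted list, following step 2a."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]                       # the ((n+1)/2)th observation
    # average of the (n/2)th and ((n/2)+1)th observations
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are medians of the lower and upper halves;
    the middle point is left out of both halves when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + (n % 2):]
    return data[0], median(lower), median(data), median(upper), data[-1]


print(five_number_summary(TL))  # (100, 144, 178.5, 224, 259)
```

With these values the box runs from 144 to 224, the median line sits at 178.5, and the whiskers reach 100 and 259.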
C. The following steps construct a scatterplot for a given pair of data attributes:
1- Draw perpendicular lines, the x- and y-axes. Place the first attribute name on the y-axis and the second attribute name on the x-axis.
2- Plot one dot for each observation, at the position given by its pair of values on the two axes.
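For comparison with the manual steps, a minimal Seaborn sketch of the three kinds of plot is given below. It is not the exact code used for the figures in this answer: the function name draw_plots, the figure sizes and bins=10 are assumptions made for illustration, and it needs the seaborn and matplotlib packages installed.

```python
# TL (cm) and DCL (mm) in observation order, copied from the data table.
TL = [130, 102, 126, 230, 115, 150, 259, 130, 110, 130,
      185, 215, 129, 149, 156, 100, 224, 234, 162, 217,
      206, 144, 146, 166, 203, 205, 252, 238, 250, 255,
      120, 250, 238, 157, 159, 202, 177, 221, 224, 167,
      240, 207, 192, 180, 165, 197, 113, 131, 162, 246]
DCL = [169, 154, 160, 290, 151, 209, 344, 183, 153, 183,
       237, 288, 187, 189, 203, 143, 294, 318, 229, 299,
       283, 198, 203, 229, 275, 266, 350, 318, 330, 351,
       169, 332, 307, 205, 216, 261, 237, 288, 294, 232,
       316, 268, 242, 248, 226, 267, 162, 183, 234, 310]


def draw_plots(tl=TL, dcl=DCL):
    """Histogram and boxplot per attribute, then DCL against TL."""
    # Imported inside the function so the data lists above can be
    # reused even where the plotting packages are not installed.
    import matplotlib.pyplot as plt
    import seaborn as sns

    for label, values in (("TL (cm)", tl), ("DCL (mm)", dcl)):
        fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
        sns.histplot(x=values, bins=10, ax=ax_hist)     # bins=10 is an assumption
        ax_hist.set_xlabel(label)
        sns.boxplot(x=values, ax=ax_box)
        ax_box.set_xlabel(label)

    # Scatterplot with DCL on the vertical axis and TL on the horizontal axis.
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.scatterplot(x=tl, y=dcl, ax=ax)
    ax.set_xlabel("TL (cm)")
    ax.set_ylabel("DCL (mm)")
    plt.show()
```

Calling draw_plots() opens three figures: one histogram and boxplot pair for each attribute, and the scatterplot.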
1. Construct a histogram and a boxplot for each of the variables DCL and TL. Comment on the symmetry of the distribution of each variable.
ANS)
DCL attribute univariate analysis:
a. Histogram:
b. Boxplot
OBSERVATIONS:
1- The data is approximately normally distributed with a marginal right skew.
2- The skew is minimal, and no extreme values (outliers) are present.
3- Since more than one peak is apparent, the distribution is bimodal, hinting at the presence of two independent groups or clusters.
4- Since the distribution does not have a sharp peak, it is platykurtic (negative excess kurtosis): the values are not tightly concentrated around the mean but have a healthy spread.
5- The range of values, about 200 mm (143 to 351), is large, indicating high variation in cranial length within the species.
TL attribute univariate analysis:
a. Histogram:
b. Boxplot
OBSERVATIONS:
1- The data is approximately normally distributed with a very marginal right skew.
2- The skew is minimal, and no extreme values (outliers) are present in this attribute either.
3- Since more than one peak is apparent, the distribution is bimodal, with a high chance that two independent groups are present.
4- It is a platykurtic (negative excess kurtosis) distribution: it has a healthy spread and is not tightly concentrated around the mean.
5- The range of values, about 160 cm (100 to 259), is large, indicating high variation in total length. It is numerically smaller than the range of the DCL attribute, although the two attributes are measured in different units (cm and mm), so their spreads are not directly comparable.
In conclusion, both attributes have roughly identical, approximately normal distributions, each with two possible hidden groups.
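The remarks on skew and kurtosis above are read off the plots. As a rough numerical cross-check, the moment-based skewness and excess kurtosis can be computed in plain Python as sketched below. The formulas are the simple population versions (no small-sample correction), which is an assumption of this sketch, and the function name shape_summary is made up for illustration.

```python
# TL (cm) and DCL (mm) in observation order, copied from the data table.
TL = [130, 102, 126, 230, 115, 150, 259, 130, 110, 130,
      185, 215, 129, 149, 156, 100, 224, 234, 162, 217,
      206, 144, 146, 166, 203, 205, 252, 238, 250, 255,
      120, 250, 238, 157, 159, 202, 177, 221, 224, 167,
      240, 207, 192, 180, 165, 197, 113, 131, 162, 246]
DCL = [169, 154, 160, 290, 151, 209, 344, 183, 153, 183,
       237, 288, 187, 189, 203, 143, 294, 318, 229, 299,
       283, 198, 203, 229, 275, 266, 350, 318, 330, 351,
       169, 332, 307, 205, 216, 261, 237, 288, 294, 232,
       316, 268, 242, 248, 226, 267, 162, 183, 234, 310]


def shape_summary(values):
    """Return (mean, skewness, excess kurtosis) from central moments.

    Assumption: population moments (divide by n), no bias correction.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n       # variance
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5                           # 0 for a symmetric sample
    excess_kurtosis = m4 / m2 ** 2 - 3                  # negative means platykurtic
    return mean, skewness, excess_kurtosis


for name, values in (("TL", TL), ("DCL", DCL)):
    mean, skew, kurt = shape_summary(values)
    print(f"{name}: mean={mean:.2f}, skewness={skew:.3f}, excess kurtosis={kurt:.3f}")
```

A skewness near zero supports the "marginal skew" reading, and a negative excess kurtosis supports the platykurtic reading.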
2. Construct a scatterplot with DCL on the vertical axis and TL on the horizontal axis. Based on this plot, what can be said about the relationship between DCL and TL (i.e. if you vary DCL, what happens to TL)?
OBSERVATIONS
The scatterplot of DCL against TL suggests a very strong positive correlation, i.e. there is a positive dependency between the two quantities. So as DCL increases, TL tends to increase as well, and vice versa.
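The strength of the relationship seen in the scatterplot can be quantified with the Pearson correlation coefficient. The sketch below computes it in plain Python from its definition; the function name pearson_r is made up for illustration, and this calculation is an addition to the visual reading above, not part of the original plots.

```python
import math

# TL (cm) and DCL (mm) in observation order, copied from the data table.
TL = [130, 102, 126, 230, 115, 150, 259, 130, 110, 130,
      185, 215, 129, 149, 156, 100, 224, 234, 162, 217,
      206, 144, 146, 166, 203, 205, 252, 238, 250, 255,
      120, 250, 238, 157, 159, 202, 177, 221, 224, 167,
      240, 207, 192, 180, 165, 197, 113, 131, 162, 246]
DCL = [169, 154, 160, 290, 151, 209, 344, 183, 153, 183,
       237, 288, 187, 189, 203, 143, 294, 318, 229, 299,
       283, 198, 203, 229, 275, 266, 350, 318, 330, 351,
       169, 332, 307, 205, 216, 261, 237, 288, 294, 232,
       316, 268, 242, 248, 226, 267, 162, 183, 234, 310]


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(TL, DCL)
print(f"Pearson r between TL and DCL: {r:.3f}")
```

A value of r close to +1 agrees with the tight, upward-sloping band of points in the scatterplot.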