Please attempt both parts for an upvote.
a) For the following dataset:
A | B | C |
82.95406 | 48.596 | 62.83925 |
80.2694 | 94.88806 | 69.11351 |
51.32409 | 87.00438 | 5.26083 |
84.40903 | 73.14477 | 67.37821 |
7.744191 | 70.83899 | 30.42249 |
70.09185 | 96.19882 | 35.38787 |
27.85478 | 70.86354 | 10.82541 |
36.31444 | 54.53047 | 94.30487 |
78.58975 | 88.44509 | 91.97403 |
78.83427 | 97.59331 | 67.44993 |
40.58147 | 62.05577 | 67.98824 |
5.522503 | 0.005762 | 28.78233 |
51.0516 | 75.53139 | 82.53751 |
22.99913 | 6.099075 | 16.05481 |
37.90452 | 78.80319 | 33.0078 |
90.42208 | 68.23812 | 86.88297 |
59.52895 | 23.34578 | 8.346984 |
96.59504 | 52.17967 | 75.20052 |
98.23697 | 87.31435 | 97.50355 |
56.66422 | 25.66281 | 27.79151 |
16.59429 | 84.47958 | 61.71686 |
53.90397 | 10.89486 | 93.26763 |
55.11838 | 13.11304 | 75.92159 |
71.32999 | 70.36975 | 10.86584 |
40.88035 | 84.11119 | 97.83293 |
88.07786 | 10.15206 | 76.98687 |
86.25806 | 68.54747 | 98.22674 |
14.63472 | 37.58765 | 68.50834 |
48.94452 | 77.09557 | 45.1666 |
83.50869 | 20.72787 | 33.30376 |
59.14445 | 55.82262 | 96.20811 |
1.253421 | 18.14296 | 71.29829 |
32.03952 | 22.48347 | 1.707322 |
82.10399 | 54.66754 | 71.42761 |
1.551587 | 88.15809 | 13.04672 |
55.40726 | 71.10242 | 10.2861 |
66.0299 | 17.13271 | 90.60817 |
70.02227 | 49.47755 | 9.984934 |
11.2358 | 99.71097 | 2.637771 |
54.2171 | 64.7902 | 28.8158 |
Please examine the data: does 68.26% of the data fall within one standard deviation of the mean? Does 95% fall within 2 standard deviations? Does 99.7% fall within 3 standard deviations? Does the statistical analysis behave as expected?
Please also create a histogram for the data. Use 10 "bins" in your histogram.
b) For the following dataset:
A | B |
65.77503 | 64.79644 |
87.57873 | 81.42366 |
69.16423 | 47.8631 |
78.7769 | 74.97734 |
39.29159 | 36.33522 |
83.14534 | 67.22618 |
49.35916 | 36.51458 |
45.42246 | 61.71659 |
83.51742 | 86.33629 |
88.21379 | 81.29251 |
51.31862 | 56.87516 |
2.764132 | 11.43687 |
63.2915 | 69.70683 |
14.5491 | 15.05101 |
58.35385 | 49.90517 |
79.3301 | 81.84772 |
41.43736 | 30.40724 |
74.38735 | 74.65841 |
92.77566 | 94.35162 |
41.16351 | 36.70618 |
50.53694 | 54.26358 |
32.39941 | 52.68882 |
34.11571 | 48.051 |
70.84987 | 50.85519 |
62.49577 | 74.27482 |
49.11496 | 58.4056 |
77.40276 | 84.34409 |
26.11119 | 40.24357 |
63.02004 | 57.0689 |
52.11828 | 45.84677 |
57.48353 | 70.39173 |
9.698192 | 30.23156 |
27.26149 | 18.74344 |
68.38577 | 69.39971 |
44.85484 | 34.25214 |
63.25484 | 45.59859 |
41.5813 | 57.92359 |
59.74991 | 43.16158 |
55.47338 | 37.86151 |
59.50365 | 49.27436 |
Determine the mean and standard deviation for each of these new sets and create histograms for them as well. Examine the standard deviation ranges and comment on whether these new sets follow a "Normal" distribution.
R code:
# Read the data (assumes a one-column CSV with no header row, so the column is named V1)
x <- read.csv(file.choose(), header = FALSE)
v <- x$V1
m <- mean(v)
s <- sd(v)
# Percentage of values that lie within one sd of the mean
mean(abs(v - m) < s) * 100
# Percentage of values that lie within two sd of the mean
mean(abs(v - m) < 2 * s) * 100
# Percentage of values that lie within three sd of the mean
mean(abs(v - m) < 3 * s) * 100
# Check whether the data are normal: Q-Q plot with a reference line
qqnorm(v)
qqline(v)
Running the above code shows that 58.33333% of the values lie within one sd of the mean.
However, 100% of the values lie within two sd of the mean.
Likewise, 100% of the values lie within three sd of the mean.
This happens because the standard deviation is very large relative to the range of the data, so the interval mean ± 2 sd already covers every observation.
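To see why a large spread gives this pattern, here is a minimal sketch that assumes the values are spread evenly over 0 to 100. That uniform shape and the bounds `lo` and `hi` are illustrative assumptions, not something stated in the question. Under that assumption, roughly 57.74% of values fall within one sd and 100% within two or three sd, which is close to the percentages reported above.

```r
# Assumption: values are spread evenly (uniformly) between lo and hi.
# These bounds are illustrative, not taken from the question.
lo <- 0
hi <- 100
mu <- (lo + hi) / 2
sigma <- (hi - lo) / sqrt(12)  # sd of a continuous uniform distribution

k <- 1:3
# Share of a uniform distribution that lies within k sd of its mean
coverage <- punif(mu + k * sigma, lo, hi) - punif(mu - k * sigma, lo, hi)
round(coverage * 100, 2)
```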
The rule that 68.26%, 95% and 99.7% of values lie within one, two and three sd of the mean holds only if the data are normally distributed.
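The percentages in this rule come straight from the normal distribution, which can be confirmed with a short sketch using the standard normal CDF `pnorm`. The variable names here are only illustrative.

```r
# Share of a normal distribution within k sd of its mean, for k = 1, 2, 3
k <- 1:3
normal_coverage <- pnorm(k) - pnorm(-k)
round(normal_coverage * 100, 2)
```

The results are about 68.27, 95.45 and 99.73, so the commonly quoted 95% is a rounded figure.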
So we checked the Q-Q (quantile-quantile) plot of the data, which showed that the data do not follow a normal distribution.
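Histogram R code:
# Confirm the data range so that the breaks cover every value
min(v)
max(v)
# Histogram with 10 equal-width bins from 0 to 100
hist(v, breaks = seq(0, 100, by = 10), main = "Histogram of data", xlab = "Bins")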
Output: [Figure: histogram titled "Histogram of data", x-axis "Bins"]
The histogram also makes it evident that the data do not follow a normal distribution.
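The same checks carry over to the columns in part (b). As a minimal sketch, the helper below (`sd_coverage` is a hypothetical name, not part of the original answer) wraps the percentage calculation so it can be applied to any numeric vector. It is shown here on column A of part (b), typed in directly from the table rather than read from a file.

```r
# Hypothetical helper: percentage of values within k sd of the mean
sd_coverage <- function(v, k = 1:3) {
  m <- mean(v)
  s <- sd(v)
  sapply(k, function(kk) mean(abs(v - m) < kk * s) * 100)
}

# Column A of part (b), copied from the table in the question
a <- c(65.77503, 87.57873, 69.16423, 78.7769, 39.29159, 83.14534, 49.35916,
       45.42246, 83.51742, 88.21379, 51.31862, 2.764132, 63.2915, 14.5491,
       58.35385, 79.3301, 41.43736, 74.38735, 92.77566, 41.16351, 50.53694,
       32.39941, 34.11571, 70.84987, 62.49577, 49.11496, 77.40276, 26.11119,
       63.02004, 52.11828, 57.48353, 9.698192, 27.26149, 68.38577, 44.85484,
       63.25484, 41.5813, 59.74991, 55.47338, 59.50365)

mean(a)                    # sample mean
sd(a)                      # sample standard deviation
cov_a <- sd_coverage(a)    # percentages within 1, 2 and 3 sd
cov_a
```

Column B can be handled the same way, and `hist(a, breaks = seq(0, 100, by = 10))` gives the matching 10-bin histogram. Comparing `cov_a` with 68.26, 95 and 99.7 then indicates how closely each column follows the rule.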