Load “Lock5Data” into your R console. Load the “OlympicMarathon” data set in “Lock5Data”. This data set contains the population of all finishing times for the 2008 Olympic Men’s Marathon.
a) What is the population size?
b) Now, using the “Minutes” column, generate a random sample of size 5.
c) Calculate the sample mean and record it (create an Excel sheet or write an R program to record this).
d) Repeat steps (b) and (c) 10,000 times (that means you will have recorded 10,000 sample means).
What you have in step (d) closely resembles the distribution of sample means with sample size 5.
e) Calculate the mean of the 10,000 sample means.
f) Calculate the population mean (that means using all data in the “Minutes” column).
g) According to the central limit theorem, if the conditions are satisfied, then the mean of the distribution of sample means should be close to the population mean. Now compare your results for parts (e) and (f). Are they the same, or at least close to each other?
h) Calculate the standard deviation of the 10,000 data points you have from above.
i) Now calculate the theoretical standard error using the formula $\sigma/\sqrt{n}$. Here $\sigma$ is the standard deviation using all “Minutes” data and $n$ is the sample size (which is equal to 5 in this case).
j) Comment on your results in parts (h) and (i).
k) Graph your 10,000 records in a histogram.
l) Is your histogram close to a normal distribution shape?
m) According to the central limit theorem, if the sample size is large enough, then the distribution of sample means is close to a normal distribution. Let us increase the sample size to see whether this is true or not. Use sample size 40 and repeat steps (b), (c) and (d) again. Create a histogram for this new data set. Does your histogram look like a normal distribution?
Read OlympicMarathon data
R-code
library(Lock5Data)
data("OlympicMarathon")
mardata <- OlympicMarathon
a) Find Population Size
R-code
nrow(mardata)
R-output
85
(Population size is 85)
b) Using the Minutes column, generate a random sample of size 5
R-code
set.seed(23948)
ssize = 5
sam1 <- sample(OlympicMarathon$Minutes, ssize, replace=FALSE, prob=NULL)
sam1
R-output
139.00 141.42 137.32 135.58 140.17
c) Calculate Sample Mean and Record it
R-code
m1 = mean(sam1)
m1
R-output
138.698
(Mean of the sample is 138.698)
d) Generate 10000 samples each of size 5 and calculate mean for each sample
R-code
# 10000 samples of size 5 each
ssize = 5
iterations = 10000
# sam is a matrix with ssize rows and iterations columns; each column is one sample
sam <- replicate(iterations, sample(OlympicMarathon$Minutes, ssize, replace=FALSE, prob=NULL))
# mn is the array for storing the sample means
mn <- rep(0, iterations)
for (i in 1:iterations)
{
  mn[i] = mean(sam[,i])
}
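The sampling and the loop in part (d) can also be combined into a single call. The following is a minimal sketch, not the original author's code: the helper name `sample_means` is hypothetical, and a small made-up vector stands in for `OlympicMarathon$Minutes` so that the block runs without the package.

```r
# Hypothetical helper: draw `reps` samples of size `n` without replacement
# from `x` and return the mean of each sample.
sample_means <- function(x, n, reps) {
  replicate(reps, mean(sample(x, n, replace = FALSE)))
}

set.seed(1)
# Made-up stand-in data; with Lock5Data loaded, pass OlympicMarathon$Minutes instead.
times <- c(126.7, 129.0, 130.3, 131.2, 132.5, 134.1, 135.6, 137.3, 139.0, 141.4)
mn_demo <- sample_means(times, n = 5, reps = 1000)
length(mn_demo)  # one mean per sample
```

Calling `sample_means(OlympicMarathon$Minutes, 5, 10000)` would play the same role as the `sam` matrix plus the `for` loop, without storing the individual samples.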
e) Calculate Mean of 10000 sample means
R-code
sam_mean = mean(mn)
cat ("Sample mean = ", sam_mean, "\n")
R-output
Sample mean = 140.5635
f) Calculate Population Mean
R-code
pop_mean = mean(OlympicMarathon$Minutes)
cat ("Population mean = ", pop_mean)
R-output
Population mean = 140.5918
g)
Mean of the 10,000 sample means = 140.5635
Population mean = 140.5918
The two values differ by only about 0.03 minutes, so the mean of the sample means is very close to the population mean, as expected.
h) Standard Deviation of 10000 data points
R-code
sam_sd = sd(mn)
cat ("Standard Deviation of 10000 data points = ", sam_sd, "\n")
R-output
Standard Deviation of 10000 data points = 3.488243
i) Theoretical standard error
R-code
# Standard deviation of all Minutes data (note: sd() divides by N - 1)
sigma = sd(OlympicMarathon$Minutes)
# Theoretical standard error of the sample mean: sigma / sqrt(n)
theory_sd = sigma/sqrt(ssize)
cat ("Theoretical Standard Deviation = ", theory_sd, "\n")
R-output
Theoretical Standard Deviation = 3.58105
j) Theoretical standard error = 3.58105
Standard deviation of the 10,000 sample means = 3.488243
The two values differ by about 0.09 (roughly 2.6%), so they are also close to each other.
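One possible explanation for the small remaining gap, offered as a sketch rather than as part of the original answer: the samples are drawn without replacement from a finite population, so the standard error of the mean picks up a finite population correction factor sqrt((N - n)/(N - 1)), and `sd()` divides by N - 1 rather than N. The helper name `se_finite` and the toy vector below are hypothetical.

```r
# Hypothetical helper: standard error of the mean of a simple random sample
# of size n drawn WITHOUT replacement from the finite population x.
se_finite <- function(x, n) {
  N <- length(x)
  sigma_pop <- sqrt(mean((x - mean(x))^2))   # population SD (divides by N)
  sigma_pop / sqrt(n) * sqrt((N - n) / (N - 1))
}

# Algebraically this equals sd(x) / sqrt(n) * sqrt((N - n) / N),
# because sd() divides by N - 1.
x <- c(2, 4, 4, 5, 7, 9)   # made-up data for illustration
se_finite(x, n = 3)
# With Lock5Data loaded: se_finite(OlympicMarathon$Minutes, n = 5)
```

Taking the reported values at face value (N = 85, n = 5, sigma/sqrt(n) = 3.58105), the corrected figure would be about 3.58105 × sqrt(80/85) ≈ 3.474, which is closer to the simulated 3.488243.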
k) Graph a histogram of 10000 datapoints
R-code
# Graph a histogram of the 10000 sample means
hist(mn, main = "Histogram of 10000 sample means (sample size 5)",
xlab = "Mean of Minutes", ylab = "Frequency")
R-output
[Figure: histogram of the 10,000 sample means, sample size 5]
l) The histogram is skewed to the right, so it does not have a normal distribution shape.
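A visual judgement of skewness can be backed up with a number and a normal Q-Q plot. This is a minimal sketch under stated assumptions: the helper `skewness` is hypothetical (a simple moment-based definition, not a function from the original answer), and the data are simulated stand-ins rather than the marathon sample means.

```r
# Hypothetical helper: moment-based skewness (0 for a symmetric distribution,
# positive for a right-skewed one).
skewness <- function(v) {
  z <- v - mean(v)
  mean(z^3) / mean(z^2)^1.5
}

set.seed(2)
right_skewed <- rexp(10000)   # made-up right-skewed data
symmetric <- rnorm(10000)     # made-up symmetric data
skewness(right_skewed)        # clearly positive
skewness(symmetric)           # near zero

# A normal Q-Q plot gives a visual check: points bending away from the
# reference line indicate a departure from normality.
qqnorm(right_skewed)
qqline(right_skewed)
```

Applied to the answer's own vector, `skewness(mn)` and `qqnorm(mn)` would give the corresponding checks for the sample means of size 5.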
m)
For this part, we change the variable ssize to 40 in the code above, rerun it, and plot the histogram of the new sample means.
[Figure: histogram of the 10,000 sample means, sample size 40]
This histogram is approximately normal in shape, which is consistent with the central limit theorem.
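The answer does not show the code for the rerun with sample size 40. A minimal sketch of what it could look like is given here; the vector `pop` is a made-up right-skewed population of 85 values that stands in for `OlympicMarathon$Minutes` so the block runs without the package, and the name `mn40` is an assumption.

```r
set.seed(3)
# Made-up right-skewed population of 85 values standing in for
# OlympicMarathon$Minutes (swap in the real column when Lock5Data is loaded).
pop <- 126 + rexp(85, rate = 1 / 14)

ssize <- 40
iterations <- 10000
# One sample mean per iteration, sampling without replacement
mn40 <- replicate(iterations, mean(sample(pop, ssize, replace = FALSE)))

hist(mn40, main = "Histogram of 10000 sample means (sample size 40)",
     xlab = "Mean of Minutes", ylab = "Frequency")
```

With a sample of 40 out of a population of 85 drawn without replacement, each sample covers almost half the population, so the sample means are tightly concentrated around the population mean.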