Question

Load the “Lock5Data” package in your R console, then load the “OlympicMarathon”
data set from it. This data set contains the population of all finishing times
in the 2008 Olympic Men’s Marathon.
a) What is the population size?
b) Now, using the “Minutes” column, generate a random sample of size 5.
c) Calculate the sample mean and record it (create an Excel sheet or write an
R program to record it).
d) Repeat steps (b) and (c) 10,000 times (that means you will have recorded
10,000 sample means).
What you have in step (d) closely resembles the distribution of sample means
with sample size 5.
e) Calculate the mean of the 10,000 sample means.
f) Calculate the population mean (that is, the mean of all data in the “Minutes” column).
g) According to the central limit theorem, if its conditions are satisfied, the mean of
the distribution of sample means should be close to the population mean. Now
compare your results for parts (e) and (f). Are they the same, or at least close to
each other?
h) Calculate the standard deviation for 10,000 data points you have from above.
i) Now calculate the theoretical standard error using the formula $\sigma / \sqrt{n}$.
Here $\sigma$ is the standard deviation of all the “Minutes” data and $n$ is the
sample size (which is equal to 5 in this case).

j) Comment on your results in parts (h) and (i).
k) Graph your 10,000 records in a histogram.
l) Is your histogram close to the shape of a normal distribution?
m) According to the central limit theorem, if the sample size is large enough, the
distribution of sample means is close to a normal distribution. Let us increase the
sample size to see whether this is true. Use sample size 40 and repeat
steps (b), (c) and (d). Create a histogram for this new data set. Does your
histogram look like a normal distribution?

Solutions

Expert Solution

Read OlympicMarathon data

R-code

library(Lock5Data)

data("OlympicMarathon")

mardata <- OlympicMarathon

a) Find Population Size

R-code

nrow(mardata)

R-output

85

(Population size is 85)

b) Using the Minutes column, generate a random sample of size 5

R-code

set.seed(23948)

ssize = 5
sam1<-sample(OlympicMarathon$Minutes, ssize, replace=FALSE, prob=NULL)
sam1

R-output

139.00 141.42 137.32 135.58 140.17

c) Calculate Sample Mean and Record it

R-code

m1 = mean(sam1)
m1

R-output

138.698

(Mean of the sample is 138.698)

d) Generate 10000 samples each of size 5 and calculate mean for each sample

R-code

# 10000 samples of size 5 each
ssize = 5
iterations = 10000
sam <- replicate(iterations, sample(OlympicMarathon$Minutes, ssize, replace=FALSE, prob=NULL))

# mn stores the mean of each sample (each column of sam is one sample)
mn <- colMeans(sam)

e) Calculate Mean of 10000 sample means

R-code
sam_mean = mean(mn)
cat ("Sample mean = ", sam_mean, "\n")

R-output

Sample mean =  140.5635 

f) Calculate Population Mean

R-code

pop_mean = mean(OlympicMarathon$Minutes)
cat ("Population mean = ", pop_mean)

R-output

Population mean =  140.5918

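g) Compare the results of parts (e) and (f)

Mean of the 10,000 sample means = 140.5635

Population mean = 140.5918

The two values differ by only about 0.03 minutes, so the mean of the sample means is very close to the population mean.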
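
A short derivation suggests why this agreement should hold for any sample size, not only large ones. The notation is assumed here for illustration: X_1, ..., X_n are the n sampled values, \bar{X} is their mean, and \mu is the population mean.

```latex
\[
E[\bar{X}]
  = E\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]
  = \frac{1}{n}\sum_{i=1}^{n} E[X_i]
  = \frac{1}{n}\cdot n\mu
  = \mu
\]
```

Each draw, taken on its own, is equally likely to be any value in the population, so E[X_i] = \mu whether the sample is drawn with or without replacement. The small gap between 140.5635 and 140.5918 is therefore simulation noise from using 10,000 samples instead of every possible sample.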

h) Standard Deviation of 10000 data points

R-code
sam_sd = sd(mn)
cat ("Standard Deviation of 10000 data points = ", sam_sd, "\n")

R-output

Standard Deviation of 10000 data points =  3.488243

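i) Theoretical standard error

R-code

# Standard deviation of all Minutes values (note: sd() divides by N - 1)
sigma = sd(OlympicMarathon$Minutes)
# Theoretical standard error of the sample mean: sigma / sqrt(sample size)
theory_sd = sigma/sqrt(ssize)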

cat ("Theoretical Standard Deviation = ", theory_sd, "\n")

R-output

Theoretical Standard Deviation =  3.58105 

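j) Theoretical standard error = 3.58105

Standard deviation of the 10,000 sample means = 3.488243

The two values are close: they differ by about 0.09 minutes (roughly 2.6%). The simulated value is slightly smaller, which is expected because the samples were drawn without replacement (replace=FALSE) from a finite population, whereas sigma/sqrt(ssize) assumes independent draws.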
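
The remaining gap can be examined with the finite population correction, which is an addition to the original solution. The sketch below is illustrative only: the helper name se_finite and the population pop are made up for this example (pop is a random stand-in, not the marathon data), and the simulated value will vary with the seed.

```r
# Standard error of the mean for samples of size n drawn WITHOUT replacement
# from a fully known finite population x of size N.
# sd() divides by N - 1, which is what this form of the correction expects.
se_finite <- function(x, n) {
  N <- length(x)
  sd(x) / sqrt(n) * sqrt(1 - n / N)
}

# Hypothetical right-skewed stand-in population of 85 values (NOT the real data).
# With Lock5Data loaded, one could use: pop <- OlympicMarathon$Minutes
set.seed(1)
pop <- 125 + rgamma(85, shape = 2, scale = 8)
n <- 5

se_simple <- sd(pop) / sqrt(n)     # formula from part (i)
se_fpc    <- se_finite(pop, n)     # with finite population correction
se_sim    <- sd(replicate(10000, mean(sample(pop, n, replace = FALSE))))

cat("sigma/sqrt(n)            =", se_simple, "\n")
cat("with finite correction   =", se_fpc, "\n")
cat("simulated (10000 means)  =", se_sim, "\n")
```

Applied to the figures above, with N = 85 and n = 5, the corrected value is 3.58105 * sqrt(1 - 5/85), or about 3.474. That is closer to the simulated 3.488243 than the uncorrected 3.58105.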

k) Graph a histogram of 10000 datapoints

R-code

#Graph Histogram
hist(mn, main = "Histogram of 10000 sample means (sample size 5)", xlab = "Sample mean of Minutes", ylab = "Frequency")

R-output

l) The histogram is skewed to the right, so it does not have the shape of a normal distribution.

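m) Repeat with sample size 40

For this, we set the variable ssize to 40, rerun the code from parts (b) to (d), and plot the histogram of the new sample means.

The histogram for sample size = 40 has an approximately normal shape.

Hence, the result is consistent with the central limit theorem: with the larger sample size, the distribution of sample means is close to a normal distribution.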
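
The rerun for part (m) can be sketched as a small function so that both sample sizes are compared side by side. This is a minimal sketch, not the original author's code: the helper sample_means is a hypothetical name, and minutes is a made-up stand-in population so the block runs without the package. The normal Q-Q plots are an extra check beyond what the question asks for.

```r
# Draw `reps` samples of size n without replacement and return their means.
sample_means <- function(x, n, reps = 10000) {
  replicate(reps, mean(sample(x, n, replace = FALSE)))
}

# Hypothetical right-skewed stand-in population of 85 values (NOT the real data).
# With Lock5Data loaded, one could use: minutes <- OlympicMarathon$Minutes
set.seed(1)
minutes <- 125 + rgamma(85, shape = 2, scale = 8)

mn5  <- sample_means(minutes, 5)
mn40 <- sample_means(minutes, 40)

# Top row: histograms; bottom row: normal Q-Q plots.
# Points lying close to the reference line suggest a near-normal shape.
op <- par(mfrow = c(2, 2))
hist(mn5,  main = "Sample means, size 5",  xlab = "Sample mean")
hist(mn40, main = "Sample means, size 40", xlab = "Sample mean")
qqnorm(mn5,  main = "Q-Q plot, size 5");  qqline(mn5)
qqnorm(mn40, main = "Q-Q plot, size 40"); qqline(mn40)
par(op)
```

With the real data, passing OlympicMarathon$Minutes in place of minutes reproduces steps (b) to (d) for either sample size.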

