Q. In this question, you will do some resampling and show the results graphically. This is related to the bootstrap technique. The population distribution is normal with µ = 10 and σ^2 = 4. The statistic is the sample mean. Hence in theory we know exactly what the density function of the sample mean is.
(a) Simulate a sample, say x, with sample size n=100. Report its mean, sd, min, and max.
(b) Use R functions sample and replicate to resample x 50000 times with replacement. The statistic is the sample mean and the output is booted.data. Find the mean, sd, min, and max of booted.data.
(c) Plot the histogram of booted.data. Please double the number of cells of the histogram since the default is too small. Please plot it as a density plot since the theoretical density will be added in the next step. Comment on the shape and center of this distribution.
(d) Plot the histogram of booted.data-mean(x) with twice the default number of cells. Please plot it as a density plot. Add the theoretical density function of X̄ − µ to the histogram with a different line type and color. Comment on your findings.
(e) Repeat the procedures from (a) to (d) two additional times to check consistency.
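Let X be a random variable having a normal distribution with mean µ = 10 and variance σ^2 = 4 (so σ = 2).
Let X1, ..., Xn be a randomly selected sample of size n=100. Because the population is normal, the sample mean X̄ is exactly normally distributed (the central limit theorem gives the same result approximately for non-normal populations), with mean µ = 10 and standard deviation (the standard error of the mean) σ/√n = 2/√100 = 0.2.
That is, X̄ ~ N(10, 0.2^2).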
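As a quick sanity check on the value 0.2, the standard error can also be approximated by simulation. The following is only a sketch: the seed and the number of repetitions (20000) are arbitrary choices and are not part of the original answer.

```r
# Sketch: check the theoretical standard error sigma/sqrt(n) by simulation.
set.seed(1)                      # arbitrary seed, chosen only for reproducibility
mu <- 10; sigma <- 2; n <- 100
se.theory <- sigma / sqrt(n)     # theoretical standard error of the mean
# draw 20000 independent samples of size n from the population
# and record the mean of each one
xbar <- replicate(20000, mean(rnorm(n, mean = mu, sd = sigma)))
# compare theory with the simulated sampling distribution
c(se.theory = se.theory, se.simulated = sd(xbar), mean.simulated = mean(xbar))
```

The simulated standard deviation of the sample means should land close to 0.2 and their average close to 10.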
(a) Simulate a sample, say x, with sample size n=100. Report its mean, sd, min, and max.
R code with comments
---
#set the random seed
set.seed(123)
#set the sample size
n<-100
#a) simulate x from normal(10,4)
x<-rnorm(n,mean=10,sd=2)
#report mean, sd, min, and max.
sprintf('The sample mean:%.4f sd:%.4f min:%.4f max:%.4f',mean(x),sd(x),min(x),max(x))
----
[Output: mean, sd, min, and max of x]
(b) Use R functions sample and replicate to resample x 50000 times with replacement. The statistic is the sample mean and the output is booted.data. Find the mean, sd, min, and max of booted.data.
R code with comments
----
#b)
#sample x with replacement 50000 times and find the sample mean
booted.data<-replicate(50000,mean(sample(x,size=n,replace=TRUE)))
#report mean, sd, min, and max.
sprintf('The sample mean:%.4f sd:%.4f min:%.4f max:%.4f',
mean(booted.data),sd(booted.data),min(booted.data),max(booted.data))
----
[Output: mean, sd, min, and max of booted.data]
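To interpret these numbers, it may help to compare them with what the bootstrap is expected to give: the mean of booted.data should be close to mean(x), and its sd should be close to the plug-in standard error, i.e. the standard deviation of the empirical distribution of x divided by √n. The sketch below repeats the steps of (a) and (b) so that it runs on its own; the names B and se.plugin are introduced here for illustration only.

```r
# Sketch: compare the bootstrap mean and sd with their expected values.
set.seed(123)
n <- 100
x <- rnorm(n, mean = 10, sd = 2)
B <- 50000                                   # number of bootstrap resamples
booted.data <- replicate(B, mean(sample(x, size = n, replace = TRUE)))
# plug-in standard error: sd of the empirical distribution (divisor n)
# divided by sqrt(n)
se.plugin <- sqrt(mean((x - mean(x))^2) / n)
c(boot.mean = mean(booted.data), sample.mean = mean(x),
  boot.sd = sd(booted.data), se.plugin = se.plugin,
  se.theory = 2 / sqrt(n))
```

Note that the bootstrap sd tracks sd(x), not the population σ = 2, so it will generally differ a little from the theoretical 0.2.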
(c) Plot the histogram of booted.data. Please double the number of cells of the histogram since the default is too small. Please plot it as a density plot since the theoretical density will be added in the next step. Comment on the shape and center of this distribution.
R code with comments
---
#c)plot the density histogram of booted.data
hist(booted.data,breaks=30,freq=FALSE)
---
[Figure: density histogram of booted.data]
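We can see that the histogram is bell-shaped and centered at around 10.2, which is the mean of the original sample x rather than the population mean µ = 10. This is as expected: resampling from x reproduces the sample's own center, and by the central limit theorem the distribution of the resampled means is approximately normal.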
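The value breaks=30 above is a hand-picked approximation of "double the default". A possible alternative, sketched here, is to compute the default suggestion explicitly: hist() uses Sturges' rule by default, which is available as nclass.Sturges(). The variable names k.default, h.default and h.double are illustrative. Keep in mind that hist() treats a single number for breaks as a suggestion only, so the actual number of cells may differ from the requested one.

```r
# Sketch: derive "twice the default number of cells" from Sturges' rule.
set.seed(123)
x <- rnorm(100, mean = 10, sd = 2)
booted.data <- replicate(50000, mean(sample(x, size = 100, replace = TRUE)))
# default suggestion used by hist(): ceiling(log2(length) + 1)
k.default <- nclass.Sturges(booted.data)
# compute both histograms without drawing them, to count the cells
h.default <- hist(booted.data, plot = FALSE)
h.double  <- hist(booted.data, breaks = 2 * k.default, plot = FALSE)
c(sturges = k.default,
  default.cells = length(h.default$counts),
  doubled.cells = length(h.double$counts))
# draw the density histogram with the doubled suggestion
hist(booted.data, breaks = 2 * k.default, freq = FALSE)
```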
(d) Plot the histogram of booted.data-mean(x) with twice the default number of cells. Please plot it as a density plot. Add the theoretical density function of X̄ − µ to the histogram with a different line type and color. Comment on your findings.
We have already seen that the sample mean of a sample of size n=100 has the distribution X̄ ~ N(10, 0.2^2).
Hence the distribution of X̄ − µ is normal with mean 0 and standard deviation 0.2.
R code with comments
---
#d) histogram of booted.data-mean(x)
hist(booted.data-mean(x),breaks=30,freq=FALSE)
#add the theoretical distribution of Xbar-mu
#(inside dnorm, x is curve()'s plotting variable; in from/to, x is the data vector)
curve(dnorm(x,0,0.2),from=min(booted.data)-mean(x),to=max(booted.data)-mean(x),add=TRUE,col="red",lty=2)
----
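[Figure: density histogram of booted.data-mean(x) with the theoretical N(0, 0.2^2) density overlaid as a dashed red line]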
We can see that the theoretical distribution of X̄ − µ, indicated by the dashed red line, closely matches the density histogram of booted.data-mean(x), hence supporting the theory.
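The visual match can be backed up with a few numbers by comparing quantiles of booted.data-mean(x) with the corresponding quantiles of N(0, 0.2^2). This is only a sketch: the chosen probabilities p and the names centered and cmp are illustrative and not part of the original answer.

```r
# Sketch: compare bootstrap quantiles with the theoretical N(0, 0.2^2) quantiles.
set.seed(123)
n <- 100
x <- rnorm(n, mean = 10, sd = 2)
booted.data <- replicate(50000, mean(sample(x, size = n, replace = TRUE)))
centered <- booted.data - mean(x)
p <- c(0.025, 0.25, 0.5, 0.75, 0.975)      # probabilities to compare
cmp <- rbind(bootstrap = quantile(centered, p),
             theory    = qnorm(p, mean = 0, sd = 2 / sqrt(n)))
round(cmp, 3)
```

Small differences in the tails are expected, because the spread of the bootstrap distribution is driven by sd(x) of the one observed sample rather than by the true σ = 2.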
(e) Repeat the procedures from (a) to (d) two additional times to check consistency.
We will change the random seed to get a different sample.
All the code together is:
----
#set the random seed
set.seed(124)
#set the sample size
n<-100
#a) simulate x from normal(10,4)
x<-rnorm(n,mean=10,sd=2)
#report mean, sd, min, and max.
sprintf('The sample mean:%.4f sd:%.4f min:%.4f max:%.4f',mean(x),sd(x),min(x),max(x))
#b)
#sample x with replacement 50000 times and find the sample mean
booted.data<-replicate(50000,mean(sample(x,size=n,replace=TRUE)))
#report mean, sd, min, and max.
sprintf('The sample mean:%.4f sd:%.4f min:%.4f max:%.4f',
mean(booted.data),sd(booted.data),min(booted.data),max(booted.data))
#make way for 2 graphs
par(mfrow=c(2,1))
#c)plot the density histogram of booted.data
hist(booted.data,breaks=30,freq=FALSE)
#d) histogram of booted.data-mean(x)
hist(booted.data-mean(x),breaks=30,freq=FALSE)
#add the theoretical distribution of Xbar-mu
#(inside dnorm, x is curve()'s plotting variable; in from/to, x is the data vector)
curve(dnorm(x,0,0.2),from=min(booted.data)-mean(x),to=max(booted.data)-mean(x),add=TRUE,col="red",lty=2)
---
[Output and plots for run #2, set.seed(124)]
[Output and plots for run #3, set.seed(125)]
We can see that the observations we made for run #1 still hold for these two runs, and hence the results are consistent with theory.
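Rather than editing the seed by hand for each run, the numeric part of the procedure could be wrapped in a function and applied to all three seeds at once. This is a sketch under that assumption; run.once and results are hypothetical names, not part of the original answer, and the plots are left out to keep it short.

```r
# Sketch: repeat steps (a) and (b) for several seeds and tabulate the results.
run.once <- function(seed, n = 100, B = 50000) {
  set.seed(seed)
  x <- rnorm(n, mean = 10, sd = 2)            # (a) simulate the sample
  booted.data <- replicate(B, mean(sample(x, size = n, replace = TRUE)))  # (b)
  c(seed = seed, mean.x = mean(x), sd.x = sd(x),
    boot.mean = mean(booted.data), boot.sd = sd(booted.data))
}
# one row per run; seeds are the ones used in the three runs above
results <- t(sapply(c(123, 124, 125), run.once))
round(results, 4)
```

In each row, boot.mean should sit very close to mean.x, and boot.sd should be near the theoretical 0.2, varying from run to run with sd.x.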