In: Statistics and Probability
Generate 2500 random numbers that are uniformly distributed between 90 and 160. Prove experimentally that STD of sample means = STD of Population/sqrt(sample size) for sample sizes of 10 and 100. How close is your calculation of STD of sample means to the theoretical approximation? Keep number of samples in each case equal to sample size. Repeat for normal and weibull (also between 90 and 160). What does it say about STD of sample means as you increase your sample size? Repeat the problem by generating 2500 random numbers with 2 peaks in its frequency distribution. Does the STD of sample means estimate formula depend on the frequency distribution of your population? For each population that is uniform, normal, weibull and 2-peaks plot the histogram to QC your population data.
Solution for the first part: (Generate 2500 random numbers that are uniformly distributed between 90 and 160. Prove experimentally that STD of sample means = STD of Population/sqrt(sample size) for sample sizes of 10)
x=runif(2500,90,160)
means = numeric(10)
for(i in 1:10)
{
vector = x[((i-1)*10+1):(i*10)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(10)
sd(means) - sd(x)/sqrt(10)
After running this code we get:
STD of sample means = 6.236872
STD of Population/sqrt(sample size) = 6.422313
We see that these are almost equal and thus prove the statement.
The difference of the STD of sample means and the theoretical approximation = -0.1854414
The Solution for the second part:(Generate 2500 random numbers that are uniformly distributed between 90 and 160. Prove experimentally that STD of sample means = STD of Population/sqrt(sample size) for sample sizes of 100)
x=runif(2500,90,160)
index = sample(1:2500,10000,replace = TRUE)
x = x[index]
#This part of the code has been done to choose 100 random
samples of size 100 from the 2500 randomly generated numbers.
means = numeric(100)
for(i in 1:100)
{
vector = x[((i-1)*100+1):(i*100)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(100)
sd(means) - sd(x)/sqrt(100)
After running this code we get:
STD of sample means = 2.010739
STD of Population/sqrt(sample size) = 2.020649
We see that these are almost equal and thus prove the statement.
The difference of the STD of sample means and the theoretical approximation = -0.009909479
The Solution for the third part: (Repeat for normal)
x=rnorm(3000,(160+90)/2,(160+90)/6)
x = x[x>90&&x<160]
x = x[1:2500]
#This part of the code is to make sure that the 2500 generated from normal distribution lie between 90 and 160. It is known that 99% of normally distributed data lie in (-3 , 3) and thus the std dev of the normal distribution has been chosen that way.
For sample size 10:
means = numeric(10)
for(i in 1:10)
{
vector = x[((i-1)*10+1):(i*10)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(10)
sd(means) - sd(x)/sqrt(10)
The STD of means = 12.40488
The STD(population)/sqrt(Sample size) = 13.17351
Their difference is -0.7686233
For sample size 100:
The R code:
x=rnorm(3000,(160+90)/2,(160+90)/6)
x = x[x>90&x<160]
x = x[1:2500]
index = sample(1:2500,10000,replace = TRUE)
x = x[index]
means = numeric(100)
for(i in 1:100)
{
vector = x[((i-1)*100+1):(i*100)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(100)
sd(means) - sd(x)/sqrt(100)
The STD of means = 3.729523
The STD(population)/sqrt(Sample size) = 4.05702
Their difference is -0.3274966
The solution for the fourth part: (Repeat for Weibull)
The R code: for sample size 10
library(distr)
d<-Truncate(Weibull(shape=1,scale=(90+160)/2),lower=90,upper=160)
x = d@r(2500)
#This part was done to generate 2500 numbers from the weibull dist between 90 and 160. The package "distr" has been used to do so.
means = numeric(10)
for(i in 1:10)
{
vector = x[((i-1)*10+1):(i*10)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(10)
sd(means) - sd(x)/sqrt(10)
The STD of means =7.067525
The STD(population)/sqrt(Sample size) = 6.22025
Their difference is 0.8472753
The R code for sample size 100:
library(distr)
d<-Truncate(Weibull(shape=1,scale=(90+160)/2),lower=90,upper=160)
x = d@r(2500)
index = sample(1:2500,10000,replace = TRUE)
x = x[index]
means = numeric(100)
for(i in 1:100)
{
vector = x[((i-1)*100+1):(i*100)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(100)
sd(means) - sd(x)/sqrt(100)
The STD of means =2.286154
The STD(population)/sqrt(Sample size) = 2.022889
Their difference is 0.263265
Solution to the Fifth part: (What does it say about STD of sample means as you increase your sample size? )
As we can see from all the above examples the STD of sample means is closer to the theoretical approximation as we increase the sample size that means that the theoretical approximation becomes more accurate.
Solution to Sixth part (Repeat for 2-peaked distribution)
The R code for sample size 10
y0 <- rlnorm(2500,mean=log(1), sd = log(3))
y1 <- rlnorm(2500,mean=log(100), sd = log(3))
flag <- rbinom(2500,size=1,prob=0.4)
y <- y0*(1 - flag) + y1*flag
x= log(y)
# This part has been done to generate a two peaked distribution. from two log normal distributions.
means = numeric(10)
for(i in 1:10)
{
vector = x[((i-1)*10+1):(i*10)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(10)
sd(means) - sd(x)/sqrt(10)
The STD of means = 0.6063173
The STD(population)/sqrt(Sample size) = 0.7917557
Their difference is -0.1854384
The R code for sample size 100:
y0 <- rlnorm(2500,mean=log(1), sd = log(3))
y1 <- rlnorm(2500,mean=log(100), sd = log(3))
flag <- rbinom(2500,size=1,prob=0.4)
y <- y0*(1 - flag) + y1*flag
x= log(y)
index = sample(1:2500,10000,replace = TRUE)
x = x[index]
means = numeric(100)
for(i in 1:100)
{
vector = x[((i-1)*100+1):(i*100)]
means[i] = mean(vector)
}
sd(means)
sd(x)/sqrt(100)
sd(means) - sd(x)/sqrt(100)
The STD of means = 0.2649177
The STD(population)/sqrt(Sample size) = 0.253053
Their difference is 0.01186469
Solution for the seventh part:(Does the STD of sample means estimate formula depend on the frequency distribution of your population?)
Yes it does. We can see from the results above that the estimate is closer to the value in some cases than the thers depending upon the frequency distribution.
Solution for the eighth part:(For each population that is uniform, normal, weibull and 2-peaks plot the histogram to QC your population data.)
Note: The histogram of normal and weibull doesnt look as expected as they are truncated to lie in between 90 and 160