For this problem, you will run a simulation to investigate how violating the assumption of normally distributed data can affect the properties of a t-test.
a). The gamma distribution is skewed to the right. It contains a parameter called “shape”. The R function for generating data from a gamma distribution is rgamma; you can read the details in R help. Make three histograms, each of a sample of size n = 10,000 drawn from a gamma distribution, with shape = 1, shape = 0.5, and shape = 0.1. Use “breaks = 100” to force each histogram to have lots of bars. Describe what you see happening as the shape parameter gets smaller.
b). Write a simulation that repeatedly draws two samples from a gamma distribution with shape = 1, then compares their means using a t-test. For this simulation, use n = 30 for the size of each sample. Write code that will save both the t-test statistic and p-value each time. Then make a histogram of the test statistics, and report the proportion of p-values less than 0.05. Note that, if the assumptions of the t-test are not violated, the p-value should be less than 0.05 5% of the time.
c). Repeat part b) two more times, using shape = 0.5 and shape = 0.1. Does this seem to have any effect on the distribution of the test statistics, or the proportion of p-values less than 0.05?
d). Run the simulation three more times (once for each value of shape), using samples of size n = 10 rather than n = 30. Show the three histograms and three proportions of p-values less than 0.05. Did this have any noticeable effect on the results?
Answer:-
a). Histograms of samples of size n = 10,000 from a gamma distribution with shape = 1, 0.5, and 0.1.
R CODE:
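# Draw 10,000 values for each shape (rate defaults to 1)
d1 <- rgamma(n = 10000, shape = 1)
d2 <- rgamma(n = 10000, shape = 0.5)
d3 <- rgamma(n = 10000, shape = 0.1)
# Show the three histograms side by side; par() works on every platform,
# whereas windows() is only available on Windows
par(mfrow = c(1, 3))
hist(d1, breaks = 100, main = "shape = 1")
hist(d2, breaks = 100, main = "shape = 0.5")
hist(d3, breaks = 100, main = "shape = 0.1")
par(mfrow = c(1, 1))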
R OUTPUT:
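As the shape parameter decreases, the skewness increases (for a gamma distribution it equals 2/sqrt(shape)): more of the mass piles up near zero while a long right tail remains. With the default rate = 1 the variance equals the shape, so the absolute spread shrinks, but the spread relative to the mean, 1/sqrt(shape), grows.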
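The statement about skewness can be checked numerically. The following is a minimal sketch, not part of the original answer: the helper name sample_skewness and the seed are assumptions made for illustration, and the theoretical value 2/sqrt(shape) is the standard skewness of a gamma distribution.

```r
# Hypothetical helper: moment-based sample skewness
# (third central moment divided by the 1.5 power of the second).
sample_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
shapes <- c(1, 0.5, 0.1)

summary_table <- data.frame(
  shape = shapes,
  theoretical_skewness = 2 / sqrt(shapes),
  sample_skewness = sapply(shapes, function(a) {
    sample_skewness(rgamma(n = 10000, shape = a))
  })
)
print(summary_table)
```

The sample skewness is noisy for shape = 0.1 because it is driven by a few extreme values, so it will only roughly track the theoretical column.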
b). Simulation with shape = 1 and n = 30 per sample.
R CODE:
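n_sims <- 10000
p <- numeric(n_sims)  # p-values, preallocated so p[i] can be assigned
s <- numeric(n_sims)  # t statistics
for (i in 1:n_sims) {
  # Two independent samples of size 30 from the same gamma distribution;
  # t.test() uses the Welch (unequal variance) test by default
  test <- t.test(rgamma(n = 30, shape = 1), rgamma(n = 30, shape = 1), alternative = "two.sided")
  p[i] <- test$p.value
  s[i] <- test$statistic
}
mean(p < 0.05)  # proportion of p-values below 0.05
hist(s, breaks = 100)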
R OUTPUT:
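The proportion of p-values less than 0.05 is obtained as 0.0446, which is close to, and slightly below, the nominal 5%. With shape = 1 and n = 30 the skewness therefore does not inflate the type I error rate.
The histogram of the test statistics appears roughly symmetric and bell-shaped (mesokurtic), as expected from the Central Limit Theorem with n = 30 per sample.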
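To judge how far 0.0446 is from 5%, it helps to look at the Monte Carlo standard error of a proportion estimated from 10,000 simulated tests. This is a rough sketch using the usual binomial formula sqrt(p(1 - p)/N); the variable names are chosen here for illustration.

```r
n_sims   <- 10000   # number of simulated t-tests
nominal  <- 0.05    # nominal significance level
observed <- 0.0446  # proportion reported in the answer

# Standard error of the simulated proportion if the true rate were exactly 5%
mc_se <- sqrt(nominal * (1 - nominal) / n_sims)

# Approximate 95% range for the simulated proportion under a true rate of 5%
expected_range <- nominal + c(-1.96, 1.96) * mc_se

# Distance of the observed proportion from 5%, in standard errors
z <- (observed - nominal) / mc_se

print(mc_se)
print(expected_range)
print(z)
```

The standard error comes out at about 0.0022, so the observed proportion sits roughly 2.5 standard errors below 5%. That suggests the test is mildly conservative here rather than exactly at the nominal level.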
c). Same simulation with shape = 0.5 and shape = 0.1 (n = 30).
R CODE:
n_sims <- 10000
par(mfrow = c(1, 2))  # one histogram per shape
for (shape in c(0.5, 0.1)) {
  p <- numeric(n_sims)  # p-values for this shape
  s <- numeric(n_sims)  # t statistics for this shape
  for (i in 1:n_sims) {
    # Repeat the full simulation for each shape, not a single extra test
    test <- t.test(rgamma(n = 30, shape = shape), rgamma(n = 30, shape = shape), alternative = "two.sided")
    p[i] <- test$p.value
    s[i] <- test$statistic
  }
  # Proportion of p-values below 0.05 for this shape
  print(c(shape = shape, proportion = mean(p < 0.05)))
  hist(s, breaks = 100, main = paste("shape =", shape, ", n = 30"))
}
par(mfrow = c(1, 1))
R OUTPUT:
The histogram of the test statistics remains roughly symmetric about zero for both shapes, which is expected because the two samples always come from the same distribution.
The proportion of p-values less than 0.05 is obtained to be less than 5%, so the stronger skewness does not inflate the type I error rate; if anything, the test becomes somewhat conservative as the shape parameter decreases.
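One way to make the comparison of histograms less subjective is to overlay a reference t density and compare tail quantiles. The sketch below is an illustration only: simulate_t_stats is a hypothetical helper name, the number of simulations is reduced to keep it quick, and df = 2n - 2 is used as an approximate reference because the Welch degrees of freedom change from test to test.

```r
set.seed(2)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: t statistics from repeated two-sample t-tests
# on gamma samples that share the same shape.
simulate_t_stats <- function(n_sims, n, shape) {
  replicate(n_sims, unname(
    t.test(rgamma(n, shape = shape), rgamma(n, shape = shape))$statistic
  ))
}

n <- 30
stats <- simulate_t_stats(n_sims = 2000, n = n, shape = 0.1)

# Density-scaled histogram with an approximate reference t density on top
hist(stats, breaks = 100, freq = FALSE,
     main = "shape = 0.1, n = 30", xlab = "t statistic")
curve(dt(x, df = 2 * n - 2), add = TRUE, lwd = 2)

# Compare simulated and reference 2.5% / 97.5% quantiles
tail_comparison <- rbind(
  simulated = quantile(stats, c(0.025, 0.975)),
  reference = qt(c(0.025, 0.975), df = 2 * n - 2)
)
print(tail_comparison)
```

If the simulated quantiles lie inside the reference ones, fewer than 5% of the statistics fall in the rejection region, which corresponds to a conservative test.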
d). Same simulations with n = 10 per sample, for shape = 1, 0.5, and 0.1.
R CODE:
n_sims <- 10000
n <- 10  # sample size for this part
par(mfrow = c(1, 3))  # one histogram per shape
for (shape in c(1, 0.5, 0.1)) {
  p <- numeric(n_sims)  # p-values for this shape
  s <- numeric(n_sims)  # t statistics for this shape
  for (i in 1:n_sims) {
    # Full simulation with n = 10 for each shape
    test <- t.test(rgamma(n = n, shape = shape), rgamma(n = n, shape = shape), alternative = "two.sided")
    p[i] <- test$p.value
    s[i] <- test$statistic
  }
  # Proportion of p-values below 0.05 for this shape
  print(c(shape = shape, proportion = mean(p < 0.05)))
  hist(s, breaks = 100, main = paste("shape =", shape, ", n = 10"))
}
par(mfrow = c(1, 1))
R OUTPUT:
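With n = 10 the Central Limit Theorem has less to work with, so any departure of the test statistics from the t distribution should be most visible for shape = 0.1. The histograms still remain symmetric about zero, because both samples come from the same distribution.
The proportion of p-values less than 0.05 is obtained to be less than 5% for each shape, so the smaller sample size does not inflate the type I error rate; the test errs on the conservative side.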
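The six combinations of sample size and shape can also be run in one pass and collected in a small table, which makes the comparison between n = 30 and n = 10 easier to read. This is a sketch under stated assumptions: rejection_rate is a hypothetical helper name, and 2,000 simulations per cell are used instead of 10,000 to keep the run short.

```r
set.seed(3)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: proportion of two-sample t-tests with p < alpha
# when both samples come from the same gamma distribution.
rejection_rate <- function(n, shape, n_sims = 2000, alpha = 0.05) {
  p_values <- replicate(n_sims, t.test(rgamma(n, shape = shape),
                                       rgamma(n, shape = shape))$p.value)
  mean(p_values < alpha)
}

# All combinations of sample size and shape used in parts b) to d)
grid <- expand.grid(n = c(10, 30), shape = c(1, 0.5, 0.1))
grid$rejection_rate <- mapply(rejection_rate, grid$n, grid$shape)
print(grid)
```

With 2,000 simulations per cell the Monte Carlo standard error is close to 0.005, so differences between cells smaller than about 0.01 should not be over-interpreted.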