In: Statistics and Probability
Given the data on scores of students final grade in statistics (in percent) determine the following statistics.
43 45 48 51 53 54 57 59 60 60 60 60 61 70 70 71 71 72 72 72 75 76 76 79 81 81 83 85 87 88 88 89 89 91 92 93 96 98 98 99 100 101 101
Assume students are only allowed to transfer the class if they receive a grade of 70 % or above. Use this fact to create a binomial distribution for students that are able to transfer and students not able to transfer the class. Do this by finding the proportion of students that receive a grade of 70 or above (this will be the value p and then q = 1 - p).
a. Determine the mean and standard deviation using the binomial distribution formulas.
b. Determine the range of usual value by finding the values that are significantly low and significantly high.
c. Use a normal continuous distribution to APPROMATE the binomial discrete probability distribution to determine the probability that at least 30 students score at a 70 or more. (Be sure to use the boundary to get the more accurate/correct answer.) Show an approximation box to verify your boundaries.
d. Use a normal continuous distribution to APPROMATE the binomial discrete probability distribution to determine the probability that exactly 30 students score at a 70 or more. (Be sure to use the boundary to get the more accurate/correct answer. Show an approximation box to verify your boundaries.
Excel Data
RStudio - RMarkdown Code
### Import the data in R
```{r}
library(xlsx)
library(tidyverse)
df <- read.xlsx(file = "data.xlsx", sheetIndex = 1)
check <- function(x) {
if (x<70) {
return("fail")
} else{
return("pass")
}
}
df$flag <- apply(df[1],1,check)
table(df$flag)
```
---
### Initial calculations
$$\begin{aligned}
& n = 13+30 = 43 \\
& \hat{p} = \frac{30}{43} = `r 30/43` \approx 0.7 \\
& \hat{q} = 1 - \hat{p} = 1 - 0.7 = 0.3
\end{aligned}$$
---
### Part a
Using properties of binomial distribution
Mean:
$$\begin{aligned}
\mu = np = 43*`r 30/43` = 30
\end{aligned}$$
Variance:
$$\begin{aligned}
\sigma^2 = npq = 43*0.7*0.3 = `r 43*0.7*0.3`
\end{aligned}$$
---
### Part b
We can use the $3\sigma$ rule to find the range for usual values.
$$\begin{aligned}
\sigma = \sqrt{\sigma^2} = \sqrt{`r 43*0.7*0.3`} = `r sqrt(43*0.7*0.3)` \approx 3
\end{aligned}$$
$$\begin{aligned}
& \mu \pm 3\sigma \\
& 30 \pm 3*3 \\
& 30 \pm 9 \\
& [21,39]
\end{aligned}$$
Therefore significantly low values will be below 21 and significantly high values will be above 39.
---
### Part c
#### Theorem - Normal Approximation to Binomial
If X is a binomial random variable with mean $\mu=np$ and variance $\sigma^2=npq$ then limiting form of the distribution of
$Z = \frac{X-np}{\sqrt{npq}}$
as $n \rightarrow \infty$, is the standard normal distribution $n(z;0,1)$.
<br>
##### Conditions
1. n is sufficiently large and p is not very close to 0 or 1.
2. Approximation works well even when n is small if p is reasonably close to 0.5.
3. General rule of thumb is to use $np \quad and \quad n(1-p) \quad \ge 10$.
$np = 43*0.7 = 30.1 > 10$
$n(1-p) = 43*0.3 = `r 43*0.3` > 10$
<br>
##### Continuity Correction
A continuity correction of 0.5 is made to the random variable X because we are approximating a discrete distribution with continuous distribution.
$$\begin{aligned}
P(X \le x) &= \sum_{x=0}^{k}{b(x;n,p)} \\
&\approx \text{area under normal curve to the left of k + 0.5} \\
&= P(Z \le \frac{k + 0.5 - np}{\sqrt{np(1-p)}})
\end{aligned}$$
<br>
```{r Continuity Correction for normal approximation to binomial}
n <- 43
p <- 0.7
k <- 30
q <- 1 - p
mu <- n*p
sd <- sqrt(n*p*q)
zs <- (k+0.5-mu)/sd
prob_binom <- pbinom(q = k, prob = p, size = n)
prob_norm <- pnorm(q = k, mean = mu, sd = sd)
prob_norm_cc <- pnorm(q = zs)
```
<br>
probability using binomial distribution for $P(X \le `r k`)$ is given by `r prob_binom`
probability using normal approximation for $P(X \le `r k`)$ is given by `r prob_norm`
probability using normal approximation with continuity correction for $P(X \le `r k`)$ is given by `r prob_norm_cc`
---
### Part d
$$\begin{aligned}
P(X = k) &= b(k;n,p) \\
&= b(30;43,0.7) \\
&\approx \text{area under normal curve between k-0.5 & k + 0.5} \\
&= P \left( \frac{k - 0.5 - np}{\sqrt{np(1-p)}} < Z < \frac{k + 0.5 - np}{\sqrt{np(1-p)}} \right) \\
&= P \left( \frac{30 - 0.5 - 30}{3} < Z < \frac{30 + 0.5 - 30}{3} \right) \\
&= P \left( `r -1/6` < Z < `r 1/6` \right) \\
\end{aligned}$$
<br>
```{r }
k2 <- 30
z_l <- (k2-0.5-mu)/sd
z_u <- (k2+0.5-mu)/sd
prob_binom_2 <- dbinom(x = k2, prob = p, size = n)
prob_norm_cc_2 <- pnorm(q = z_u) - pnorm(q = z_l)
```
<br>
probability using binomial distribution for $P(X = `r k2`)$ is given by `r prob_binom_2`
probability using normal approximation with continuity correction
$P(X = `r k2`) \approx P(`r k2 - 0.5` < X < `r k2 + 0.5`) = `r prob_norm_cc_2`$
---
### Plot for the normal approximation
```{r echo=FALSE}
library(ggplot2)
n <- 43
p <- 0.7
q <- 1 - p
mu <- n*p
sd <- sqrt(n*p*q)
x <- seq(0, n, 1)
y <- dbinom(x = x, size = n, prob = p)
df <- data.frame(x = x, y = y)
col_binom <- "#F39C12"
col_norm <- "#257CE4"
ggplot(df, aes(x = x, y = y)) +
theme_classic() +
scale_x_continuous(
name = 'X', expand = expansion(mult = c(0, 0))
) +
scale_y_continuous(
name = 'pmf - binom(x)', expand = expansion(mult = c(0, 0.05))
) +
geom_bar(stat = "identity", color = col_binom, fill = col_binom) +
geom_line(stat = "function", fun = dnorm, args = list(mean = mu, sd = sd),
color = col_norm, size = 1
)
```