In: Statistics and Probability
The birthday problem considers the probability that two people in a group of a given size have the same birth date. We will assume a 365 day year (no leap year birthdays).
Code set-up
Dobrow 2.28 provides useful R code for simulating the birthday problem. Imagine we want to obtain an empirical estimate of the probability that two people in a class of a given size will have the same birth date. The code
trial = sample(1:365, numstudents, replace=TRUE)
simulates birthdays from a group of numstudents students. So you can assign numstudents or just replace numstudents with the number of students in the class of interest.
If we store the list of birthdays in the variable trial, the code
2 %in% table(trial)
will create a frequency table of birthdays and then determine if there is a match (2 birthdays the same). We can use this code in an if-else statement to record whether a class has at least one pair of students with the same birth date. We then can embed the code within a for-loop to repeat the experiment, store successes in a vector, and then take the average number of successes (a birthday match) across the repeated tasks.
The problems
Recall that the true probability is 1-prod(seq(343,365))/(365)^23 which is approximately 50%.
# [Place code here]
Place your answers to the three items below here:
The birthday problem considers the probability that two people in a group of a given size have the same birth date. We will assume a 365 day year (no leap year birthdays).
Code set-up
Dobrow 2.28 provides useful R code for simulating the birthday problem. Imagine we want to obtain an empirical estimate of the probability that two people in a class of a given size will have the same birth date. The code
trial = sample(1:365, numstudents, replace=TRUE)
simulates birthdays from a group of numstudents students. So you can assign numstudents or just replace numstudents with the number of students in the class of interest.
If we store the list of birthdays in the variable trial, the code
2 %in% table(trial)
will create a frequency table of birthdays and then determine if there is a match (2 birthdays the same). We can use this code in an if-else statement to record whether a class has at least one pair of students with the same birth date. We then can embed the code within a for-loop to repeat the experiment, store successes in a vector, and then take the average number of successes (a birthday match) across the repeated tasks.
The problems
Recall that the true probability is 1-prod(seq(343,365))/(365)^23 which is approximately 50%.
# [Place code here]
Place your answers to the three items below here:
1)
For each class of 23 students, we can simulate their birthdays in a 365 day year using the code
trial = sample(1:365, 23, replace=TRUE)
Then use
table(trial)
to find the frequencies of each date. We need to check if any of these frequencies are greater than 1. We do that with,
max(table(trial)) > 1
This will give a TRUE or FALSE answer which we can store in an array. Now for the simulation, we put everything a for-loop and store the result in an array called success.
numstudents = 23
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
Then count the proportion of TRUE in the array success using
sum(success)/1000
Running that code with seed set to 123 gives the answer 0.515, which is very close to 50%.
2)
Now, to find the number of students required in a class, we run the code for different values of the variable 'numstudents' and record the results.
For
numstudents = 30
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
gives 0.693 so we need to go higher. Let's try 40.
numstudents = 40
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
And we get 0.898. So higher still.
At numstudents = 50 we get 0.97. So we need to go lower.
At numstudents = 45 we get 0.938. So a little higher.
At numstudents = 47 we get 0.954. So let's check 46.
At numstudents = 46 we get 0.943.
Thus, 47 is the smallest class size that gives the estimated probability as greater than 95%.
Note that this code is only reproducible with the right seed and you will get slightly different values otherwise.
3)
For a class size of 50 we find the trial array again, and then in table(trial) we must check if any value is at least 3 or greater than 2. So the code is:
numstudents = 50
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>2)
}
sum(success)/1000
And I got the output 0.128.
Code:
set.seed(123)
numstudents = 23
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
numstudents = 30
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
numstudents = 40
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
numstudents = 50
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
numstudents = 45
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
numstudents = 47
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
numstudents = 46
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>1)
}
sum(success)/1000
numstudents = 50
success = c()
for (i in 1:1000) {
trial = sample(1:365, numstudents, replace=TRUE)
success[i]<- (max(table(trial))>2)
}
sum(success)/1000
0.515
47
0.128