In: Statistics and Probability
(1)
We use the discrete uniform distribution with parameters a = 1 and b = 10 to generate 7 integers with equal probability, i.e., each value is returned with probability p = 1/10.
The shorthand X ∼ Discrete Uniform (a, b) is used to indicate that the random variable X has the discrete uniform distribution with integer parameters a and b, where a < b. A discrete uniform random variable X with parameters a and b has probability mass function
P(x)=1/(b-a+1), x=a, a+1, a+2,…,b
Using R, we generated 7 integers, each occurring with probability 1/10: (7, 5, 2, 4, 8, 3, 6).
# R code to generate 7 integers with equal probability (each with probability 1/10)
require(extraDistr)
rdunif(n = 7, min = 1, max = 10)   # discrete uniform, a = 1, b = 10

To test whether our procedure is correct, we generate 10,000 numbers from the same model and check the proportion of each generated value; each proportion should be close to 1/10.

prop.table(table(rdunif(n = 10000, min = 1, max = 10)))
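The same idea can be sketched in pure Python (a hedged, illustrative equivalent of the R code above, not part of the original answer): draw from the discrete uniform on {1, …, 10} and check empirically that each value appears with proportion close to 1/10.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

# Draw 7 integers uniformly from {1, ..., 10}; each value has probability 1/10.
draws = [random.randint(1, 10) for _ in range(7)]

# Sanity check, mirroring prop.table(table(...)) in R: with 10,000 draws,
# the observed proportion of each value should be close to 0.10.
n = 10_000
counts = Counter(random.randint(1, 10) for _ in range(n))
proportions = {k: counts[k] / n for k in sorted(counts)}
print(draws)
print(proportions)
```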
(2)
Receiver Operating Characteristic (ROC) Curve:
A Receiver Operating Characteristic (ROC) Curve is a way to compare diagnostic tests. It is a plot of the true positive rate against the false positive rate.
A ROC plot shows:
· The trade-off between sensitivity and specificity: as the decision threshold moves, a decrease in sensitivity results in an increase in specificity, and vice versa.
· Test accuracy: the closer the curve is to the top and left-hand borders, the more accurate the test; likewise, the closer the curve is to the diagonal, the less accurate the test. A perfect test would rise straight from the origin to the top-left corner and then run straight across the top.
· The likelihood ratio, given by the slope (derivative) of the curve at any particular cutpoint.
Test accuracy is also summarized by the area under the curve (which can be calculated with integral calculus). The greater the area under the curve, the more accurate the test; a perfect test has an area under the ROC curve (AUROC) of 1. The diagonal line in a ROC plot represents pure chance: a test that follows the diagonal has no better odds of detecting the condition than a random coin flip. The area under the diagonal is 0.5 (half of the area of the plot), so a useless test (one with no better odds than chance alone) has an AUROC of 0.5.
(Figure: a ROC curve showing two tests. The red test is closer to the diagonal and is therefore less accurate than the green test.)
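The ROC construction and the area-under-the-curve calculation described above can be sketched in Python. The labels and scores below are hypothetical values invented for illustration (they are not taken from the figure); the AUC is computed with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Sweep the threshold over distinct scores and return (FPR, TPR) pairs."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for score, label in pairs:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0, 1, 0]                    # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]   # hypothetical test scores
print(auc(roc_points(labels, scores)))
```

A test whose curve hugs the top-left corner yields an AUC near 1; scores no better than chance yield an AUC near 0.5, matching the diagonal described above.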
Sensitivity:
The sensitivity of a test (also called the true positive rate) is defined as the proportion of people with the disease who will have a positive result. In other words, a highly sensitive test is one that correctly identifies patients with a disease. A test that is 100% sensitive will identify all patients who have the disease. It’s extremely rare that any clinical test is 100% sensitive. A test with 90% sensitivity will identify 90% of patients who have the disease, but will miss 10% of patients who have the disease.
Specificity:
The specificity of a test (also called the True Negative Rate) is the proportion of people without the disease who will have a negative result. In other words, the specificity of a test refers to how well a test identifies patients who do not have a disease. A test that has 100% specificity will identify 100% of patients who do not have the disease. A test that is 90% specific will identify 90% of patients who do not have the disease.
Tests with a high specificity (a high true negative rate) are most useful when the result is positive, because a positive result from a highly specific test is unlikely to be a false positive and so helps rule the disease in.
Confusion matrix:
A confusion matrix, in predictive analytics, is a two-by-two table that tells us the rate of false positives, false negatives, true positives and true negatives for a test or predictor. We can make a confusion matrix if we know both the predicted values and the true values for a sample set.
In machine learning and statistical classification, a confusion matrix is a table in which predictions are represented in columns and actual status is represented in rows. Sometimes this is reversed, with predictions in rows and actual instances in columns. The table is an extension of the confusion matrix in predictive analytics, and makes it easy to see whether mislabeling has occurred and whether the predictions are more or less correct.
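The definitions above can be tied together in a short Python sketch: starting from a hypothetical two-by-two confusion matrix (the counts are invented for illustration), compute sensitivity, specificity, and the false positive rate used on the ROC axes.

```python
# Hypothetical 2x2 confusion matrix, with actual status in rows and
# predictions in columns, as described in the text.
tp, fn = 90, 10   # actual positives: 90 correctly flagged, 10 missed
fp, tn = 20, 80   # actual negatives: 20 false alarms, 80 correctly cleared

sensitivity = tp / (tp + fn)          # true positive rate: 90 / 100
specificity = tn / (tn + fp)          # true negative rate: 80 / 100
false_positive_rate = 1 - specificity # x-axis of the ROC plot

print(sensitivity, specificity, false_positive_rate)
```

So this hypothetical test identifies 90% of patients who have the disease but misses 10%, and correctly clears 80% of patients who do not have it.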
(3)
P-value:
The p-value is the observed level of significance: the probability, assuming the null hypothesis is true, of obtaining a difference at least as large as the one observed purely by chance. For example, when comparing the average sales of two shops, a p-value < 0.05 (the chosen level of significance) means the observed difference is unlikely to be due to chance alone, so it is declared statistically significant; a p-value > 0.05 means the difference could easily have arisen by chance, so no significant difference is concluded.
In other words, in statistical hypothesis testing, the p-value or probability value is, for a given statistical model, the probability that, when the null hypothesis is true, the statistical summary (such as the absolute value of the sample mean difference between two compared groups) would be greater than or equal to the actual observed result.
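The two-shops example can be made concrete with a permutation test, one simple way to obtain a p-value without distributional assumptions (the sales figures below are hypothetical, invented for illustration): if the null hypothesis is true and the shops do not really differ, the shop labels are arbitrary, so we shuffle them many times and count how often a mean difference at least as large as the observed one arises by chance.

```python
import random
from statistics import mean

# Hypothetical daily sales for two shops (numbers invented for illustration).
shop_a = [52, 48, 55, 60, 47, 53, 58, 50]
shop_b = [45, 43, 50, 41, 46, 44, 48, 42]
observed = abs(mean(shop_a) - mean(shop_b))

random.seed(0)  # fixed seed so the sketch is reproducible
pooled = shop_a + shop_b
n_perm = 10_000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)  # relabel under the null hypothesis
    diff = abs(mean(pooled[:8]) - mean(pooled[8:]))
    if diff >= observed:
        count += 1

p_value = count / n_perm
print(p_value)  # small value: the difference is unlikely to be chance alone
```

Here the p-value comes out far below 0.05, so by the criterion in the text the difference in average sales would be called statistically significant.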