In: Statistics and Probability
There are two candidate RNAs for COVID-19 diagnosis: RNA1, RNA2. Canadian Disease Control Center carried out a clinical trial to check the expression levels for these two RNAs in the subjects with the virus infection: one group of 50 randomly recruited subjects has no critical symptoms; and the other group of 50 subjects has symptoms. After normalization, RNA1 expression levels follow a normal distribution N(0,1) for no-symptom subjects while N(1,1) for subjects with symptoms requiring hospitalization. For RNA2, the corresponding expression levels in nonsymptom subjects and subjects with symptoms follow normal distributions N(0,1) and N(-1,1), respectively.
a. For one COVID-19 patient with a normalized RNA1 expression level of 2, what is the log-likelihood ratio (LLR) of this patient being diagnosed as requiring hospitalization? (3 pts)
b. Using a naive Bayes classifier, if we know RNA1 = 2 and RNA2 = 1, what will be the naive Bayes score of the patient being hospitalized? (3 pts)
c. What is the basic assumption of the naive Bayes classifier? Under what situations may it be problematic? (4 pts)
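As a reference for parts (a) and (b), the LLR and naive Bayes score can be computed directly from the stated normal distributions. The sketch below is illustrative only; the helper `normal_logpdf` is not part of any library named in the question.

```python
import math

def normal_logpdf(x, mu, sigma=1.0):
    # Log-density of N(mu, sigma^2) evaluated at x
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

# Part a: LLR for RNA1 = 2, comparing symptoms ~ N(1,1) against no symptoms ~ N(0,1)
llr_rna1 = normal_logpdf(2, 1) - normal_logpdf(2, 0)
print(llr_rna1)  # 1.5

# Part b: under conditional independence, the naive Bayes score adds per-feature LLRs.
# RNA2 = 1, comparing symptoms ~ N(-1,1) against no symptoms ~ N(0,1)
llr_rna2 = normal_logpdf(1, -1) - normal_logpdf(1, 0)
score = llr_rna1 + llr_rna2
print(score)  # 0.0
```

The normalizing constants cancel in each ratio, so with unit variance the LLR reduces to the difference of squared distances to the two class means, divided by 2.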
a. Infection by the virus can be provisionally diagnosed on the basis of symptoms, though confirmation is ultimately by reverse transcription polymerase chain reaction (rRT-PCR) of infected secretions (71% sensitivity) or CT imaging (98% sensitivity).
A person is considered at risk if they have travelled to an area with ongoing community transmission within the previous 14 days, or have had close contact with an infected person.
Common key indicators include fever, coughing, and shortness of breath. Other possible indicators include fatigue, myalgia, anorexia, sputum production, and sore throat.
b. It is easy and fast to predict the class of a test data set, and it also performs well in multiclass prediction.
When the assumption of independence holds, a naive Bayes classifier performs better compared to other models such as logistic regression, and it needs less training data.
It performs well with categorical input variables compared to numerical variables; for numerical variables, a normal distribution is assumed (a bell curve, which is a strong assumption).
c. The naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence. P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
Naive Bayes is so called because the independence assumptions we have just made are indeed very naive for a model of natural language. The conditional independence assumption states that features are independent of each other given the class; this is hardly ever true for terms in documents.
A subtle issue (a "disadvantage", if you like) with naive Bayes is that if you have no occurrences of a class label and a certain attribute value together (e.g. class="nice", shape="sphere"), then the frequency-based probability estimate will be zero. Given the naive Bayes conditional independence assumption, when all the probabilities are multiplied you will get zero, and this will affect the posterior probability estimate.
This problem happens when we are drawing samples from a population and the drawn vectors are not fully representative of the population. Laplace correction and other smoothing schemes have been proposed to avoid this undesirable situation.
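The zero-frequency fix described above can be sketched with add-one (Laplace) smoothing. The toy counts and the `smoothed_prob` helper below are made up for illustration, reusing the class="nice", shape="sphere" example:

```python
from collections import Counter

def smoothed_prob(counts, value, num_values, alpha=1.0):
    # Laplace (add-alpha) smoothing: every possible attribute value gets a
    # pseudo-count of alpha, so unseen (class, value) pairs never estimate to zero.
    total = sum(counts.values())
    return (counts.get(value, 0) + alpha) / (total + alpha * num_values)

# Toy example: shapes observed among training items of class "nice"
counts = Counter({"cube": 3, "pyramid": 1})  # "sphere" never observed
num_shapes = 3                               # possible values: cube, pyramid, sphere

p_sphere = smoothed_prob(counts, "sphere", num_shapes)
print(p_sphere)  # (0 + 1) / (4 + 3) = 1/7, nonzero despite a zero count
```

With the raw frequency estimate, P(shape="sphere" | class="nice") would be 0/4 = 0 and would zero out the whole product of likelihoods; smoothing keeps the posterior estimate usable.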