Naive Bayes Theorem
See the dataset D in Table 1, which contains clinical data for 14 patients. Using the data in D, build the Naive Bayes classifier and predict the disease label for the patients in Table 2. Then compare your predicted labels with the ground-truth labels (i.e., the 'Disease' column) and report the accuracy P.
Table 1: Dataset D with clinical data of 14 patients
ID | HBP | BMI | Drink | Weight | Disease
1 | Yes | Normal | No | Overweight | Yes
2 | No | Normal | Yes | Normal | No
3 | No | Critical | No | Overweight | Yes
4 | No | High | Yes | Overweight | Yes
5 | Yes | Critical | Yes | Obese | Yes
6 | Yes | High | Yes | Normal | Yes
7 | No | High | No | Obese | No
8 | Yes | Normal | Yes | Normal | Yes
9 | Yes | Critical | No | Obese | Yes
10 | No | Normal | No | Overweight | No
11 | No | Critical | Yes | Normal | Yes
12 | Yes | High | No | Overweight | No
13 | Yes | Normal | Yes | Overweight | Yes
14 | Yes | High | No | Obese | No
Table 2: Test data with additional 5 patients
ID | HBP | BMI | Drink | Weight | Disease
15 | Yes | Normal | No | Overweight | Yes
16 | No | Normal | Yes | Normal | No
17 | No | Critical | No | Overweight | Yes
18 | No | High | Yes | Overweight | Yes
19 | Yes | Critical | Yes | Obese | Yes
From the Disease column of Table 1, P(Yes) = 9/14 and P(No) = 5/14.
Table 1 can be split by class (Disease = Yes and Disease = No):
Disease = Yes:
Id | HBP | BMI | Drink | Weight | Disease
1 | Yes | Normal | No | Overweight | Yes
3 | No | Critical | No | Overweight | Yes
4 | No | High | Yes | Overweight | Yes
5 | Yes | Critical | Yes | Obese | Yes
6 | Yes | High | Yes | Normal | Yes
8 | Yes | Normal | Yes | Normal | Yes
9 | Yes | Critical | No | Obese | Yes
11 | No | Critical | Yes | Normal | Yes
13 | Yes | Normal | Yes | Overweight | Yes
Disease = No:
Id | HBP | BMI | Drink | Weight | Disease
2 | No | Normal | Yes | Normal | No
7 | No | High | No | Obese | No
10 | No | Normal | No | Overweight | No
12 | Yes | High | No | Overweight | No
14 | Yes | High | No | Obese | No
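The class split gives the counts needed for the priors and the per-feature conditional probabilities. A minimal Python sketch of that counting step follows; the names `TRAIN`, `FEATURES`, `priors`, and `likelihood` are illustrative and not part of the original problem, and no smoothing is applied (an assumption that matches the hand calculation).

```python
from collections import Counter, defaultdict
from fractions import Fraction

FEATURES = ("HBP", "BMI", "Drink", "Weight")

# Table 1 rows for IDs 1..14: (HBP, BMI, Drink, Weight, Disease)
TRAIN = [
    ("Yes", "Normal", "No", "Overweight", "Yes"),
    ("No", "Normal", "Yes", "Normal", "No"),
    ("No", "Critical", "No", "Overweight", "Yes"),
    ("No", "High", "Yes", "Overweight", "Yes"),
    ("Yes", "Critical", "Yes", "Obese", "Yes"),
    ("Yes", "High", "Yes", "Normal", "Yes"),
    ("No", "High", "No", "Obese", "No"),
    ("Yes", "Normal", "Yes", "Normal", "Yes"),
    ("Yes", "Critical", "No", "Obese", "Yes"),
    ("No", "Normal", "No", "Overweight", "No"),
    ("No", "Critical", "Yes", "Normal", "Yes"),
    ("Yes", "High", "No", "Overweight", "No"),
    ("Yes", "Normal", "Yes", "Overweight", "Yes"),
    ("Yes", "High", "No", "Obese", "No"),
]

# Class priors P(Disease = c)
class_counts = Counter(row[-1] for row in TRAIN)
priors = {c: Fraction(n, len(TRAIN)) for c, n in class_counts.items()}

# value_counts[c][feature][value] = number of class-c rows with that value
value_counts = defaultdict(lambda: defaultdict(Counter))
for *values, label in TRAIN:
    for feature, value in zip(FEATURES, values):
        value_counts[label][feature][value] += 1


def likelihood(feature, value, label):
    """P(feature = value | Disease = label), without smoothing."""
    return Fraction(value_counts[label][feature][value], class_counts[label])


print(priors["Yes"], priors["No"])           # 9/14 5/14
print(likelihood("HBP", "Yes", "Yes"))       # 2/3
print(likelihood("BMI", "Critical", "No"))   # 0
```

Exact fractions are used so the values can be checked directly against the tables.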
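Now, for the test data in Table 2:
Row 1 (ID 15): X = (HBP = Yes, BMI = Normal, Drink = No, Weight = Overweight).
P(X) = (8/14)*(5/14)*(7/14)*(6/14) = 0.044.
P(X | Disease = Yes) P(Disease = Yes) = (6/9)*(3/9)*(3/9)*(4/9)*(9/14) = 0.0212
P(X | Disease = No) P(Disease = No) = (2/5)*(2/5)*(4/5)*(2/5)*(5/14) = 0.0183
P(Disease = Yes | X) = 0.0212/0.044 = 0.48
P(Disease = No | X) = 0.0183/0.044 = 0.42
P(X) is the same for both classes, so it does not affect which class wins.
Predicted Row 1: Yes
Similarly, for the remaining rows, comparing P(X | Disease = Yes) P(Disease = Yes) with P(X | Disease = No) P(Disease = No):
ID 16: 0.0159 vs 0.0034, predicted Yes
ID 17: 0.0141 vs 0 (no patient with Disease = No has BMI = Critical), predicted Yes
ID 18: 0.0141 vs 0.0103, predicted Yes
ID 19: 0.0282 vs 0 (again BMI = Critical), predicted Yes
Every row is predicted 'Yes'. The ground-truth label for ID 16 is 'No', so ID 16 is misclassified.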
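The per-row comparison can be reproduced with a short Python sketch. It computes the unnormalized score P(X | Disease = c) P(Disease = c) for each class and picks the larger one; dividing by P(X) is skipped because it is the same for both classes. The names `TRAIN`, `TEST`, `score`, and `predict` are illustrative, and no smoothing is used (an assumption).

```python
from fractions import Fraction

# Table 1 rows for IDs 1..14: (HBP, BMI, Drink, Weight, Disease)
TRAIN = [
    ("Yes", "Normal", "No", "Overweight", "Yes"),
    ("No", "Normal", "Yes", "Normal", "No"),
    ("No", "Critical", "No", "Overweight", "Yes"),
    ("No", "High", "Yes", "Overweight", "Yes"),
    ("Yes", "Critical", "Yes", "Obese", "Yes"),
    ("Yes", "High", "Yes", "Normal", "Yes"),
    ("No", "High", "No", "Obese", "No"),
    ("Yes", "Normal", "Yes", "Normal", "Yes"),
    ("Yes", "Critical", "No", "Obese", "Yes"),
    ("No", "Normal", "No", "Overweight", "No"),
    ("No", "Critical", "Yes", "Normal", "Yes"),
    ("Yes", "High", "No", "Overweight", "No"),
    ("Yes", "Normal", "Yes", "Overweight", "Yes"),
    ("Yes", "High", "No", "Obese", "No"),
]

# Table 2 feature values keyed by patient ID: (HBP, BMI, Drink, Weight)
TEST = {
    15: ("Yes", "Normal", "No", "Overweight"),
    16: ("No", "Normal", "Yes", "Normal"),
    17: ("No", "Critical", "No", "Overweight"),
    18: ("No", "High", "Yes", "Overweight"),
    19: ("Yes", "Critical", "Yes", "Obese"),
}


def score(x, label):
    """Unnormalized posterior: P(X | Disease = label) * P(Disease = label)."""
    rows = [r for r in TRAIN if r[-1] == label]
    s = Fraction(len(rows), len(TRAIN))  # class prior
    for i, value in enumerate(x):
        # fraction of class rows whose i-th feature equals this value
        s *= Fraction(sum(r[i] == value for r in rows), len(rows))
    return s


def predict(x):
    """Return the class with the larger unnormalized posterior."""
    return max(("Yes", "No"), key=lambda c: score(x, c))


for pid, x in TEST.items():
    yes, no = score(x, "Yes"), score(x, "No")
    print(pid, round(float(yes), 4), round(float(no), 4), predict(x))
```

A zero score for a class means at least one feature value never occurs with that class in Table 1; in practice Laplace smoothing is often used to avoid this, but it is not part of the solution here.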
So, the accuracy is P = (4/5)*100 = 80%.
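The accuracy figure can be checked by comparing the predictions with the 'Disease' column of Table 2. This is a minimal sketch; the dictionaries `truth` and `predicted` are illustrative names, and `predicted` simply hard-codes the 'Yes' prediction obtained for every test row.

```python
# Ground-truth labels from Table 2, keyed by patient ID
truth = {15: "Yes", 16: "No", 17: "Yes", 18: "Yes", 19: "Yes"}

# Every test row was predicted 'Yes' (hard-coded from the solution)
predicted = {pid: "Yes" for pid in truth}

correct = sum(predicted[pid] == truth[pid] for pid in truth)
accuracy = correct / len(truth)
misclassified = [pid for pid in truth if predicted[pid] != truth[pid]]

print(f"accuracy = {accuracy:.0%}, misclassified = {misclassified}")
```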