Apply the classification algorithm to the following set of data records. Draw a decision tree. The class attribute is Repeat Customer.
RID | Age    | City | Gender | Education   | Repeat Customer
101 | 20..30 | NY   | F      | College     | YES
102 | 20..30 | SF   | M      | Graduate    | YES
103 | 31..40 | NY   | F      | College     | YES
104 | 51..60 | NY   | F      | College     | NO
105 | 31..40 | LA   | M      | High school | NO
106 | 41..50 | NY   | F      | College     | YES
107 | 41..50 | NY   | F      | Graduate    | YES
108 | 20..30 | LA   | M      | College     | YES
109 | 20..30 | NY   | F      | High school | NO
110 | 20..30 | NY   | F      | College     | YES
We start by computing the entropy of the entire set. There are 7 positive samples and 3 negative samples. (All logarithms below are base 2.)
The entropy is I(7,3) = -(7/10 * log(7/10) + 3/10 * log(3/10)) = 0.88
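As a sanity check, the entropy formula can be written as a small Python helper (a sketch; the function name `entropy` and the counts-list interface are our own):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(round(entropy([7, 3]), 2))   # entropy of 7 YES / 3 NO -> 0.88
```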
We consider the first attribute, AGE. It has 4 values.
20..30 appears 5 times (4 YES, 1 NO):
I(s11, s21) = -(4/5 * log(4/5) + 1/5 * log(1/5)) = 0.72
31..40 appears 2 times (1 YES, 1 NO):
I(s12, s22) = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1
41..50 appears 2 times (2 YES):
I(s13, s23) = -(2/2 * log(2/2)) = 0
51..60 appears 1 time (1 NO):
I(s14, s24) = -(1/1 * log(1/1)) = 0
E(AGE) = 5/10 * 0.72 + 2/10 * 1 + 2/10 * 0 + 1/10 * 0 = 0.56
GAIN(AGE) = 0.88 - 0.56 = 0.32
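The AGE arithmetic can be checked in a few lines of Python (a sketch; the (YES, NO) pairs are read off the table above):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# (YES, NO) counts for each AGE value, read off the table above
age = {"20..30": (4, 1), "31..40": (1, 1), "41..50": (2, 0), "51..60": (0, 1)}
n = sum(y + no for y, no in age.values())                        # 10 records
e_age = sum((y + no) / n * entropy([y, no]) for y, no in age.values())
gain_age = entropy([7, 3]) - e_age
print(round(e_age, 2), round(gain_age, 2))   # 0.56 0.32
```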
We consider the second attribute, CITY. It has 3 values.
LA occurs 2 times (1 YES, 1 NO):
I(s11, s21) = -(1/2 * log(1/2) + 1/2 * log(1/2)) = 1
NY occurs 7 times (5 YES, 2 NO):
I(s12, s22) = -(2/7 * log(2/7) + 5/7 * log(5/7)) = 0.86
SF occurs 1 time (1 YES):
I(s13, s23) = -(1/1 * log(1/1)) = 0
E(CITY) = 2/10 * 1 + 7/10 * 0.86 + 1/10 * 0 = 0.80
GAIN(CITY) = 0.88 - 0.80 = 0.08
We consider the third attribute, GENDER. It has 2 values.
F occurs 7 times (5 YES, 2 NO):
I(s11, s21) = -(2/7 * log(2/7) + 5/7 * log(5/7)) = 0.86
M occurs 3 times (2 YES, 1 NO):
I(s12, s22) = -(1/3 * log(1/3) + 2/3 * log(2/3)) = 0.92
E(GENDER) = 7/10 * 0.86 + 3/10 * 0.92 = 0.88
GAIN(GENDER) = 0.88 - 0.88 = 0
We consider the fourth attribute, EDUCATION. It has 3 values.
HS occurs 2 times (2 NO):
I(s11, s21) = -(2/2 * log(2/2)) = 0
COLLEGE occurs 6 times (5 YES, 1 NO):
I(s12, s22) = -(1/6 * log(1/6) + 5/6 * log(5/6)) = 0.65
GRAD occurs 2 times (2 YES):
I(s13, s23) = -(2/2 * log(2/2)) = 0
E(EDUCATION) = 2/10 * 0 + 6/10 * 0.65 + 2/10 * 0 = 0.39
GAIN(EDUCATION) = 0.88 - 0.39 = 0.49
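All four first-level gains can be verified together with a short script (a sketch; the `records` tuples transcribe the table above, with RID 110's lowercase "college" normalized to "College"):

```python
from math import log2
from collections import Counter

# The ten records from the table: (RID, Age, City, Gender, Education, class)
records = [
    (101, "20..30", "NY", "F", "College", "YES"),
    (102, "20..30", "SF", "M", "Graduate", "YES"),
    (103, "31..40", "NY", "F", "College", "YES"),
    (104, "51..60", "NY", "F", "College", "NO"),
    (105, "31..40", "LA", "M", "High school", "NO"),
    (106, "41..50", "NY", "F", "College", "YES"),
    (107, "41..50", "NY", "F", "Graduate", "YES"),
    (108, "20..30", "LA", "M", "College", "YES"),
    (109, "20..30", "NY", "F", "High school", "NO"),
    (110, "20..30", "NY", "F", "College", "YES"),
]

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain from splitting `rows` on the attribute at index `col`."""
    expected = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == v]
        expected += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - expected

for i, name in enumerate(["AGE", "CITY", "GENDER", "EDUCATION"], start=1):
    print(name, round(gain(records, i), 2))   # 0.32, 0.08, 0.0, 0.49
```

EDUCATION comes out on top, matching the hand calculation (GENDER's tiny gain rounds to 0).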
The greatest gain is for the EDUCATION attribute.
The tree at this point would look like the following:
-------------------
| EDUCATION |
-------------------
/ | \
HS / COLLEGE | \ GRAD
/ | \
RIDS: {105,109} {101,103,104, {102,107}
same class: NO 106,108,110} same class: YES
Only the middle (COLLEGE) node is not a LEAF node, so we continue with those 6 records and consider only the remaining attributes.
The entropy is I(5,1) = -(5/6 * log(5/6) + 1/6 * log(1/6)) = 0.65
We consider the first attribute, AGE. It has 4 values among these records.
20..30 appears 3 times (3 YES):
I(s11, s21) = -(3/3 * log(3/3)) = 0
31..40 appears 1 time (1 YES):
I(s12, s22) = -(1/1 * log(1/1)) = 0
41..50 appears 1 time (1 YES):
I(s13, s23) = -(1/1 * log(1/1)) = 0
51..60 appears 1 time (1 NO):
I(s14, s24) = -(1/1 * log(1/1)) = 0
E(AGE) = 0
GAIN(AGE) = 0.65 - 0 = 0.65
We consider the second attribute, CITY. It has 2 values among these records (none of them is in SF).
LA occurs 1 time (1 YES):
I(s11, s21) = -(1/1 * log(1/1)) = 0
NY occurs 5 times (4 YES, 1 NO):
I(s12, s22) = -(1/5 * log(1/5) + 4/5 * log(4/5)) = 0.72
E(CITY) = 1/6 * 0 + 5/6 * 0.72 = 0.60
GAIN(CITY) = 0.65 - 0.60 = 0.05
We consider the third attribute, GENDER. It has 2 values among these records.
F occurs 5 times (4 YES, 1 NO):
I(s11, s21) = -(1/5 * log(1/5) + 4/5 * log(4/5)) = 0.72
M occurs 1 time (1 YES):
I(s12, s22) = -(1/1 * log(1/1)) = 0
E(GENDER) = 5/6 * 0.72 + 1/6 * 0 = 0.60
GAIN(GENDER) = 0.65 - 0.60 = 0.05
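The second-level numbers can be checked the same way by restricting the computation to the six COLLEGE records (a sketch reusing the same hypothetical helpers; RID 110's "college" is normalized to "College"):

```python
from math import log2
from collections import Counter

# The ten records from the table: (RID, Age, City, Gender, Education, class)
records = [
    (101, "20..30", "NY", "F", "College", "YES"),
    (102, "20..30", "SF", "M", "Graduate", "YES"),
    (103, "31..40", "NY", "F", "College", "YES"),
    (104, "51..60", "NY", "F", "College", "NO"),
    (105, "31..40", "LA", "M", "High school", "NO"),
    (106, "41..50", "NY", "F", "College", "YES"),
    (107, "41..50", "NY", "F", "Graduate", "YES"),
    (108, "20..30", "LA", "M", "College", "YES"),
    (109, "20..30", "NY", "F", "High school", "NO"),
    (110, "20..30", "NY", "F", "College", "YES"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    expected = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == v]
        expected += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - expected

# Keep only the COLLEGE node's records: RIDs 101,103,104,106,108,110
college = [r for r in records if r[4] == "College"]
for i, name in [(1, "AGE"), (2, "CITY"), (3, "GENDER")]:
    print(name, round(gain(college, i), 2))   # AGE 0.65, CITY 0.05, GENDER 0.05
```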
The greatest gain is for the AGE attribute.
The tree at this point would look like the following, and we are finished:
----------------------
| EDUCATION |
----------------------
/ | \
HS / COLLEGE | \ GRAD
/ | \
----------------
RIDS: {105,109} | AGE | {102,107}
same class: NO ---------------- same class: YES
/ / | \
/ / | \
20..30 / /31..40 |41..50 \ 51..60
{101,108,110} {103} {106} {104}
same class: YES YES YES NO
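The whole construction above can be automated with a short recursive ID3 sketch (assumptions of ours, not part of the exercise: ties are broken arbitrarily by `max`, a majority vote is used if attributes run out, and the tuple-and-dict tree encoding is our own):

```python
from math import log2
from collections import Counter

# The ten records from the table: (RID, Age, City, Gender, Education, class)
records = [
    (101, "20..30", "NY", "F", "College", "YES"),
    (102, "20..30", "SF", "M", "Graduate", "YES"),
    (103, "31..40", "NY", "F", "College", "YES"),
    (104, "51..60", "NY", "F", "College", "NO"),
    (105, "31..40", "LA", "M", "High school", "NO"),
    (106, "41..50", "NY", "F", "College", "YES"),
    (107, "41..50", "NY", "F", "Graduate", "YES"),
    (108, "20..30", "LA", "M", "College", "YES"),
    (109, "20..30", "NY", "F", "High school", "NO"),
    (110, "20..30", "NY", "F", "College", "YES"),
]
ATTR_NAMES = {1: "AGE", 2: "CITY", 3: "GENDER", 4: "EDUCATION"}

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    expected = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == v]
        expected += len(subset) / len(rows) * entropy(subset)
    return entropy([r[-1] for r in rows]) - expected

def id3(rows, attrs):
    """Return a leaf label, or (attribute_name, {value: subtree})."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:             # pure node -> leaf
        return labels[0]
    if not attrs:                         # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda c: gain(rows, c))
    rest = [c for c in attrs if c != best]
    return (ATTR_NAMES[best], {v: id3([r for r in rows if r[best] == v], rest)
                               for v in sorted({r[best] for r in rows})})

tree = id3(records, [1, 2, 3, 4])
print(tree)   # EDUCATION at the root, AGE under the College branch
```

Running this reproduces the hand-built tree: EDUCATION at the root, HS and GRAD branches as pure leaves, and AGE splitting the COLLEGE branch.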