In: Statistics and Probability
Answer/discuss briefly the following questions.
(a) What is the motivation for using standardized lift in association rule learning?
(b) Using your own words, describe the appropriate process for evaluating and comparing multiple classifiers.
(c) Discuss and compare advantages and disadvantages of classification trees, random forests and boosting.
Answer
a) The lift of an association rule denotes the importance of the rule. It is a major performance indicator, comparing the target model's ability to predict or classify data points against a random-choice baseline.
Standardising lift is a technique that takes into account the range of values lift can attain, which increases the effectiveness of ranking association rules. Ranking rules by standardised lift scores each rule by the relative position of its lift between the minimum and maximum values that lift could take given the rule's supports. This yields a natural and absolute method of ranking association rules on a common scale.
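The idea can be sketched in Python. This is a minimal illustration, not an implementation from the text: the supports used at the bottom are assumed values, and the bounds follow from the fact that the joint support P(A and B) is constrained by the marginal supports.

```python
# Sketch: lift vs. standardised lift for a single rule A -> B.
# All supports below are illustrative assumptions, not data from the text.

def lift(p_a: float, p_b: float, p_ab: float) -> float:
    """Lift = P(A and B) / (P(A) * P(B))."""
    return p_ab / (p_a * p_b)

def standardised_lift(p_a: float, p_b: float, p_ab: float) -> float:
    """Rescale lift to [0, 1] using its attainable bounds.

    Given marginal supports P(A) and P(B), the joint support is bounded by
    max(P(A) + P(B) - 1, 0) below and min(P(A), P(B)) above, so lift itself
    has a minimum and maximum attainable value for this rule.
    """
    lo = max(p_a + p_b - 1.0, 0.0) / (p_a * p_b)
    hi = min(p_a, p_b) / (p_a * p_b)
    return (lift(p_a, p_b, p_ab) - lo) / (hi - lo)

# Example with assumed supports P(A)=0.4, P(B)=0.5, P(A and B)=0.3:
print(lift(0.4, 0.5, 0.3))               # -> 1.5 (raw lift)
print(standardised_lift(0.4, 0.5, 0.3))  # -> 0.75 (position in attainable range)
```

Two rules with the same raw lift can sit at very different positions within their attainable ranges, which is exactly what the standardised version exposes.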
b) Common processes for evaluating multiple classifiers include the confusion matrix, the ROC curve, the area under the ROC curve (AUC), and concordance/discordance statistics. The most fundamental and widely used of these is the confusion matrix; its basic layout is shown below.
|               | Truth: +             | Truth: −             |
|---------------|----------------------|----------------------|
| Predicted: +  | True Positives (TP)  | False Positives (FP) |
| Predicted: −  | False Negatives (FN) | True Negatives (TN)  |
Many metrics can be calculated from the confusion matrix:
1) Sensitivity / True Positive Rate
2) Specificity / True Negative Rate
3) Precision / Positive Predictive Value
4) Negative Predictive Value
5) Many others, such as the False Discovery Rate and False Omission Rate
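The metrics above can be sketched as simple ratios of the four confusion-matrix cells. The counts in the example call are hypothetical, chosen only to illustrate the calculations:

```python
# Sketch: classifier metrics derived from a confusion matrix.
# The counts used below are hypothetical, for illustration only.

def confusion_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common classifier metrics from the four confusion-matrix cells."""
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),   # positive predictive value
        "npv":         tn / (tn + fn),   # negative predictive value
        "fdr":         fp / (fp + tp),   # false discovery rate
        "for":         fn / (fn + tn),   # false omission rate
        "accuracy":    (tp + tn) / (tp + fp + fn + tn),
    }

metrics = confusion_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# e.g. sensitivity: 0.800, specificity: 0.900, accuracy: 0.850
```

Comparing classifiers on these metrics is only meaningful when each model is evaluated on the same held-out data, e.g. via a shared train/test split or cross-validation.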
c) The comparison is summarised in the table below.

| Method | Advantages | Disadvantages |
|---|---|---|
| Classification tree | Does not require normalisation or scaling of the data; requires less effort at the data-preparation stage; self-explanatory and easily explained to technical teams and business leads | Calculations can become complex compared with other algorithms; high complexity can make it expensive and time-consuming; not a good technique for regression or predicting continuous output |
| Random forest | Prediction performance is among the best of the supervised learning techniques; can be used for both classification and regression problems; robust to outliers | A random forest model is inherently less interpretable than a single classification tree; high computational cost; training a large number of trees can make prediction slow |
| Boosting | Curbs over-fitting of the data; the algorithm is easy to interpret | Sensitive to outliers; time- and computationally expensive |
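The three methods can be compared empirically with the evaluation process from part (b). The sketch below assumes scikit-learn is available and uses the built-in iris dataset with default hyperparameters purely for illustration; it scores each model with the same 5-fold cross-validation so the comparison is like-for-like:

```python
# Sketch: comparing a classification tree, a random forest, and boosting
# with shared 5-fold cross-validation (assumes scikit-learn is installed;
# the dataset and hyperparameters are illustrative, not from the text).
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "classification tree": DecisionTreeClassifier(random_state=0),
    "random forest":       RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting":            GradientBoostingClassifier(random_state=0),
}

# Same CV splits for every model, so accuracy differences reflect the
# methods rather than the data partitioning.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

On a real problem, the cross-validated scores (and, per part (b), the confusion-matrix metrics on a held-out set) are what justify choosing one of the three methods over the others.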