In: Computer Science
Machine learning AdaBoost question:
The AdaBoost algorithm has two drawbacks. Answer the following
questions regarding these.
(I) Show mathematically why a weak learner with < 50% predictive
accuracy presents a problem to AdaBoost.
(II) AdaBoost is susceptible to outliers. Suggest a simple
heuristic that may alleviate this.
(I)
AdaBoost, short for Adaptive Boosting, is a machine learning approach that is conceptually straightforward but less simple to grasp mathematically. AdaBoost has one main disadvantage: it is sensitive to noisy data and outliers. The core idea is easy to follow, but once we try to dig a little deeper to understand the math that supports it, we are confronted with many articles and lectures presenting notation like the following:
Boosting: combining many weak (simple) learners to create a highly accurate prediction.
Weak learners: classifiers that produce predictions only slightly better than random guessing. Random guessing is equivalent to 50% accuracy, like flipping a coin. This will be familiar to those conversant with information theory, particularly the idea of Shannon's entropy.
AdaBoost: the first practical boosting algorithm, invented by Freund and Schapire (1995). It is based on Vapnik and Chervonenkis' idea that for a trained classifier to be effective and accurate in its predictions, it should meet three conditions:
1) classifier should be trained on “enough” training examples
2) it should provide a good fit to these examples by producing low training error
3) it should be simple (in that simpler models are better than overly complex ones)
1) Given (x_1, y_1), …, (x_m, y_m) where x_i ∈ X, y_i ∈ {-1, +1}
Useful Notations
∈: "component of"
{}: set
ex: if A = {1,2,3,7}, 2 ∈ A
(x_1, y_1): the first training sample, (x_m, y_m): the m-th training sample
Now that we have the notation down, we can read the first part of the formula as:
"Given a training set containing m samples, where the inputs x are elements of the overall set X and the outputs y are elements of a set containing only two values, -1 (the negative class) and +1 (the positive class)…"
2) Initialize: D_1(i) = 1/m for i = 1, …, m.
Here, D = the weights of the samples and i = the i-th training sample. In other papers, D is sometimes written as W. Hence the expression reads:
"…initialize all the weights of your samples to 1 divided by the number of training samples…"
3) For t = 1, …, T:
* Train a weak learner using distribution D_t.
* Get a weak hypothesis h_t: X → {-1, +1}.
* Aim: select the h_t with low weighted error:
ε_t = Pr_{i~D_t}[h_t(x_i) ≠ y_i]
* Choose α_t = 1/2 * ln((1 - ε_t) / ε_t)
* Update, for i = 1, …, m:
D_{t+1}(i) = D_t(i) * exp(-α_t * y_i * h_t(x_i)) / Z_t
Useful Notations
Pr = probability
h_t = hypothesis/classifier
ε_t = weighted misclassification error of the model
α_t = weight assigned to the classifier
exp = Euler's number e ≈ 2.71828
Z_t = normalization factor, used to ensure that the weights represent a true probability distribution (i.e., sum to 1)
With this notation in hand, we can read the next part as:
"For t = 1 to T classifiers, fit each one to the training data (where every prediction is either -1 or +1) and select the classifier with the lowest weighted classification error."
The formula to compute ε can be written out as:
ε = Σ_i w_i * [y_i ≠ h_j(x_i)] / Σ_i w_i
Let's break this formula down.
Useful Notations
Σ = sum
[y_i ≠ h_j(x_i)] = 1 if misclassified and 0 if correctly classified (an indicator function)
w_i = weight
Thus, the formula reads: "Error equals the weighted misclassification rate: for each training sample i, take its weight w_i times the indicator [y_i ≠ h_j(x_i)] (1 if misclassified, 0 if correctly classified), sum over all samples, and divide by the sum of the weights."
Let us apply simple math to make sense of the formula. Consider having 4 different samples with weights 0.5, 0.2, 0.1, and 0.04. Imagine our classifier h predicted values 1, 1, -1, and -1, but the actual output values y were -1, 1, -1, 1.
predicted: 1 1 -1 -1
actual: -1 1 -1 1
weights: 0.5 0.2 0.1 0.04
1 or 0: 1 0 0 1
This leads to the following calculation for the misclassification rate:
misclassification rate / error = (0.5*1 + 0.2*0 + 0.1*0 + 0.04*1) / (0.5 + 0.2 + 0.1 + 0.04)
error = 0.64285714285
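This arithmetic is easy to verify in Python (NumPy assumed):

import numpy as np

w = np.array([0.5, 0.2, 0.1, 0.04])    # sample weights
pred = np.array([1, 1, -1, -1])        # classifier predictions
actual = np.array([-1, 1, -1, 1])      # true labels
error = np.sum(w * (pred != actual)) / w.sum()
print(error)                           # 0.6428571428571429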
Next, we choose the weight for the classifier, α, using the formula α = 1/2 * ln((1 - error) / error).
Simple math might explain better than words could here. Assume for instance, that we have errors 0.30, 0.70, 0.5.
Our classifier weights would be calculated as follows:
ε = 0.3
α = 1/2 * ln((1 - 0.3) / 0.3) = 0.42365
ε = 0.7
α = 1/2 * ln((1 - 0.7) / 0.7) = -0.42365
ε = 0.5
α = 1/2 * ln((1 - 0.5) / 0.5) = 0
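The same three values can be reproduced in a couple of lines:

import math

for eps in (0.3, 0.7, 0.5):
    alpha = 0.5 * math.log((1 - eps) / eps)
    print(eps, round(alpha, 5))  # 0.3 -> 0.42365, 0.7 -> -0.42365, 0.5 -> 0.0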
Notice three interesting observations: 1) a classifier with accuracy higher than 50% results in a positive classifier weight (that is, α > 0 if ε < 0.5); 2) a classifier with exactly 50% accuracy gets α = 0 and thus contributes nothing to the final prediction; and 3) the errors 0.3 and 0.7 lead to classifier weights of equal magnitude but opposite sign. This is precisely why a weak learner with < 50% accuracy (ε > 0.5) presents a problem for AdaBoost: its α is negative, so the exponent in the update rule D_{t+1}(i) = D_t(i) * exp(-α * y_i * h_t(x_i)) / Z_t flips sign. The weights of misclassified samples are then decreased and the weights of correctly classified samples are increased, which is exactly the opposite of what boosting intends; equivalently, a negative α means the final vote treats the learner's predictions as if they were inverted.
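A small toy demonstration of this sign flip (numbers chosen purely for illustration):

import numpy as np

y = np.array([1, 1, -1, -1])           # true labels
h = np.array([-1, -1, 1, -1])          # weak learner wrong on 3 of 4 samples: ε = 0.75 > 0.5
D = np.full(4, 0.25)                   # uniform starting weights
eps = np.sum(D * (h != y))             # 0.75
alpha = 0.5 * np.log((1 - eps) / eps)  # negative: about -0.5493
D_new = D * np.exp(-alpha * y * h)
D_new /= D_new.sum()
print(D_new)  # [0.1667 0.1667 0.1667 0.5]: misclassified samples shrink, the correct one grows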
(II)
AdaBoost can be sensitive to outliers / label noise because it fits a classification model (an additive model) to an exponential loss function, and the exponential loss function is sensitive to outliers / label noise.
Because a boosting technique learns progressively, it is important to ensure that you have quality data. AdaBoost is extremely sensitive to noisy data and outliers, so if you do plan to use AdaBoost, it is highly recommended to identify and remove them (or otherwise limit their influence) beforehand.
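One simple heuristic in this spirit is to cap each sample's weight every round, so that no single (likely noisy) point can dominate the distribution; a sketch, where the cap value is an assumption chosen for illustration:

import numpy as np

def cap_weights(D, cap=10.0):
    # Clip any weight above cap times the uniform weight 1/m, then renormalize.
    # Points that keep getting misclassified (often outliers) can no longer blow up.
    m = len(D)
    D = np.minimum(D, cap / m)
    return D / D.sum()

Applying this right after the weight update limits the influence an outlier can accumulate over rounds.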
To see why the exponential loss is the culprit: at each stage the algorithm adds another weighted model to the overall classifier it is learning, with the overall objective of minimizing the exponential loss of the combined classifier on the training data. Comparing the exponential loss with alternatives such as the 0-1 loss helps illustrate why it is a problem for data with outliers / label noise:
The issue is that the penalty for a misclassification grows exponentially with the magnitude of the predictive function's output.
E.g., if an instance x lies deep in the positive-class region but is actually labeled as negative (due to label noise or to being an outlier), an optimal classifier's prediction f(x) for this instance may be a large positive value; but since the actual label is negative, that prediction suffers an enormous loss/penalty because the penalty is exponentiated: exp(-f(x) * y). This means the optimal classifier might not be the one we arrive at with the algorithm, since the algorithm seeks the classifier that minimizes the total exponential loss, and so this one outlier/mislabeled point can end up distorting the final model that is learned.
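Plugging in numbers makes the blow-up concrete; here f(x) = 5 and y = -1 are made-up values for a confidently mis-predicted, mislabeled point:

import math

f_x, y = 5.0, -1
exp_loss = math.exp(-f_x * y)         # e^5 ≈ 148.4: this single point dominates the objective
zero_one = 1 if f_x * y <= 0 else 0   # a fixed penalty of 1 for the same mistake
print(exp_loss, zero_one)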
Ideally, we might want to penalize a training instance with only a fixed value when it is misclassified (the 0-1 loss), instead of penalizing it exponentially more for a greater magnitude of mis-prediction; however, minimizing the 0-1 loss is generally harder because it typically results in non-convex optimization problems. There have been several papers on using alternative loss functions with boosting that result in less sensitivity to outliers and noise, such as SavageBoost.