Saturday, April 6, 2013

Why class balancing happens 'automatically' in AdaBoost

This is a very interesting characteristic of AdaBoost, that explains why, in general, it is not necessary to balance the initial weights during training for very skewed datasets.

I will use the notation from the AdaBoost Wikipedia article.

First, assume that we have $N_p$ positive samples and $N_n$ negative samples, with $N_n >> N_p$, representing a skewed dataset. Let's name the ratio between the number of negative and positive samples as $K = \frac{N_n}{N_p}$.

Now, assume that we are at the first iteration, and we have uniform weights $D_{t=1}(i) = \frac{1}{m}$ for all training samples. At the first iteration $t=1$ the goal is to find a weak learner

$h_{t} = \underset{h_{t} \in \mathcal{H}}{\operatorname{argmax}} \; \left\vert 0.5 - \epsilon_{t}\right\vert$, with $\epsilon_{t} = \sum_{i=1}^{m} D_{t}(i)I(y_i \ne h_{t}(x_{i}))$

since all weights are equal, $\epsilon_t$ can be expressed as $\epsilon_{t} = \sum_{i=1}^{m} I(y_i \ne h_{t}(x_{i}))$, so it is just the unweighted misclassification error.

Now it comes the interesting part: assume that the weak learners in the weak learner pool are very weak, so that the lowest misclassification error $\epsilon_t$ is achieved with a weak learner that classifies all samples as negatives. (This seems unreasonable, but I have seen it happen frequently on skewed datasets).

With this in mind, $\epsilon_1 = \frac{N_p}{N_p + N_n}$. With simple math we can compute the value of $\alpha_1=\frac{1}{2} \log \frac{1 - \epsilon_1}{\epsilon_1}$ and update the distribution $D$ for iteration $t=2$, obtaining
$$\frac{D_{t=2}(+)}{D_{t=2}(-)} = \frac{N_n}{N_p} = K$$

where $D_{t=2}(+)$ and $D_{t=2}(-)$ are the weight of the positive and negative samples respectively.

This means that the effect of the first weak learner was to balance the dataset, so that the sum of the positive weights equal the sum of the negative onesAnother equivalent interpretation is that the first weak learner just adds a constant to the predictive function, namely $-\alpha_1$ in this particular case.

However, there are some differences w.r.t. balancing the weights initially:
  • The first weak learner is 'wasted' and could be avoided.
  • If shrinkage is introduced, it is likely that many weak learners will be 'wasted'. Though 'wasted' is not very meaningful, because they would still be useful and help minimize training loss.
  • The fact that the first weak learner returns -1 for all training samples does not imply that it will also for any unseen new sample. The implications of this are hard to tell though.
  • I am assuming that a weak learner can return the same value for all training samples, which is the case for a decision stump, but it may not happen with other types of weak learners (e.g. svm)
  • .. I am pretty sure there are other differences