I will use the notation from the AdaBoost Wikipedia article.

First, assume that we have $N_p$ positive samples and $N_n$ negative samples, with $N_n >> N_p$, representing a skewed dataset. Let's name the ratio between the number of negative and positive samples as $K = \frac{N_n}{N_p}$.

Now, assume that we are at the first iteration, and we have uniform weights $D_{t=1}(i) = \frac{1}{m}$ for all training samples. At the first iteration $t=1$ the goal is to find a weak learner

$h_{t} = \underset{h_{t} \in \mathcal{H}}{\operatorname{argmax}} \; \left\vert 0.5 - \epsilon_{t}\right\vert$, with $\epsilon_{t} = \sum_{i=1}^{m} D_{t}(i)I(y_i \ne h_{t}(x_{i}))$

since all weights are equal, $\epsilon_t$ can be expressed as $\epsilon_{t} = \sum_{i=1}^{m} I(y_i \ne h_{t}(x_{i}))$, so it is just the unweighted misclassification error.

Now it comes the interesting part: assume that the weak learners in the weak learner pool are very weak, so that

*the lowest misclassification error*$\epsilon_t$

*is achieved with a weak learner that classifies all samples as negatives*. (This seems unreasonable, but I have seen it happen frequently on skewed datasets).

With this in mind, $\epsilon_1 = \frac{N_p}{N_p + N_n}$. With simple math we can compute the value of $\alpha_1=\frac{1}{2} \log \frac{1 - \epsilon_1}{\epsilon_1}$ and update the distribution $D$ for iteration $t=2$, obtaining

$$\frac{D_{t=2}(+)}{D_{t=2}(-)} = \frac{N_n}{N_p} = K$$

where $D_{t=2}(+)$ and $D_{t=2}(-)$ are the weight of the positive and negative samples respectively.

This means that

**the effect of the first weak learner was to balance the dataset, so that the sum of the positive weights equal the sum of the negative ones**. Another equivalent interpretation is that the first weak learner just adds a constant to the predictive function, namely $-\alpha_1$ in this particular case.
However, there are some differences w.r.t. balancing the weights initially:

- The first weak learner is 'wasted' and could be avoided.
- If shrinkage is introduced, it is likely that many weak learners will be 'wasted'. Though 'wasted' is not very meaningful, because they would still be useful and help minimize training loss.
- The fact that the first weak learner returns -1 for all training samples does not imply that it will also for any unseen new sample. The implications of this are hard to tell though.
- I am assuming that a weak learner can return the same value for all training samples, which is the case for a decision stump, but it may not happen with other types of weak learners (e.g. svm)
- .. I am pretty sure there are other differences

Cheers!

## No comments:

## Post a Comment