There is a very nice wrapper for libconfig for python, right here https://github.com/keimoon/python-libconfig
(This is connected to the previous post on libconfig here)
Wednesday, July 17, 2013
Saturday, April 6, 2013
Why class balancing happens 'automatically' in AdaBoost
This is a very interesting characteristic of AdaBoost, that explains why, in general, it is not necessary to balance the initial weights during training for very skewed datasets.
I will use the notation from the AdaBoost Wikipedia article.
First, assume that we have $N_p$ positive samples and $N_n$ negative samples, with $N_n >> N_p$, representing a skewed dataset. Let's name the ratio between the number of negative and positive samples as $K = \frac{N_n}{N_p}$.
Now, assume that we are at the first iteration, and we have uniform weights $D_{t=1}(i) = \frac{1}{m}$ for all training samples. At the first iteration $t=1$ the goal is to find a weak learner
since all weights are equal, $\epsilon_t$ can be expressed as $\epsilon_{t} = \sum_{i=1}^{m} I(y_i \ne h_{t}(x_{i}))$, so it is just the unweighted misclassification error.
Now it comes the interesting part: assume that the weak learners in the weak learner pool are very weak, so that the lowest misclassification error $\epsilon_t$ is achieved with a weak learner that classifies all samples as negatives. (This seems unreasonable, but I have seen it happen frequently on skewed datasets).
With this in mind, $\epsilon_1 = \frac{N_p}{N_p + N_n}$. With simple math we can compute the value of $\alpha_1=\frac{1}{2} \log \frac{1 - \epsilon_1}{\epsilon_1}$ and update the distribution $D$ for iteration $t=2$, obtaining
I will use the notation from the AdaBoost Wikipedia article.
First, assume that we have $N_p$ positive samples and $N_n$ negative samples, with $N_n >> N_p$, representing a skewed dataset. Let's name the ratio between the number of negative and positive samples as $K = \frac{N_n}{N_p}$.
Now, assume that we are at the first iteration, and we have uniform weights $D_{t=1}(i) = \frac{1}{m}$ for all training samples. At the first iteration $t=1$ the goal is to find a weak learner
$h_{t} = \underset{h_{t} \in \mathcal{H}}{\operatorname{argmax}} \; \left\vert 0.5 - \epsilon_{t}\right\vert$, with $\epsilon_{t} = \sum_{i=1}^{m} D_{t}(i)I(y_i \ne h_{t}(x_{i}))$
since all weights are equal, $\epsilon_t$ can be expressed as $\epsilon_{t} = \sum_{i=1}^{m} I(y_i \ne h_{t}(x_{i}))$, so it is just the unweighted misclassification error.
Now it comes the interesting part: assume that the weak learners in the weak learner pool are very weak, so that the lowest misclassification error $\epsilon_t$ is achieved with a weak learner that classifies all samples as negatives. (This seems unreasonable, but I have seen it happen frequently on skewed datasets).
With this in mind, $\epsilon_1 = \frac{N_p}{N_p + N_n}$. With simple math we can compute the value of $\alpha_1=\frac{1}{2} \log \frac{1 - \epsilon_1}{\epsilon_1}$ and update the distribution $D$ for iteration $t=2$, obtaining
$$\frac{D_{t=2}(+)}{D_{t=2}(-)} = \frac{N_n}{N_p} = K$$
where $D_{t=2}(+)$ and $D_{t=2}(-)$ are the weight of the positive and negative samples respectively.
This means that the effect of the first weak learner was to balance the dataset, so that the sum of the positive weights equal the sum of the negative ones. Another equivalent interpretation is that the first weak learner just adds a constant to the predictive function, namely $-\alpha_1$ in this particular case.
However, there are some differences w.r.t. balancing the weights initially:
- The first weak learner is 'wasted' and could be avoided.
- If shrinkage is introduced, it is likely that many weak learners will be 'wasted'. Though 'wasted' is not very meaningful, because they would still be useful and help minimize training loss.
- The fact that the first weak learner returns -1 for all training samples does not imply that it will also for any unseen new sample. The implications of this are hard to tell though.
- I am assuming that a weak learner can return the same value for all training samples, which is the case for a decision stump, but it may not happen with other types of weak learners (e.g. svm)
- .. I am pretty sure there are other differences
Cheers!
Tuesday, February 19, 2013
C++11: compile-time lookup-table/array with constexpr
I've been trying a few of the new features of C++11, wondering how difficult it would be to create a compile-time self-populated array.
Here there is a piece of code that creates an array that contains the values of sin(0.2 x) + cos(0.5 x), completely at compile-time, and then prints it in ascii form to stdout. The most tricky part is the one containing a variadic template, which computes a set of indices from 0 to N-1.
You can see the output it produces here
Here there is a piece of code that creates an array that contains the values of sin(0.2 x) + cos(0.5 x), completely at compile-time, and then prints it in ascii form to stdout. The most tricky part is the one containing a variadic template, which computes a set of indices from 0 to N-1.
You can see the output it produces here
Subscribe to:
Posts (Atom)