Overview
Active learning query strategies were implemented by the library ALiPy (documentation).
We decided to currently integrate query strategies :
- for instance selection
- which do not require more than the developed model
Strategies currently available
Random sampling
Base method used to compare with the performance of other query strategies. It just selects the required number of queries in a random manner.
Uncertainty sampling
The uncertainty sampling is the way the learner will select queries for whom it is the most uncertain in their label. This uncertainty can be measured in different ways.
Let’s define \(x\) as an instance, \(\hat{y_i}\) corresponds to the \(i\)th most likely class predicted for the instance \(x\).
Least confident
The simplest measure calculate the difference between 100% confidence and the most confident prediction obtained for the instance \(x\) \(LC(x) = 1 - P(\hat{y_1}|x)\)
Margin
Calculates the difference between the top two most confident predictions.
\[M(x) = P(\hat{y_1}|x) - P(\hat{y_2}|x)\]In that case, the strategy will select the instance with the smallest margin, since the smaller the margin is, the most unsure the decision.
Entropy
Calculates the difference between all the prediction, as defined by information theory.
\[H(x) = -\sum_{k}P(y_k|x)\log(P(y_k|x))\]Coreset
A diversity-based method where the unlabelled instances are selected so that they are the most different from the training set, but also the most diverse among themselves. Based on the paper of Sener (2018).