In multi-label learning, each object is represented by a single instance while associated with a set of class labels. Due to the huge (exponential) number of possible label sets for prediction, existing approaches mainly focus on how to exploit label correlations to facilitate the learning process. Nevertheless, an intrinsic characteristic of learning from multi-label data, i.e. the widely-existing class-imbalance among labels, has not been well investigated. Generally, the number of positive training instances w.r.t. each class label is far less than its negative counterparts, which may lead to performance degradation for most multi-label learning techniques. In this paper, a new multi-label learning approach named Cross-Coupling Aggregation (COCOA) is proposed, which aims at leveraging the exploitation of label correlations as well as the exploration of class-imbalance. Briefly, to induce the predictive model on each class label, one binary-class imbalance learner corresponding to the current label and several multi-class imbalance learners coupling with other labels are aggregated for prediction. Extensive experiments clearly validate the effectiveness of the proposed approach, especially in terms of imbalance-specific evaluation metrics such as F-measure and area under the ROC curve.
Being able to model correlations between labels is considered crucial in multi-label classification. Rule-based models enable to expose such dependencies, e.g., implications, subsumptions, or exclusions, in an interpretable and human-comprehensible manner. Albeit the number of possible label combinations increases exponentially with the number of available labels, it has been shown that rules with multiple labels in their heads, which are a natural form to model local label dependencies, can be induced efficiently by exploiting certain properties of rule evaluation measures and pruning the label search space accordingly. However, experiments have revealed that multi-label heads are unlikely to be learned by existing methods due to their restrictiveness. To overcome this limitation, we propose a plug-in approach that relaxes the search space pruning used by existing methods in order to introduce a bias towards larger multi-label heads resulting in more expressive rules. We further demonstrate the effectiveness of our approach empirically and show that it does not come with drawbacks in terms of training time or predictive performance.
Recently, several authors have advocated the use of rule learning algorithms to model multi-label data, as rules are interpretable and can be comprehended, analyzed, or qualitatively evaluated by domain experts. Many rule learning algorithms employ a heuristic-guided search for rules that model regularities contained in the training data and it is commonly accepted that the choice of the heuristic has a significant impact on the predictive performance of the learner. Whereas the properties of rule learning heuristics have been studied in the realm of single-label classification, there is no such work taking into account the particularities of multi-label classification. This is surprising, as the quality of multi-label predictions is usually assessed in terms of a variety of different, potentially competing, performance measures that cannot all be optimized by a single learner at the same time. In this work, we show empirically that it is crucial to trade off the consistency and coverage of rules differently, depending on which multi-label measure should be optimized by a model. Based on these findings, we emphasize the need for configurable learners that can flexibly use different heuristics. As our experiments reveal, the choice of the heuristic is not straight-forward, because a search for rules that optimize a measure locally does usually not result in a model that maximizes that measure globally.
Exploiting dependencies between labels is considered to be crucial for multi-label classification. Rules are able to expose label dependencies such as implications, subsumptions or exclusions in a human-comprehensible and interpretable manner. However, the induction of rules with multiple labels in the head is particularly challenging, as the number of label combinations which must be taken into account for each rule grows exponentially with the number of available labels. To overcome this limitation, algorithms for exhaustive rule mining typically use properties such as anti-monotonicity or decomposability in order to prune the search space. In the present paper, we examine whether commonly used multi-label evaluation metrics satisfy these properties and therefore are suited to prune the search space for multi-label heads.
The objective in extreme multi-label learning is to train a classifier that can automatically tag a novel data point with the most relevant subset of labels from an extremely large label set. Embedding based approaches make training and prediction tractable by assuming that the training label matrix is low-rank and hence the effective number of labels can be reduced by projecting the high dimensional label vectors onto a low dimensional linear subspace. Still, leading embedding approaches have been unable to deliver high prediction accuracies or scale to large problems as the low rank assumption is violated in most real world applications. This paper develops the X-One classifier to address both limitations. The main technical contribution in X-One is a formulation for learning a small ensemble of local distance preserving embeddings which can accurately predict infrequently occurring (tail) labels. This allows X-One to break free of the traditional low-rank assumption and boost classification accuracy by learning embeddings which preserve pairwise distances between only the nearest label vectors. We conducted extensive experiments on several real-world as well as benchmark data sets and compared our method against state-of-the-art methods for extreme multi-label classification. Experiments reveal that X-One can make significantly more accurate predictions then the state-of-the-art methods including both embeddings (by as much as 35%) as well as trees (by as much as 6%). X-One can also scale efficiently to data sets with a million labels which are beyond the pale of leading embedding methods.