Accuracy
Vacant Holes for Unsupervised Detection of the Outliers in Compact Latent Representation
Glazunov, Misha, Zarras, Apostolis
Outlier detection is pivotal for any machine learning model deployed and operated in the real world. It is essential for deep neural networks, which have been shown to be overconfident on such inputs. Moreover, even deep generative models that allow estimation of the probability density of the input fail at this task. In this work, we concentrate on a specific type of these models: Variational Autoencoders (VAEs). First, we unveil a significant theoretical flaw in the assumption of the classical VAE model. Second, to alleviate the flaw, we enforce an accommodating topological property, compactness, on the image of the deep neural mapping to the latent space, which gives the means to provably bound the image within determined limits by squeezing both inliers and outliers together. We enforce compactness using two approaches: (i) the Alexandroff extension and (ii) a fixed Lipschitz continuity constant on the mapping of the encoder of the VAEs. Finally and most importantly, we discover that anomalous inputs predominantly tend to land on the vacant latent holes within the compact space, enabling their successful identification. For that reason, we introduce a specifically devised score for hole detection and evaluate the solution against several baseline benchmarks, achieving promising results.
Cross-Domain Toxic Spans Detection
Schouten, Stefan F., Barbarestani, Baran, Tufa, Wondimagegnhue, Vossen, Piek, Markov, Ilia
Given the dynamic nature of toxic language use, automated methods for detecting toxic spans are likely to encounter distributional shift. To explore this phenomenon, we evaluate three approaches for detecting toxic spans under cross-domain conditions: lexicon-based, rationale extraction, and fine-tuned language models. Our findings indicate that a simple method using off-the-shelf lexicons performs best in the cross-domain setup. The cross-domain error analysis suggests that (1) rationale extraction methods are prone to false negatives, while (2) language models, despite performing best for the in-domain case, recall fewer explicitly toxic words than lexicons and are prone to certain types of false positives. Our code is publicly available at: https://github.com/
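As a rough illustration of the lexicon-based approach mentioned in the abstract above, the following minimal sketch marks every token found in a word list as a toxic span. The lexicon contents, the tokenization rule, and the function name `toxic_spans` are assumptions made for illustration; they are not taken from the paper or its code.

```python
import re

# Hypothetical toy lexicon; an off-the-shelf lexicon would be far larger.
TOXIC_LEXICON = {"idiot", "stupid", "moron"}

def toxic_spans(text, lexicon=TOXIC_LEXICON):
    """Return (start, end) character offsets of tokens found in the lexicon."""
    spans = []
    # Treat runs of letters and apostrophes as tokens (a simplifying assumption).
    for match in re.finditer(r"[A-Za-z']+", text):
        if match.group().lower() in lexicon:
            spans.append((match.start(), match.end()))
    return spans

spans = toxic_spans("You are an idiot, truly Stupid.")
```

Because the lookup depends only on the word list and not on training data, a method of this kind has no in-domain data to overfit to, which is one plausible reading of why it transfers well across domains.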
HomoGCL: Rethinking Homophily in Graph Contrastive Learning
Li, Wen-Zhi, Wang, Chang-Dong, Xiong, Hui, Lai, Jian-Huang
Contrastive learning (CL) has become the de-facto learning paradigm in self-supervised learning on graphs, which generally follows the "augmenting-contrasting" learning scheme. However, we observe that unlike CL in the computer vision domain, CL in the graph domain performs decently even without augmentation. We conduct a systematic analysis of this phenomenon and argue that homophily, i.e., the principle that "like attracts like", plays a key role in the success of graph CL. Inspired to leverage this property explicitly, we propose HomoGCL, a model-agnostic framework to expand the positive set using neighbor nodes with neighbor-specific significances. Theoretically, HomoGCL introduces a stricter lower bound of the mutual information between raw node features and node embeddings in augmented views. Furthermore, HomoGCL can be combined with existing graph CL models in a plug-and-play way with light extra computational overhead. Extensive experiments demonstrate that HomoGCL yields multiple state-of-the-art results across six public datasets and consistently brings notable performance improvements when applied to various graph CL methods. Code is available at https://github.com/wenzhilics/HomoGCL.
OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection
Zhang, Jingyang, Yang, Jingkang, Wang, Pengyun, Wang, Haoqi, Lin, Yueqian, Zhang, Haoran, Sun, Yiyou, Du, Xuefeng, Zhou, Kaiyang, Zhang, Wayne, Li, Yixuan, Liu, Ziwei, Chen, Yiran, Li, Hai
Out-of-Distribution (OOD) detection is critical for the reliable operation of open-world intelligent systems. Despite the emergence of an increasing number of OOD detection methods, the evaluation inconsistencies present challenges for tracking the progress in this field. OpenOOD v1 initiated the unification of the OOD detection evaluation but faced limitations in scalability and usability. In response, this paper presents OpenOOD v1.5, a significant improvement from its predecessor that ensures accurate, standardized, and user-friendly evaluation of OOD detection methodologies. Notably, OpenOOD v1.5 extends its evaluation capabilities to large-scale datasets such as ImageNet, investigates full-spectrum OOD detection which is important yet underexplored, and introduces new features including an online leaderboard and an easy-to-use evaluator. This work also contributes in-depth analysis and insights derived from comprehensive experimental results, thereby enriching the knowledge pool of OOD detection methodologies. With these enhancements, OpenOOD v1.5 aims to drive advancements and offer a more robust and comprehensive evaluation benchmark for OOD detection research.
Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language
Seidl, Philipp, Vall, Andreu, Hochreiter, Sepp, Klambauer, Günter
Activity and property prediction models are the central workhorses in drug discovery and materials sciences, but currently they have to be trained or fine-tuned for new tasks. Without training or fine-tuning, scientific language models could be used for such low-data tasks through their announced zero- and few-shot capabilities. However, their predictive quality at activity prediction is lacking. In this work, we envision a novel type of activity prediction model that is able to adapt to new prediction tasks at inference time, via understanding textual information describing the task. To this end, we propose a new architecture with separate modules for chemical and natural language inputs, and a contrastive pre-training objective on data from large biochemical databases. In extensive experiments, we show that our method CLAMP yields improved predictive performance on few-shot learning benchmarks and zero-shot problems in drug discovery. We attribute the advances of our method to the modularized architecture and to our pre-training objective.
Revisiting DocRED -- Addressing the False Negative Problem in Relation Extraction
Tan, Qingyu, Xu, Lu, Bing, Lidong, Ng, Hwee Tou, Aljunied, Sharifah Mahani
The DocRED dataset is one of the most popular and widely used benchmarks for document-level relation extraction (RE). It adopts a recommend-revise annotation scheme so as to have a large-scale annotated dataset. However, we find that the annotation of DocRED is incomplete, i.e., false negative samples are prevalent. We analyze the causes and effects of the overwhelming false negative problem in the DocRED dataset. To address the shortcoming, we re-annotate 4,053 documents in the DocRED dataset by adding the missed relation triples back to the original DocRED. We name our revised DocRED dataset Re-DocRED. We conduct extensive experiments with state-of-the-art neural models on both datasets, and the experimental results show that the models trained and evaluated on our Re-DocRED achieve performance improvements of around 13 F1 points. Moreover, we conduct a comprehensive analysis to identify the potential areas for further improvement. Our dataset is publicly available at https://github.com/tonytan48/Re-DocRED.
Bootstrap aggregation and confidence measures to improve time series causal discovery
Debeire, Kevin, Runge, Jakob, Gerhardus, Andreas, Eyring, Veronika
Causal discovery methods have demonstrated the ability to identify the time series graphs representing the causal temporal dependency structure of dynamical systems. However, they do not include a measure of the confidence of the estimated links. Here, we introduce a novel bootstrap aggregation (bagging) and confidence measure method that is combined with time series causal discovery. This new method allows measuring confidence for the links of the time series graphs calculated by causal discovery methods. This is done by bootstrapping the original time series data set while preserving temporal dependencies. In addition to providing confidence measures, aggregating the bootstrapped graphs by majority voting yields a final aggregated output graph. In this work, we combine our approach with the state-of-the-art conditional-independence-based algorithm PCMCI+. With extensive numerical experiments we empirically demonstrate that, in addition to providing confidence measures for links, Bagged-PCMCI+ improves the precision and recall of its base algorithm PCMCI+. Specifically, Bagged-PCMCI+ has a higher detection power regarding adjacencies and a higher precision in orienting contemporaneous edges while at the same time showing a lower rate of false positives. These performance improvements are especially pronounced in the more challenging settings (short time sample size, large number of variables, high autocorrelation). Our bootstrap approach can also be combined with other time series causal discovery algorithms and can be of considerable use in many real-world applications, especially when confidence measures for the links are desired.
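The aggregation step described in the abstract above can be sketched as follows. This is a simplified illustration, assuming each bootstrap replica has already been turned into a binary adjacency matrix by some causal discovery method; the function name `aggregate_bootstrap_graphs` is hypothetical, and the sketch ignores link orientation types and time lags, which the actual method would need to handle.

```python
import numpy as np

def aggregate_bootstrap_graphs(graphs):
    """Aggregate binary adjacency matrices estimated on bootstrap replicas.

    Returns the per-link frequency across replicas (a confidence value in
    [0, 1]) and the majority-vote graph, which keeps a link only if it
    appears in more than half of the replicas.
    """
    stacked = np.asarray(graphs, dtype=float)  # shape: (n_replicas, n_vars, n_vars)
    confidence = stacked.mean(axis=0)
    majority = (confidence > 0.5).astype(int)
    return confidence, majority

# Three toy replicas over two variables (illustrative values only).
replicas = [
    [[0, 1], [0, 0]],
    [[0, 1], [1, 0]],
    [[0, 0], [0, 0]],
]
confidence, majority = aggregate_bootstrap_graphs(replicas)
```

In this toy case the link from variable 0 to variable 1 appears in two of three replicas and survives the vote, while the reverse link appears only once and is dropped.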
A Hybrid Feature Selection and Construction Method for Detection of Wind Turbine Generator Heating Faults
Kavaz, Ayse Gokcen, Barutcu, Burak
Preprocessing of information is an essential step for the effective design of machine learning applications. Feature construction and selection are powerful techniques used for this aim. In this paper, a feature selection and construction approach is presented for the detection of wind turbine generator heating faults. Data were collected from the Supervisory Control and Data Acquisition (SCADA) system of a wind turbine. The original features directly collected from the data collection system consist of wind characteristics, operational data, temperature measurements and status information. In addition to these original features, new features were created in the feature construction step to obtain information that can be a more powerful indication of the faults. After the construction of new features, a hybrid feature selection technique was implemented to find the most relevant features in the overall set, in order to increase the classification accuracy and decrease the computational burden. The feature selection step consists of filter-based and wrapper-based parts. Filter-based feature selection was applied to exclude the features that are non-discriminative, and the wrapper-based method was used to determine the final features considering the redundancies and mutual relations amongst them. Artificial Neural Networks were used both in the detection phase and as the induction algorithm of the wrapper-based feature selection part. The results show that the proposed approach makes the fault detection system more reliable, especially in terms of reducing the number of false fault alarms.
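The two-stage selection described in the abstract above can be sketched roughly as below. This is a minimal sketch under stated assumptions: the filter criterion (absolute correlation with the label), the threshold `min_abs_corr`, and the nearest-centroid classifier used as the induction algorithm are stand-ins chosen to keep the example short. The paper uses Artificial Neural Networks in the wrapper, and its filter criterion is not given in the abstract.

```python
import numpy as np

def filter_step(X, y, min_abs_corr=0.1):
    """Filter: keep features whose absolute correlation with the label is high enough."""
    keep = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.std() == 0:
            continue  # constant features cannot discriminate
        if abs(np.corrcoef(col, y)[0, 1]) >= min_abs_corr:
            keep.append(j)
    return keep

def centroid_accuracy(X, y):
    """Training accuracy of a nearest-centroid classifier (stand-in induction algorithm)."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dists.argmin(axis=1)] == y).mean())

def wrapper_step(X, y, candidates):
    """Wrapper: greedy forward selection; add a feature only if accuracy improves."""
    selected, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for j in candidates:
            if j in selected:
                continue
            score = centroid_accuracy(X[:, selected + [j]], y)
            if score > best:
                best, best_j, improved = score, j, True
        if improved:
            selected.append(best_j)
    return selected, best

# Toy data: feature 0 is informative, feature 1 is constant, feature 2 duplicates feature 0.
y = np.array([0, 0, 0, 1, 1, 1])
f0 = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
X = np.column_stack([f0, np.ones(6), f0])
kept = filter_step(X, y)
selected, accuracy = wrapper_step(X, y, kept)
```

On the toy data the filter discards the constant feature, and the wrapper then rejects the duplicate because adding it does not improve accuracy, which mirrors the division of labour between the two parts.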
AQuA: A Benchmarking Tool for Label Quality Assessment
Goswami, Mononito, Sanil, Vedant, Choudhry, Arjun, Srinivasan, Arvind, Udompanyawit, Chalisa, Dubrawski, Artur
Machine learning (ML) models are only as good as the data they are trained on. But recent studies have found datasets widely used to train and evaluate ML models, e.g. ImageNet, to have pervasive labeling errors. Erroneous labels on the train set hurt ML models' ability to generalize, and they impact evaluation and model selection using the test set. Consequently, learning in the presence of labeling errors is an active area of research, yet this field lacks a comprehensive benchmark to evaluate these methods. Most of these methods are evaluated on a few computer vision datasets with significant variance in the experimental protocols. With such a large pool of methods and inconsistent evaluation, it is also unclear how ML practitioners can choose the right models to assess label quality in their data. To this end, we propose a benchmarking environment AQuA to rigorously evaluate methods that enable machine learning in the presence of label noise. We also introduce a design space to delineate concrete design choices of label error detection models. We hope that our proposed design space and benchmark enable practitioners to choose the right tools to improve their label quality and that our benchmark enables objective and rigorous evaluation of machine learning tools facing mislabeled data.
Hierarchical confusion matrix for classification performance evaluation
Riehl, Kevin, Neunteufel, Michael, Hemberg, Martin
In this work we propose a novel concept of a hierarchical confusion matrix, opening the door for popular confusion-matrix-based (flat) evaluation measures from binary classification problems, while considering the peculiarities of hierarchical classification problems. We develop the concept to a generalized form and prove its applicability to all types of hierarchical classification problems, including directed acyclic graphs, multi-path labelling, and non-mandatory leaf node prediction. Finally, we use measures based on the novel confusion matrix to evaluate models within a benchmark for three real-world hierarchical classification applications and compare the results to established evaluation measures. The results demonstrate the reasonableness of this approach and its usefulness for evaluating hierarchical classification problems. The implementation of the hierarchical confusion matrix is available on GitHub.