Collaborating Authors

 Ritzert, Martin


Hierarchical clustering with maximum density paths and mixture models

arXiv.org Machine Learning

Hierarchical clustering is an effective and interpretable technique for analyzing structure in data, offering a nuanced understanding by revealing insights at multiple scales and resolutions. It is particularly helpful in settings where the exact number of clusters is unknown, and provides a robust framework for exploring complex datasets. Additionally, hierarchical clustering can uncover inner structures within clusters, capturing subtle relationships and nested patterns that may be obscured by traditional flat clustering methods. However, existing hierarchical clustering methods struggle with high-dimensional data, especially when there are no clear density gaps between modes. Our method addresses this limitation by leveraging a two-stage approach, first employing a Gaussian or Student's t mixture model to overcluster the data, and then hierarchically merging clusters based on the induced density landscape. This approach yields state-of-the-art clustering performance while also providing a meaningful hierarchy, making it a valuable tool for exploratory data analysis. Code is available at https://github.com/ecker-lab/tneb clustering.
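The two-stage idea (overcluster with a mixture model, then merge along the density landscape) can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not spell out the merge criterion, so the sketch assumes a one-dimensional mixture that has already been fitted, and it scores a pair of components by the lowest mixture density on the straight segment between their means. The names `mixture_density`, `bridge_density` and `merge_hierarchy` are hypothetical. In one dimension every path between two means has to cross that segment, so the segment minimum is the best achievable bottleneck; in higher dimensions a straight line only gives a lower bound on the best path.

```python
import math

def mixture_density(x, components):
    """Density of a 1-D Gaussian mixture; components are (weight, mean, std)."""
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in components
    )

def bridge_density(a, b, components, steps=200):
    """Lowest mixture density on the segment between the means of a and b."""
    ma, mb = components[a][1], components[b][1]
    return min(
        mixture_density(ma + (mb - ma) * t / steps, components)
        for t in range(steps + 1)
    )

def merge_hierarchy(components):
    """Greedily merge the two clusters connected by the densest bridge.

    Returns a list of (members_a, members_b, bridge_density) in merge order.
    """
    clusters = [frozenset([i]) for i in range(len(components))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-linkage style: best bridge over all member pairs.
                d = max(
                    bridge_density(a, b, components)
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d > best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Assumed toy input: four overclustered components forming two well separated modes.
components = [(0.25, 0.0, 1.0), (0.25, 1.0, 1.0), (0.25, 10.0, 1.0), (0.25, 11.0, 1.0)]
merges = merge_hierarchy(components)
```

On this toy input the nearby components are merged first and the two resulting groups are joined last, across a much lower density bridge, which is the kind of hierarchy the abstract describes.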


DISCO: Internal Evaluation of Density-Based Clustering

arXiv.org Machine Learning

In density-based clustering, clusters are areas of high object density separated by lower object density areas. This notion supports arbitrarily shaped clusters and automatic detection of noise points that do not belong to any cluster. However, it is challenging to adequately evaluate the quality of density-based clustering results. Even though some existing cluster validity indices (CVIs) target arbitrarily shaped clusters, none of them captures the quality of the labeled noise. In this paper, we propose DISCO, a Density-based Internal Score for Clustering Outcomes, which is the first CVI that also evaluates the quality of noise labels. DISCO reliably evaluates density-based clusters of arbitrary shape by assessing compactness and separation. It also introduces a direct assessment of noise labels for any given clustering. Our experiments show that DISCO evaluates density-based clusterings more consistently than its competitors. It is additionally the first method to evaluate the complete labeling of density-based clustering methods, including noise labels.


MNIST-Nd: a set of naturalistic datasets to benchmark clustering across dimensions

arXiv.org Machine Learning

Driven by advances in recording technology, large-scale high-dimensional datasets have emerged across many scientific disciplines. Especially in biology, clustering is often used to gain insights into the structure of such datasets, for instance to understand the organization of different cell types. However, clustering is known to scale poorly to high dimensions, even though the exact impact of dimensionality is unclear as current benchmark datasets are mostly two-dimensional. Here we propose MNIST-Nd, a set of synthetic datasets that share a key property of real-world datasets, namely that individual samples are noisy and clusters do not perfectly separate. MNIST-Nd is obtained by training mixture variational autoencoders with 2 to 64 latent dimensions on MNIST, resulting in six datasets with comparable structure but varying dimensionality. It thus offers the chance to disentangle the impact of dimensionality on clustering. Preliminary benchmarks of common clustering algorithms on MNIST-Nd suggest that Leiden is the most robust as dimensionality grows.


Distinguished In Uniform: Self Attention Vs. Virtual Nodes

arXiv.org Artificial Intelligence

Graph Transformers (GTs) such as SAN and GPS are graph processing models that combine Message-Passing GNNs (MPGNNs) with global Self-Attention. They were shown to be universal function approximators, with two reservations: 1. The initial node features must be augmented with certain positional encodings. 2. The approximation is non-uniform: Graphs of different sizes may require a different approximating network. We first clarify that this form of universality is not unique to GTs: Using the same positional encodings, pure MPGNNs and even 2-layer MLPs are also non-uniform universal approximators. We then consider uniform expressivity: The target function is to be approximated by a single network for graphs of all sizes. There, we compare GTs to the more efficient MPGNN + Virtual Node architecture. The essential difference between the two model definitions is in their global computation method: Self-Attention vs. Virtual Node. We prove that neither model is a uniform-universal approximator, before proving our main result: Neither model's uniform expressivity subsumes the other's. We demonstrate the theory with experiments on synthetic data. We further augment our study with real-world datasets, observing mixed results which indicate no clear ranking in practice as well. In the field of graph learning, message-passing GNNs have long been the undisputed model architecture, even though their basic form is upper bounded in expressivity by the 1-dimensional Weisfeiler-Leman algorithm (Morris et al., 2020; Xu et al., 2019).
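The two global computation methods contrasted in this abstract can be illustrated with a toy sketch. It is not the paper's construction or proof: it assumes scalar node features, no learned parameters, dot-product scores, and a sum-aggregating virtual node, and the function names are hypothetical. It only shows one qualitative difference: softmax attention returns a normalized average and is unchanged when every node is duplicated, whereas an unnormalized virtual-node sum grows with graph size.

```python
import math

def self_attention_readout(feats):
    """Global step via softmax attention over all nodes (scalar features).

    Each node uses its own feature as query; keys and values are the raw
    features, so the output is a convex combination of all node features.
    """
    out = []
    for q in feats:
        scores = [q * k for k in feats]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, feats)) / total)
    return out

def virtual_node_readout(feats):
    """Global step via a virtual node: sum all features, broadcast the sum."""
    virtual = sum(feats)
    return [x + virtual for x in feats]

feats = [1.0, 2.0]
doubled = feats * 2  # same features, twice as many nodes

attention_small = self_attention_readout(feats)
attention_large = self_attention_readout(doubled)
virtual_small = virtual_node_readout(feats)
virtual_large = virtual_node_readout(doubled)
```

Comparing `attention_small` with the first two entries of `attention_large` shows no change, while `virtual_small` and `virtual_large` differ because the broadcast sum doubles.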


Boosting, Voting Classifiers and Randomized Sample Compression Schemes

arXiv.org Artificial Intelligence

In boosting, we aim to leverage multiple weak learners to produce a strong learner. At the center of this paradigm lies the concept of building the strong learner as a voting classifier, which outputs a weighted majority vote of the weak learners. While many successful boosting algorithms, such as the iconic AdaBoost, produce voting classifiers, their theoretical performance has long remained sub-optimal: the best known bounds on the number of training examples necessary for a voting classifier to obtain a given accuracy have so far always contained at least two logarithmic factors above what is known to be achievable by general weak-to-strong learners. In this work, we break this barrier by proposing a randomized boosting algorithm that outputs voting classifiers whose generalization error contains a single logarithmic dependency on the sample size. We obtain this result by building a general framework that extends sample compression methods to support randomized learning algorithms based on sub-sampling.
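To make the notion of a voting classifier concrete, here is a minimal sketch of textbook AdaBoost with one-dimensional decision stumps as weak learners. It is not the randomized algorithm proposed in the paper; the stump representation, the helper names and the toy data are assumptions made for illustration.

```python
import math

def stump_predict(stump, x):
    """Decision stump (threshold, sign): predicts sign if x > threshold, else -sign."""
    threshold, sign = stump
    return sign if x > threshold else -sign

def weighted_error(stump, xs, ys, weights):
    """Total weight of the training points the stump misclassifies."""
    return sum(w for x, y, w in zip(xs, ys, weights) if stump_predict(stump, x) != y)

def best_stump(xs, ys, weights):
    """Weak learner: exhaustively pick the stump with the lowest weighted error."""
    uniq = sorted(set(xs))
    thresholds = [uniq[0] - 1.0] + [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    best, best_err = None, float("inf")
    for threshold in thresholds:
        for sign in (1, -1):
            err = weighted_error((threshold, sign), xs, ys, weights)
            if err < best_err:
                best, best_err = (threshold, sign), err
    return best, best_err

def fit_adaboost(xs, ys, rounds):
    """Return a list of (alpha, stump) pairs; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [
            w * math.exp(-alpha * y * stump_predict(stump, x))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def vote(ensemble, x):
    """Voting classifier: sign of the alpha-weighted sum of stump predictions."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump classifies perfectly.
xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
ensemble = fit_adaboost(xs, ys, rounds=3)
predictions = [vote(ensemble, x) for x in xs]
```

Each stump alone misclassifies at least two of the six points, but the weighted vote of three stumps labels all of them correctly, which is the weak-to-strong effect the abstract refers to.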


Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark

arXiv.org Artificial Intelligence

The recent Long-Range Graph Benchmark (LRGB, Dwivedi et al. 2022) introduced a set of graph learning tasks strongly dependent on long-range interaction between vertices. Empirical evidence suggests that on these tasks Graph Transformers significantly outperform Message Passing GNNs (MPGNNs). In this paper, we carefully reevaluate multiple MPGNN baselines as well as the Graph Transformer GPS (Rampášek et al. 2022) on LRGB. Through a rigorous empirical analysis, we demonstrate that the reported performance gap is overestimated due to suboptimal hyperparameter choices. It is noteworthy that across multiple datasets the performance gap completely vanishes after basic hyperparameter optimization. In addition, we discuss the impact of lacking feature normalization for LRGB's vision datasets and highlight a spurious implementation of LRGB's link prediction metric. The principal aim of our paper is to establish a higher standard of empirical rigor within the graph machine learning community.


Walking Out of the Weisfeiler Leman Hierarchy: Graph Learning Beyond Message Passing

arXiv.org Artificial Intelligence

We propose CRaWl, a novel neural network architecture for graph learning. Like graph neural networks, CRaWl layers update node features on a graph and thus can freely be combined or interleaved with GNN layers. Yet CRaWl operates fundamentally differently from message passing graph neural networks. CRaWl layers extract and aggregate information on subgraphs appearing along random walks through a graph using 1D Convolutions. Thereby, it detects long range interactions and computes non-local features. As the theoretical basis for our approach, we prove a theorem stating that the expressiveness of CRaWl is incomparable with that of the Weisfeiler Leman algorithm and hence with graph neural networks. That is, there are functions expressible by CRaWl, but not by GNNs and vice versa. This result extends to higher levels of the Weisfeiler Leman hierarchy and thus to higher-order GNNs. Empirically, we show that CRaWl matches state-of-the-art GNN architectures across a multitude of benchmark datasets for classification and regression on graphs.
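The general pattern described here (sample random walks, run a 1D convolution over the sequence of features along each walk, pool the results back onto the nodes) can be sketched as follows. This is a heavily simplified illustration and not the CRaWl architecture: it assumes scalar node features, a fixed kernel instead of learned weights, one walk per start node, and mean pooling, and it does not model how the paper encodes subgraph structure. All function names are hypothetical.

```python
import random

def random_walk(adj, start, length, rng):
    """Uniform random walk with `length` steps; adj maps node -> neighbor list."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(adj[walk[-1]]))
    return walk

def conv1d(seq, kernel):
    """Valid 1-D cross-correlation of a scalar sequence with a kernel."""
    k = len(kernel)
    return [
        sum(kernel[j] * seq[i + j] for j in range(k))
        for i in range(len(seq) - k + 1)
    ]

def walk_layer(adj, feats, kernel, walk_len, rng):
    """One toy layer: walk from every node, convolve, mean-pool per node."""
    sums = {v: 0.0 for v in adj}
    counts = {v: 0 for v in adj}
    center = len(kernel) // 2
    for start in adj:
        walk = random_walk(adj, start, walk_len, rng)
        out = conv1d([feats[v] for v in walk], kernel)
        for i, value in enumerate(out):
            node = walk[i + center]  # node at the center of the window
            sums[node] += value
            counts[node] += 1
    # Nodes that never appeared at a window center keep their old feature.
    return {v: (sums[v] / counts[v] if counts[v] else feats[v]) for v in adj}

# Assumed toy graph: a cycle on four nodes.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
rng = random.Random(0)
feats = {v: 1.0 for v in adj}
updated = walk_layer(adj, feats, kernel=[1 / 3, 1 / 3, 1 / 3], walk_len=6, rng=rng)
```

Because each convolution window covers several consecutive walk steps, a node's updated feature can depend on nodes that are more than one hop away, which hints at how walk-based layers reach beyond a single round of message passing.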


AdaBoost is not an Optimal Weak to Strong Learner

arXiv.org Artificial Intelligence

AdaBoost is a classic boosting algorithm for combining multiple inaccurate classifiers produced by a weak learner, to produce a strong learner with arbitrarily high accuracy when given enough training data. Determining the optimal number of samples necessary to obtain a given accuracy of the strong learner is a basic learning theoretic question. Larsen and Ritzert (NeurIPS'22) recently presented the first provably optimal weak-to-strong learner. However, their algorithm is somewhat complicated and it remains an intriguing question whether the prototypical boosting algorithm AdaBoost also makes optimal use of training samples. In this work, we answer this question in the negative. Concretely, we show that the sample complexity of AdaBoost, and of other classic variations thereof, is sub-optimal by at least one logarithmic factor in the desired accuracy of the strong learner.


Optimal Weak to Strong Learning

arXiv.org Artificial Intelligence

The field of boosting started from a classic question in learning theory asking whether classifiers that are just slightly better than random guessing can be used to create a classifier with arbitrarily high accuracy when given enough training data. This question was initially asked by Kearns and Valiant [15, 16] and ignited the line of research that eventually led to the development of AdaBoost [7], the prototype boosting algorithm to date. AdaBoost carefully combines the predictions of several inaccurate classifiers trained with a focus on different parts of the training data to come up with a voting classifier that performs well everywhere. We quantify the performance of an inaccurate learner by its advantage $\gamma$ over random guessing. Said loosely, a $\gamma$-weak learner will correctly classify new data points with probability at least $1/2+\gamma$. In contrast, given $0 < \varepsilon, \delta < 1$ and enough training data, a strong learner outputs, with probability $1-\delta$ over the choice of the training data and possible random choices of the algorithm, a hypothesis that correctly classifies new data points with probability at least $1-\varepsilon$.
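The two guarantees can be written side by side, with $\gamma$ for the advantage, $\varepsilon$ for the error and $\delta$ for the failure probability. The abstract states them only loosely, so the quantifiers below are one standard reading rather than the paper's exact definitions; the distribution $\mathcal{D}$, the training sample $S$, the hypothesis $h_S$ and the sample size bound $m(\varepsilon, \delta)$ are notation introduced here.

```latex
% gamma-weak learner: the returned hypothesis h beats random guessing by gamma
\[
  \Pr_{(x,y)\sim\mathcal{D}}\bigl[h(x) = y\bigr] \;\ge\; \tfrac{1}{2} + \gamma .
\]
% strong learner: for all 0 < epsilon, delta < 1 and n >= m(epsilon, delta) samples
\[
  \Pr_{S\sim\mathcal{D}^{n}}\Bigl[\,\Pr_{(x,y)\sim\mathcal{D}}\bigl[h_S(x) = y\bigr] \ge 1 - \varepsilon\Bigr] \;\ge\; 1 - \delta .
\]
```

The outer probability in the second line also covers any internal randomness of the learning algorithm.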


Graph Machine Learning for Design of High-Octane Fuels

arXiv.org Artificial Intelligence

Fuels with high knock resistance enable modern spark-ignition engines to achieve high efficiency and thus low CO2 emissions. Identification of molecules with desired autoignition properties indicated by a high research octane number and a high octane sensitivity is therefore of great practical relevance and can be supported by computer-aided molecular design (CAMD). Recent developments in the field of graph machine learning (graph-ML) provide novel, promising tools for CAMD. We propose a modular graph-ML CAMD framework that integrates generative graph-ML models with graph neural networks and optimization, enabling the design of molecules with desired ignition properties in a continuous molecular space. In particular, we explore the potential of Bayesian optimization and genetic algorithms in combination with generative graph-ML models. The graph-ML CAMD framework successfully identifies well-established high-octane components. It also suggests new candidates, one of which we experimentally investigate and use to illustrate the need for further autoignition training data.