Adaptive Physics-informed Neural Networks: A Survey
Torres, Edgar, Schiefer, Jonathan, Niepert, Mathias
Advances in machine learning have led to important applications in various fields, such as computer vision (enabling technologies like self-driving cars), natural language processing (powering intelligent agents and chatbots), and image generation (facilitating media creation). Motivated by this success, there has been growing interest in developing machine learning (ML) solutions to problems in science and engineering. Unlike fields where data is abundant or easily obtained, however, science and engineering often face data scarcity because generating data through experiments or simulations is expensive. Facilitating the development of ML approaches in these disciplines therefore requires methods that are both data-efficient and computationally efficient. Other domains have tackled similar problems with techniques such as transfer learning, meta-learning, and few-shot learning, suggesting significant potential for applying these techniques in science and engineering. One application where such efficient ML models can be particularly beneficial is approximating the solutions of partial differential equations (PDEs). PDEs are fundamental for modeling and describing natural phenomena across scientific and engineering domains. Traditionally, these equations are solved numerically, which can become prohibitively expensive, especially for nonlinear and high-dimensional problems [Han et al., 2018]. This challenge limits their application in areas where fast evaluation of a PDE is required.
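To make the setting concrete, here is a minimal PyTorch sketch of the physics-informed loss underlying the PINN family the survey covers: a network is penalized on the residual of a PDE at sampled collocation points. The 1D heat equation, the diffusivity `alpha`, and the tiny `u_net` are illustrative assumptions, not details taken from the survey.

```python
# Minimal sketch of a physics-informed loss for the 1D heat equation
# u_t = alpha * u_xx. The network u_net, the equation, and alpha are
# illustrative choices, not taken from the survey itself.
import torch

u_net = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1)
)
alpha = 0.1  # assumed diffusivity

def pde_residual_loss(x, t):
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = u_net(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return ((u_t - alpha * u_xx) ** 2).mean()

# Collocation points sampled from the domain; no simulation data needed.
loss = pde_residual_loss(torch.rand(256, 1), torch.rand(256, 1))
loss.backward()
```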
Preference-Based Gradient Estimation for ML-Based Approximate Combinatorial Optimization
Mielke, Arman, Bauknecht, Uwe, Strauss, Thilo, Niepert, Mathias
Combinatorial optimization (CO) problems arise in a wide range of fields, from medicine to logistics and manufacturing. While exact solutions are often not necessary, many applications require finding high-quality solutions quickly. For this purpose, we propose a data-driven approach to improve existing non-learned approximation algorithms for CO. We parameterize the approximation algorithm and train a graph neural network (GNN) to predict parameter values that lead to the best possible solutions. Our pipeline is trained end-to-end in a self-supervised fashion using gradient estimation, treating the approximation algorithm as a black box. We propose a novel gradient estimation scheme for this purpose, which we call preference-based gradient estimation. Our approach combines the benefits of the neural network and the non-learned approximation algorithm: the GNN leverages the information from the dataset to allow the approximation algorithm to find better solutions, while the approximation algorithm guarantees that the solution is feasible. We validate our approach on two well-known combinatorial optimization problems, the travelling salesman problem and the minimum k-cut problem, and show that our method is competitive with state-of-the-art learned CO solvers.
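The abstract does not spell out the preference-based estimator itself, so the sketch below substitutes a generic finite-difference black-box gradient to illustrate the structure of such a training loop: a network predicts parameters, a non-differentiable solver scores them, and an estimated gradient is pushed back through the network. `blackbox` and its toy cost are hypothetical stand-ins, not the paper's method.

```python
# A generic black-box gradient sketch (finite differences), standing in for
# the paper's preference-based estimator, whose exact update rule is not
# described in this abstract. `blackbox` is a hypothetical non-differentiable
# approximation algorithm mapping predicted parameters to a solution cost.
import torch

def blackbox(params: torch.Tensor) -> float:
    return float(((params - 1.0) ** 2).sum())  # toy stand-in cost

net = torch.nn.Linear(8, 4)           # stand-in for the GNN
x = torch.randn(8)                    # stand-in for instance features
params = net(x)

# Two-point estimate of d cost / d params, one coordinate at a time.
eps = 1e-2
grad = torch.zeros_like(params)
with torch.no_grad():
    for i in range(params.numel()):
        e = torch.zeros_like(params)
        e[i] = eps
        grad[i] = (blackbox(params + e) - blackbox(params - e)) / (2 * eps)

# Backpropagate the estimated gradient through the network's parameters.
params.backward(grad)
```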
Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks
Bechler-Speicher, Maya, Finkelshtein, Ben, Frasca, Fabrizio, Müller, Luis, Tönshoff, Jan, Siraudin, Antoine, Zaverkin, Viktor, Bronstein, Michael M., Niepert, Mathias, Perozzi, Bryan, Galkin, Mikhail, Morris, Christopher
While machine learning on graphs has demonstrated promise in drug design and molecular property prediction, significant benchmarking challenges hinder its further progress and relevance. Current benchmarking practices often lack focus on transformative, real-world applications, favoring narrow domains like two-dimensional molecular graphs over broader, impactful areas such as combinatorial optimization, relational databases, or chip design. Additionally, many benchmark datasets poorly represent the underlying data, leading to inadequate abstractions and misaligned use cases. Fragmented evaluations and an excessive focus on accuracy further exacerbate these issues, incentivizing overfitting rather than fostering generalizable insights. These limitations have prevented the development of truly useful graph foundation models. This position paper calls for a paradigm shift toward more meaningful benchmarks, rigorous evaluation protocols, and stronger collaboration with domain experts to drive impactful and reliable advances in graph learning research and unlock its potential.
Symmetry-Preserving Diffusion Models via Target Symmetrization
Tong, Vinh, Ye, Yun, Hoang, Trung-Dung, Liu, Anji, Broeck, Guy Van den, Niepert, Mathias
Diffusion models are powerful tools for capturing complex distributions, but modeling data with inherent symmetries, such as molecular structures, remains challenging. Equivariant denoisers are commonly used to address this, but they introduce architectural complexity and optimization challenges, including noisy gradients and convergence issues. We propose a novel approach that enforces equivariance through a symmetrized loss function, which applies a time-dependent weighted averaging operation over group actions to the model's prediction target. This ensures equivariance without explicit architectural constraints and reduces gradient variance, leading to more stable and efficient optimization. Our method uses Monte Carlo sampling to estimate the average, incurring minimal computational overhead. We provide theoretical guarantees of equivariance for the minimizer of our loss function and demonstrate its effectiveness on synthetic datasets and the molecular conformation generation task using the GEOM-QM9 dataset. Experiments show improved sample quality compared to existing methods, highlighting the potential of our approach to enhance the scalability and practicality of equivariant diffusion models in generative tasks.
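A rough PyTorch illustration of the Monte Carlo group-averaging machinery follows, using random 2D rotations as the group; the `weight` function is a placeholder, since the paper's specific time-dependent weighting is not given in the abstract.

```python
# Monte Carlo sketch of a weighted average over group actions (random 2D
# rotations) applied to a prediction target. The weight function here is a
# placeholder; the paper specifies a particular time-dependent weighting.
import torch

def random_rotation_2d() -> torch.Tensor:
    theta = torch.rand(()) * 2 * torch.pi
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

def weight(rotated: torch.Tensor, x_t: torch.Tensor, t: float) -> torch.Tensor:
    # Placeholder: weight group elements by closeness to the noisy input.
    return torch.exp(-((rotated - x_t) ** 2).sum() / max(t, 1e-3))

def symmetrized_target(target, x_t, t, num_samples=16):
    num = torch.zeros_like(target)
    den = torch.zeros(())
    for _ in range(num_samples):
        R = random_rotation_2d()
        g_target = target @ R.T          # apply the group action
        w = weight(g_target, x_t, t)
        num, den = num + w * g_target, den + w
    return num / den

target = torch.randn(5, 2)   # e.g., clean coordinates of 5 points
x_t = torch.randn(5, 2)      # noisy input at diffusion time t
sym_target = symmetrized_target(target, x_t, t=0.5)
```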
Tractable Transformers for Flexible Conditional Generation
Liu, Anji, Liu, Xuejie, Zhao, Dayuan, Niepert, Mathias, Liang, Yitao, Broeck, Guy Van den
Non-autoregressive (NAR) generative models are valuable because they can handle diverse conditional generation tasks in a more principled way than their autoregressive (AR) counterparts, which are constrained by sequential dependency requirements. Recent advancements in NAR models, such as diffusion language models, have demonstrated superior performance in unconditional generation compared to AR models (e.g., GPTs) of similar sizes. However, such improvements do not always lead to improved conditional generation performance. We show that a key reason for this gap is the difficulty in generalizing to conditional probability queries unseen during training. As a result, strong unconditional generation performance does not guarantee high-quality conditional generation. This paper proposes Tractable Transformers (Tracformer), a Transformer-based generative model that is more robust to different conditional generation tasks. Unlike existing models that rely solely on global contextual features derived from full inputs, Tracformers incorporate a sparse Transformer encoder to capture both local and global contextual information. This information is routed through a decoder for conditional generation. Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines.
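As a loose illustration of mixing local and global context in a sparse encoder, the sketch below builds a boolean attention mask with a local band plus designated global positions; the window size and global indices are arbitrary choices, not Tracformer's actual design.

```python
# Illustrative sparse attention mask mixing local and global context; the
# window size and global positions are arbitrary, not Tracformer's design.
import torch

def local_global_mask(seq_len: int, window: int, global_idx: list[int]) -> torch.Tensor:
    i = torch.arange(seq_len)
    # Local band: each token attends to neighbors within `window`.
    mask = (i[:, None] - i[None, :]).abs() <= window
    # Global tokens attend everywhere and are attended to by everyone.
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

mask = local_global_mask(seq_len=12, window=2, global_idx=[0])
# `mask` can be passed (inverted as required) to an attention implementation.
```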
On Zero-Initialized Attention: Optimal Prompt and Gating Factor Estimation
Diep, Nghiem T., Nguyen, Huy, Nguyen, Chau, Le, Minh, Nguyen, Duy M. H., Sonntag, Daniel, Niepert, Mathias, Ho, Nhat
The LLaMA-Adapter has recently emerged as an efficient fine-tuning technique for LLaMA models, leveraging zero-initialized attention to stabilize training and enhance performance. However, despite its empirical success, the theoretical foundations of zero-initialized attention remain largely unexplored. In this paper, we provide a rigorous theoretical analysis, establishing a connection between zero-initialized attention and mixture-of-expert models. We prove that both linear and non-linear prompts, along with gating functions, can be optimally estimated, with non-linear prompts offering greater flexibility for future applications. Empirically, we validate our findings on the open LLM benchmarks, demonstrating that non-linear prompts outperform linear ones. Notably, even with limited training data, both prompt types consistently surpass vanilla attention, highlighting the robustness and adaptability of zero-initialized attention.
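For readers unfamiliar with the mechanism being analyzed, here is a simplified PyTorch variant of zero-initialized attention: a learnable gate, initialized to zero, scales the contribution of learnable prompt tokens, so training starts exactly from the frozen model's behavior. The shapes, the prompt length, and the additive (rather than joint-softmax) combination are simplifications, not the LLaMA-Adapter's exact formulation.

```python
# Simplified sketch of zero-initialized gating for prompt attention: at
# initialization the gate is zero, so the layer reduces to vanilla attention.
import torch

class ZeroInitPromptAttention(torch.nn.Module):
    def __init__(self, dim: int, prompt_len: int):
        super().__init__()
        self.prompt = torch.nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        self.gate = torch.nn.Parameter(torch.zeros(1))  # zero-initialized

    def forward(self, q, k, v):
        scale = k.shape[-1] ** 0.5
        # Standard attention over the original keys/values ...
        base = torch.softmax(q @ k.T / scale, dim=-1) @ v
        # ... plus a gated contribution from the learnable prompt tokens.
        prompt_attn = torch.softmax(q @ self.prompt.T / scale, dim=-1)
        return base + torch.tanh(self.gate) * (prompt_attn @ self.prompt)

layer = ZeroInitPromptAttention(dim=32, prompt_len=4)
q = k = v = torch.randn(10, 32)
out = layer(q, k, v)  # at init, gate == 0, so out equals vanilla attention
```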
Adaptive Width Neural Networks
Errica, Federico, Christiansen, Henrik, Zaverkin, Viktor, Niepert, Mathias, Alesiani, Francesco
For almost 70 years, researchers have mostly relied on hyper-parameter tuning to pick the width of neural networks' layers out of many possible choices. This paper challenges the status quo by introducing an easy-to-use technique to learn an unbounded width of a neural network's layer during training. The technique does not rely on alternating optimization or hand-crafted gradient heuristics; rather, it jointly optimizes the width and the parameters of each layer via simple backpropagation. We apply the technique to a broad range of data domains such as tables, images, texts, and graphs, showing how the width adapts to the task's difficulty. By imposing a soft ordering of importance among neurons, it is possible to truncate the trained network at virtually zero cost, achieving a smooth trade-off between performance and compute resources in a structured way. Alternatively, one can dynamically compress the network with no performance degradation. In light of recent foundation models trained on large datasets, where billions of parameters are believed to be required and hyper-parameter tuning is infeasible due to huge training costs, our approach stands as a viable alternative for width learning.
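One hypothetical way to realize such a soft importance ordering is sketched below in PyTorch: each hidden unit is scaled by a learnable score under a fixed decaying envelope, so low-importance tail units can be truncated after training. This illustrates the idea only; the paper's actual parameterization may differ.

```python
# Hypothetical width-adaptive layer: hidden units are scaled by learnable
# importance scores under a fixed decaying envelope, imposing a soft order
# so that the tail of the layer can be truncated after training.
import torch

class SoftOrderedLinear(torch.nn.Module):
    def __init__(self, in_dim: int, max_width: int, decay: float = 0.97):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, max_width)
        self.scores = torch.nn.Parameter(torch.zeros(max_width))
        # Fixed decaying envelope makes later neurons cheaper to prune.
        self.register_buffer("envelope", decay ** torch.arange(max_width).float())

    def importance(self) -> torch.Tensor:
        return torch.sigmoid(self.scores) * self.envelope

    def forward(self, x):
        return torch.relu(self.linear(x)) * self.importance()

layer = SoftOrderedLinear(in_dim=16, max_width=128)
h = layer(torch.randn(4, 16))
# After training, truncate: keep only units whose importance exceeds a tol.
keep = layer.importance() > 1e-3
```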
MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning
Manolache, Andrei, Tantaru, Dragos, Niepert, Mathias
In this work, we propose a simple transformer-based baseline for multimodal molecular representation learning, integrating three distinct modalities: SMILES strings, 2D graph representations, and 3D conformers of molecules. A key aspect of our approach is the aggregation of 3D conformers, allowing the model to account for the fact that molecules can adopt multiple conformations, an important factor for accurate molecular representation. The tokens for each modality are extracted using modality-specific encoders: a transformer for SMILES strings, a message-passing neural network for 2D graphs, and an equivariant neural network for 3D conformers. The flexibility and modularity of this framework enable easy adaptation and replacement of these encoders, making the model highly versatile for different molecular tasks. The extracted tokens are then combined into a unified multimodal sequence, which is processed by a downstream transformer for prediction tasks. To efficiently scale our model for large multimodal datasets, we utilize Flash Attention 2 and bfloat16 precision. Despite its simplicity, our approach achieves state-of-the-art results across multiple datasets, demonstrating its effectiveness as a strong baseline for multimodal molecular representation learning.
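The fusion step can be pictured with the following schematic PyTorch snippet, where stand-in token tensors from the three modality encoders are concatenated into one sequence for a downstream transformer; the dimensions and token counts are arbitrary assumptions.

```python
# Schematic of the fusion step: tokens from per-modality encoders are
# concatenated into one sequence for a downstream transformer. The token
# tensors here are random stand-ins for the outputs of a SMILES transformer,
# a message-passing GNN, and an equivariant 3D network.
import torch

dim = 64
smiles_tokens = torch.randn(1, 20, dim)   # from a SMILES transformer
graph_tokens = torch.randn(1, 15, dim)    # from a 2D message-passing GNN
conf_tokens = torch.randn(1, 30, dim)     # pooled over 3D conformers

fusion = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)
multimodal = torch.cat([smiles_tokens, graph_tokens, conf_tokens], dim=1)
pred = fusion(multimodal).mean(dim=1)     # pool for a property-prediction head
```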
LoGra-Med: Long Context Multi-Graph Alignment for Medical Vision-Language Model
Nguyen, Duy M. H., Diep, Nghiem T., Nguyen, Trung Q., Le, Hoang-Bao, Nguyen, Tai, Nguyen, Tien, Nguyen, TrungTin, Ho, Nhat, Xie, Pengtao, Wattenhofer, Roger, Zhou, James, Sonntag, Daniel, Niepert, Mathias
State-of-the-art medical multi-modal large language models (med-MLLMs), like LLaVA-Med or BioMedGPT, leverage instruction-following data in pre-training. However, these models primarily focus on scaling the model size and data volume to boost performance while relying mainly on autoregressive learning objectives. Surprisingly, we reveal that such learning schemes might result in a weak alignment between the vision and language modalities, making these models highly reliant on extensive pre-training datasets, which poses a significant challenge in medical domains given the expensive and time-consuming nature of curating high-quality instruction-following instances. We address this with LoGra-Med, a new multi-graph alignment algorithm that enforces triplet correlations across image modalities, conversation-based descriptions, and extended captions. This helps the model capture contextual meaning, handle linguistic variability, and build cross-modal associations between visuals and text. To scale our approach, we designed an efficient end-to-end learning scheme using black-box gradient estimation, enabling faster LLaMA 7B training. Our results show that LoGra-Med matches LLaVA-Med's performance on 600K image-text pairs for Medical VQA and significantly outperforms it when trained on 10% of the data. For example, on VQA-RAD, we exceed LLaVA-Med by 20.13% and nearly match the 100% pre-training score (72.52% vs. 72.64%). We also surpass SOTA methods like BioMedGPT on visual chatbots and RadFM on zero-shot image classification with VQA, highlighting the effectiveness of multi-graph alignment.
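Since the abstract leaves the multi-graph alignment objective at a high level, the sketch below shows a generic pairwise InfoNCE stand-in for enforcing agreement across the three modalities (image, conversation description, extended caption); the paper's triplet-correlation objective is more involved than this.

```python
# Generic contrastive stand-in for aligning three modalities; the abstract's
# multi-graph alignment objective is more involved than pairwise InfoNCE.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / tau
    labels = torch.arange(a.shape[0])  # matching pairs lie on the diagonal
    return F.cross_entropy(logits, labels)

img = torch.randn(32, 256)   # image embeddings
conv = torch.randn(32, 256)  # conversation-description embeddings
cap = torch.randn(32, 256)   # extended-caption embeddings

# Enforce agreement across all three pairs of modalities.
loss = info_nce(img, conv) + info_nce(img, cap) + info_nce(conv, cap)
```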
Discrete Copula Diffusion
Liu, Anji, Broadrick, Oliver, Niepert, Mathias, Broeck, Guy Van den
Discrete diffusion models have recently shown significant progress in modeling complex data, such as natural languages and DNA sequences. However, unlike diffusion models for continuous data, which can generate high-quality samples in just a few denoising steps, modern discrete diffusion models still require hundreds or even thousands of denoising steps to perform well. In this paper, we identify a fundamental limitation that prevents discrete diffusion models from achieving strong performance with fewer steps: they fail to capture dependencies between output variables at each denoising step. To address this issue, we provide a formal explanation and introduce a general approach to supplement the missing dependency information by incorporating another deep generative model, termed the copula model. Our method does not require fine-tuning either the diffusion model or the copula model, yet it enables high-quality sample generation with significantly fewer denoising steps. When we apply this approach to autoregressive copula models, the combined model outperforms both models individually in unconditional and conditional text generation. Specifically, the hybrid model achieves better (un)conditional text generation using 8 to 32 times fewer denoising steps than the diffusion model alone. In addition to presenting an effective discrete diffusion generation algorithm, this paper emphasizes the importance of modeling inter-variable dependencies in discrete diffusion.
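As a toy picture of why dependency information helps, the snippet below samples left to right from the renormalized product of per-position marginals (as an independent denoising step would produce) and an autoregressive model's conditionals; the paper's actual combination rule differs, and `ar_conditional` here is a random stand-in.

```python
# Toy illustration of supplementing per-position marginals with an
# autoregressive model's dependency information: sample left to right from
# the renormalized product of the two distributions. The paper's exact
# combination rule differs from this simple product of experts.
import torch

vocab, length = 50, 8
diff_marginals = torch.softmax(torch.randn(length, vocab), dim=-1)

def ar_conditional(prefix: torch.Tensor) -> torch.Tensor:
    # Stand-in for an autoregressive copula model p(x_i | x_<i).
    return torch.softmax(torch.randn(vocab), dim=-1)

tokens = []
for i in range(length):
    joint = diff_marginals[i] * ar_conditional(torch.tensor(tokens))
    joint = joint / joint.sum()  # renormalize the product of experts
    tokens.append(int(torch.multinomial(joint, 1)))
```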