Tractable Learning for Complex Probability Queries

Jessa Bekker, Jesse Davis, Arthur Choi, Adnan Darwiche, Guy Van den Broeck

Neural Information Processing Systems

Tractable learning aims to learn probabilistic models where inference is guaranteed to be efficient. However, the particular class of queries that is tractable depends on the model and underlying representation. Usually this class consists of MPE or conditional probabilities Pr(x | y) for joint assignments x, y. We propose a tractable learner that guarantees efficient inference for a broader class of queries. It simultaneously learns a Markov network and its tractable circuit representation, in order to guarantee and measure tractability. Our approach differs from earlier work by using Sentential Decision Diagrams (SDDs) as the tractable language instead of Arithmetic Circuits (ACs). SDDs have desirable properties, which more general representations such as ACs lack, that enable basic primitives for Boolean circuit compilation. This allows us to support a broader class of complex probability queries, including counting, threshold, and parity queries, in polytime.
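To make the query classes concrete, the sketch below spells out what "threshold" and "parity" probability queries mean on a toy distribution. It enumerates a fully tabulated joint over four binary variables; this brute-force enumeration is exactly what an SDD-based circuit avoids doing exponentially, so the code only illustrates the query semantics, not the paper's algorithm. The distribution itself is an arbitrary assumption for illustration.

```python
import itertools

# Hypothetical toy joint distribution over 4 binary variables, given as a
# fully enumerated probability table. A tractable circuit representation
# would answer these queries without this exponential enumeration.
def make_toy_distribution():
    states = list(itertools.product([0, 1], repeat=4))
    weights = [1 + sum(s) for s in states]   # arbitrary positive weights
    z = sum(weights)
    return {s: w / z for s, w in zip(states, weights)}

def threshold_query(dist, k):
    # "threshold" query: Pr(at least k variables are true)
    return sum(p for s, p in dist.items() if sum(s) >= k)

def parity_query(dist):
    # "parity" query: Pr(an odd number of variables are true)
    return sum(p for s, p in dist.items() if sum(s) % 2 == 1)

dist = make_toy_distribution()
```

On a circuit representation these queries become polytime because the threshold/parity constraint can itself be compiled into an SDD and conjoined with the model, which is the kind of Boolean primitive the abstract refers to.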


Scale-Wise VAR is Secretly Discrete Diffusion

Kumar, Amandeep, Nair, Nithin Gopalakrishnan, Patel, Vishal M.

arXiv.org Artificial Intelligence

Autoregressive (AR) transformers have emerged as a powerful paradigm for visual generation, largely due to their scalability, computational efficiency, and unified architecture with language and vision. Among them, next-scale-prediction Visual Autoregressive Generation (VAR) has recently demonstrated remarkable performance, even surpassing diffusion-based models. In this work, we revisit VAR and uncover a theoretical insight: when equipped with a Markovian attention mask, VAR is mathematically equivalent to a discrete diffusion process. We term this reinterpretation Scalable Visual Refinement with Discrete Diffusion (SRDD), establishing a principled bridge between AR transformers and diffusion models. Leveraging this new perspective, we show how one can directly import the advantages of diffusion, such as iterative refinement, into VAR and reduce architectural inefficiencies, yielding faster convergence, lower inference cost, and improved zero-shot reconstruction. Across multiple datasets, we show that the diffusion-based perspective of VAR leads to consistent gains in efficiency and generation.
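The Markovian attention mask in the abstract can be sketched directly: tokens of scale k are allowed to attend only to tokens of the previous scale k-1 (and to their own scale), so each scale's generation depends on the last one alone, mirroring a discrete diffusion chain. The scale sizes below are toy values, not the paper's configuration.

```python
import numpy as np

# Toy scale sizes: 1 token at the coarsest scale, then 4, then 9.
scales = [1, 4, 9]
ids = np.concatenate([[k] * n for k, n in enumerate(scales)])

# mask[i, j] = True  ->  token i may attend to token j.
# A token attends to its own scale and to the immediately preceding scale,
# giving the Markovian (scale-to-scale) dependency structure.
mask = (ids[:, None] == ids[None, :]) | (ids[:, None] == ids[None, :] + 1)
```

Under this mask, scale 2 tokens never see scale 0 directly, which is the property that collapses the full autoregressive history into a one-step transition.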


SDD: Self-Degraded Defense against Malicious Fine-tuning

Chen, Zixuan, Lu, Weikai, Lin, Xin, Zeng, Ziqian

arXiv.org Artificial Intelligence

Open-source Large Language Models (LLMs) often employ safety alignment methods to resist harmful instructions. However, recent research shows that maliciously fine-tuning these LLMs on harmful data can easily bypass these safeguards. To counter this, we theoretically uncover why malicious fine-tuning succeeds and identify potential defense strategies. Building on the theoretical analysis, we introduce the Self-Degraded Defense (SDD) framework. SDD encourages LLMs to produce high-quality but irrelevant responses to harmful prompts. When attackers attempt malicious fine-tuning, the general capability of the LLM aligned by SDD will significantly decrease, rendering it incapable of following harmful instructions. Our experimental results confirm SDD's effectiveness against such attacks.


Machine-Precision Prediction of Low-Dimensional Chaotic Systems

Schötz, Christof, Boers, Niklas

arXiv.org Artificial Intelligence

Low-dimensional chaotic systems such as the Lorenz-63 model are commonly used to benchmark system-agnostic methods for learning dynamics from data. Here we show that learning from noise-free observations in such systems can be achieved up to machine precision: using ordinary least squares regression on high-degree polynomial features with 512-bit arithmetic, our method exceeds the accuracy of standard 64-bit numerical ODE solvers of the true underlying dynamical systems. Depending on the configuration, we obtain valid prediction times of 32 to 105 Lyapunov times for the Lorenz-63 system, dramatically outperforming prior work that reaches 13 Lyapunov times at most. We further validate our results on Thomas' Cyclically Symmetric Attractor, a non-polynomial chaotic system that is considerably more complex than the Lorenz-63 model, and show that similar results extend also to higher dimensions using the spatiotemporally chaotic Lorenz-96 model. Our findings suggest that learning low-dimensional chaotic systems from noise-free data is a solved problem.
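The core recipe (ordinary least squares on polynomial features of noise-free states) can be sketched in a few lines. The paper uses 512-bit arithmetic and trajectory data; the sketch below uses plain float64 and fits the exact Lorenz-63 derivatives at sampled states with degree-2 polynomial features, which suffices to show why the approach can be exact: the Lorenz-63 right-hand side is itself a degree-2 polynomial, so OLS recovers it up to rounding. Parameter values are the standard sigma=10, rho=28, beta=8/3.

```python
import numpy as np

rng = np.random.default_rng(0)

def lorenz_rhs(x, y, z, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Exact Lorenz-63 vector field (a degree-2 polynomial in x, y, z).
    return np.array([sigma * (y - x), x * (rho - z), x * y - beta * z])

# Noise-free samples of the state and the corresponding derivatives.
X = rng.uniform(-20, 20, size=(200, 3))
D = np.array([lorenz_rhs(*p) for p in X])

def features(p):
    # Degree-2 polynomial feature map: 1, x, y, z, x^2, xy, xz, y^2, yz, z^2.
    x, y, z = p
    return np.array([1, x, y, z, x * x, x * y, x * z, y * y, y * z, z * z])

Phi = np.array([features(p) for p in X])
coef, *_ = np.linalg.lstsq(Phi, D, rcond=None)

# Since the true dynamics lie exactly in the feature span, the residual is
# limited only by floating-point rounding.
residual = np.abs(Phi @ coef - D).max()
```

In the paper's setting the same idea is pushed much further with 512-bit arithmetic, which is what lets the learned model out-predict 64-bit ODE solvers of the true system.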


Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models

Wang, Ya, Zhuo, Zhijian, Zeng, Yutao, Zhou, Xun, Yang, Jian, Li, Xiaoqing

arXiv.org Artificial Intelligence

Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing gradient explosion and dissipation. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
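A minimal sketch of the decoupling idea, as read from the abstract (not the authors' exact formulation): the fully-connected layer uses only the normalized "distribution" of its weight matrix, while a separate learnable per-output scale vector carries the magnitude. The forward pass then becomes insensitive to the raw scale of the weights.

```python
import numpy as np

rng = np.random.default_rng(42)

def sdd_linear(x, W, scale, eps=1e-6):
    # Distribution part: normalize each output row of W to unit L2 norm.
    W_hat = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)
    # Scale part: a learnable per-output vector reintroduces magnitude.
    return scale * (x @ W_hat.T)

x = rng.normal(size=(4, 16))
W = rng.normal(size=(8, 16)) * 100.0   # deliberately badly scaled weights
scale = np.ones(8)

y = sdd_linear(x, W, scale)
# Rescaling the raw weights leaves the output essentially unchanged,
# since only their normalized distribution enters the computation.
y_rescaled = sdd_linear(x, W * 1e3, scale)
```

Because gradient magnitude now flows through the explicit `scale` vector rather than the weight norms, drift in weight scale cannot by itself blow up or dissipate activations, which is the stabilization mechanism the abstract describes.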


Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics

Sanokowski, Sebastian, Berghammer, Wilhelm, Ennemoser, Martin, Wang, Haoyu Peter, Hochreiter, Sepp, Lehner, Sebastian

arXiv.org Machine Learning

Learning to sample from complex unnormalized distributions over discrete domains has emerged as a promising research direction with applications in statistical physics, variational inference, and combinatorial optimization. Recent work has demonstrated the potential of diffusion models in this domain. However, existing methods face limitations in memory scaling, and thus in the number of attainable diffusion steps, since they require backpropagation through the entire generative process. To overcome these limitations, we introduce two novel training methods for discrete diffusion samplers, one grounded in the policy gradient theorem and the other leveraging Self-Normalized Neural Importance Sampling (SN-NIS). These methods yield memory-efficient training and achieve state-of-the-art results in unsupervised combinatorial optimization. Numerous scientific applications additionally require the ability to draw unbiased samples. We introduce adaptations of SN-NIS and Neural Markov Chain Monte Carlo that enable, for the first time, the application of discrete diffusion models to this problem. We validate our methods on Ising model benchmarks and find that they outperform popular autoregressive approaches. Our work opens new avenues for applying diffusion models to a wide range of scientific applications in discrete domains that were hitherto restricted to exact likelihood models.
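The "SN" in SN-NIS is ordinary self-normalized importance sampling, which the sketch below illustrates on a deliberately tiny problem: estimating a spin correlation under an unnormalized two-spin Ising-like target, using a uniform proposal. The target, proposal, and coupling value are toy assumptions; in the paper the proposal is a learned discrete diffusion model rather than a uniform distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_tilde(s):
    # Unnormalized target: exp(J * s0 * s1) with coupling J = 0.5,
    # spins in {-1, +1}. The normalizing constant is never needed.
    return np.exp(0.5 * s[:, 0] * s[:, 1])

n = 200_000
samples = rng.choice([-1, 1], size=(n, 2))   # uniform proposal q
w = p_tilde(samples)                          # importance weights p~/q (q constant)
f = samples[:, 0] * samples[:, 1]             # observable: spin correlation

# Self-normalized estimator: the unknown partition function cancels.
estimate = np.sum(w * f) / np.sum(w)

# Exact value for this 2-spin model: E[s0 * s1] = tanh(0.5).
exact = np.tanh(0.5)
```

Self-normalization is what makes the estimator usable with an unnormalized target, at the cost of a small bias that vanishes as the sample count grows; the paper's contribution is making the proposal a discrete diffusion model trained without backpropagating through every step.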


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

Abstract: The paper introduces LearnSDD, an algorithm that learns log-linear models for discrete random variables but adds a penalty term for models that are expensive at query time. Compared to earlier work in this direction, the paper studies a new way of describing models (SDDs instead of ACs) and is interested in "complex queries". The computational complexity of complex queries is not directly addressed in the algorithm, but as it turns out, the choice of SDDs as the model space also gives good run-time performance for certain complex queries (Theorem 1). Quality: there are no obvious errors, but some definitions in the proof are missing, and some key elements of the algorithm are not motivated or discussed (see comments below). Clarity: the presentation is good enough, but can be improved.


Sparse Data Generation Using Diffusion Models

Ostheimer, Phil, Nagda, Mayank, Kloft, Marius, Fellenz, Sophie

arXiv.org Artificial Intelligence

Despite significant advances in generative modeling, a critical gap remains in developing models explicitly designed for sparse data. Directly generating sparse data ensures that models learn realistic structures and distributions, preserving meaningful relationships that thresholding dense data would distort. Sparse data is crucial for applications like data augmentation, where realistic but varied samples improve model robustness, and compressed representations. SDD extends continuous state-space diffusion models by explicitly modeling sparsity through the introduction of Sparsity Bits. Empirical validation on image data from various domains, including two scientific applications (physics and biology), demonstrates that SDD achieves high fidelity in representing data sparsity while preserving the quality of the generated data.
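One plausible reading of the "Sparsity Bits" idea (an assumption based on the abstract, not the authors' exact construction): each entry of a sparse array is represented by a binary bit marking whether it is exactly zero, plus a continuous value channel used only where the bit is on. Generating the bits explicitly lets a model place exact zeros rather than thresholding near-zero continuous values.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy sparse data: a small array with roughly 70% exact zeros.
dense = rng.normal(size=(8, 8))
dense[rng.random((8, 8)) < 0.7] = 0.0

# Decompose into a binary sparsity mask plus a continuous value channel.
bits = (dense != 0.0).astype(np.float64)   # "sparsity bits": 1 where nonzero
values = dense                             # continuous channel

# Recomposition recovers the exact zeros from the bits alone.
reconstructed = bits * values
```

The point of the decomposition is that a generative model producing `(bits, values)` pairs can emit exact zeros by construction, which thresholding a purely continuous output cannot guarantee.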