
d39fb2054215f07d1f90cc80c7a85edd-Paper-Conference.pdf

Neural Information Processing Systems

Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by randomly drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation): a canonical testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first canonical case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
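
To make the Guess & Check procedure described above concrete, here is a minimal sketch for a depth-2 linear matrix factorization. Everything in it is an illustrative assumption rather than the paper's setup: the function names, the standard Gaussian prior over weights, the tolerance-based notion of "fits the training data", and the tiny 2x2 completion task.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def guess_and_check(observed, n, width, tol, rng, max_tries=100_000):
    """Draw random factors until W2 @ W1 fits every observed entry within tol.

    observed maps (row, col) -> target value; the unobserved entries play the
    role of test data. Returns (product, attempts), or (None, max_tries) if no
    draw fit the observations.
    """
    for attempt in range(1, max_tries + 1):
        # Assumed prior: i.i.d. standard Gaussian weights (hypothetical choice).
        w1 = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(width)]
        w2 = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(n)]
        product = matmul(w2, w1)
        if all(abs(product[i][j] - v) <= tol for (i, j), v in observed.items()):
            return product, attempt
    return None, max_tries

# Toy task: only the diagonal of a 2x2 matrix is observed.
observed = {(0, 0): 1.0, (1, 1): 1.0}
product, attempts = guess_and_check(
    observed, n=2, width=2, tol=0.3, rng=random.Random(0)
)
```

No gradient is ever computed: the only feedback is whether a random draw fits the observed entries. Raising `width` in this sketch changes the distribution the accepted solution is drawn from, which is the quantity the abstract's width result concerns.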


Reparameterized LLM Training via Orthogonal Equivalence Transformation

Neural Information Processing Systems

While large language models (LLMs) are driving the rapid advancement of artificial intelligence, effectively and reliably training these large models remains one of the field's most significant challenges. To address this challenge, we propose POET, a novel reParameterized training algorithm that uses Orthogonal Equivalence Transformation to optimize neurons. Specifically, POET reparameterizes each neuron with two learnable orthogonal matrices and a fixed random weight matrix. Because of its provable preservation of spectral properties of weight matrices, POET can stably optimize the objective function with improved generalization. We further develop efficient approximations that make POET flexible and scalable for training large-scale neural networks.
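
The spectral-preservation claim can be checked on a small example. The abstract does not spell out how the three matrices are combined; the sketch below assumes the effective weight is the product R W0 P, restricts R and P to 2x2 rotations (general orthogonal matrices also include reflections), and uses a closed-form 2x2 singular value formula. All names are hypothetical.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, a simple example of an orthogonal matrix."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def singular_values_2x2(m):
    """Closed form: s1^2 + s2^2 = ||M||_F^2 and s1 * s2 = |det M|."""
    frob_sq = sum(x * x for row in m for x in row)
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(frob_sq ** 2 - 4.0 * det ** 2, 0.0))
    return (math.sqrt((frob_sq + disc) / 2.0),
            math.sqrt(max((frob_sq - disc) / 2.0, 0.0)))

def reparameterized_weight(theta_r, theta_p, w0):
    """Assumed form W = R @ W0 @ P; only the two angles would be learned."""
    return matmul(matmul(rotation(theta_r), w0), rotation(theta_p))

w0 = [[0.5, -1.2], [2.0, 0.3]]          # fixed weight matrix (arbitrary values)
before = singular_values_2x2(w0)
after = singular_values_2x2(reparameterized_weight(0.7, -1.9, w0))
```

Under these assumptions `before` and `after` agree up to floating-point error for any choice of angles, because multiplying by orthogonal matrices on either side leaves singular values unchanged.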


Appendix A More Details

Neural Information Processing Systems

Score 0 (normal) is most common across cohorts, while score 3 (severe) is rare--especially in PD-GaM and 3DGait, highlighting class imbalance challenges. BMCLab offers a balanced ON/OFF medication split, while E-LC is skewed toward ON-medication. DNE includes healthy, Parkinsonian, and other disease groups for broader contrastive training. Figure A.3 shows label distributions for FoG-related cohorts. This artifact likely stems from the unusual top-down perspective--different from the front-facing or side views seen in WHAM's training data [1]. While motion encoder-based models may be robust to such distortions, feature-based gait classifiers rely on precise kinematic measurements and thus require carefully corrected input data. To correct this slope artifact, we perform a frame-wise rigid alignment of the reconstructed SMPL skeleton using the Kabsch algorithm [2]. The goal is to rotate each frame so that anatomical directions align with canonical coordinate axes (up, forward), while preserving natural gait structure. This motion vector is then projected onto the ground plane (xz-plane) and used as the walking axis. In frames where the sacrum displacement is less than 4 mm--indicating near-stationary posture--we fall back on a proxy direction: the cross product of the hip vector (left hip to right hip) and the vertical vector.
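
The walking-axis choice described at the end of this passage can be sketched as follows. This covers only the direction selection, not the Kabsch alignment itself. Assumptions: coordinates are in millimetres with y as the vertical axis, the displacement test is applied to the ground-plane component of the sacrum motion, and the threshold is passed in by the caller; all function and parameter names are hypothetical.

```python
import math

UP = (0.0, 1.0, 0.0)  # assumed vertical axis, so the ground plane is xz

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    """Scale a 3D vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(x / norm for x in v)

def walking_axis(sacrum_prev, sacrum_next, left_hip, right_hip, min_displacement):
    """Return a unit walking direction in the ground plane."""
    # Project the sacrum motion onto the xz-plane by dropping the y component.
    ground_motion = (sacrum_next[0] - sacrum_prev[0], 0.0,
                     sacrum_next[2] - sacrum_prev[2])
    if math.sqrt(sum(x * x for x in ground_motion)) >= min_displacement:
        return normalize(ground_motion)
    # Near-stationary frame: fall back to hip vector x vertical vector.
    hip = tuple(r - l for l, r in zip(left_hip, right_hip))
    return normalize(cross(hip, UP))

left_hip, right_hip = (-100.0, 900.0, 0.0), (100.0, 900.0, 0.0)
moving = walking_axis((0.0, 900.0, 0.0), (0.0, 902.0, 50.0),
                      left_hip, right_hip, min_displacement=4.0)
standing = walking_axis((0.0, 900.0, 0.0), (0.0, 900.0, 1.0),
                        left_hip, right_hip, min_displacement=4.0)
```

Whether the fallback points forward or backward depends on the handedness of the coordinate system, which this excerpt does not state.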


How Data Mixing Shapes In-Context Learning: Asymptotic Equivalence for Transformers with MLPs

Neural Information Processing Systems

Pretrained Transformers demonstrate remarkable in-context learning (ICL) capabilities, enabling them to adapt to new tasks from demonstrations without parameter updates. However, theoretical studies often rely on simplified architectures (e.g., omitting MLPs), plain data models (e.g., linear regression with isotropic inputs), and single-source training--limiting their relevance to realistic settings. In this work, we study ICL in pretrained Transformers with nonlinear MLP heads on nonlinear tasks drawn from multiple data sources with heterogeneous input, task, and noise distributions. We analyze a model where the MLP comprises two layers, with the first layer trained via a single gradient step and the second layer fully optimized. Under high-dimensional asymptotics, we prove that such models are equivalent in ICL error to structured polynomial predictors, leveraging results from the theory of Gaussian universality and orthogonal polynomials. This equivalence reveals that nonlinear MLPs meaningfully enhance ICL performance--particularly on nonlinear tasks--compared to linear baselines.



Angular Constraint Embedding via SpherePair Loss for Constrained Clustering

Neural Information Processing Systems

However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at our repository.


Tensor Product Attention Is All You Need

Neural Information Processing Systems

Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with RoPE and any possible position encoding mechanisms, TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor ProducT ATTenTion Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models.


A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity

Neural Information Processing Systems

Multimodal learning plays a pivotal role in advancing artificial intelligence systems by incorporating information from multiple modalities to build a more comprehensive representation. Despite its importance, current state-of-the-art models still suffer from severe limitations that prevent the successful development of a fully multimodal model. Such methods may not provide indicators that all the involved modalities are effectively aligned. As a result, some modalities may not be aligned, undermining the effectiveness of the model in downstream tasks where multiple modalities should provide additional information that the model fails to exploit. In this paper, we present TRIANGLE: TRI-modAl Neural Geometric LEarning, a novel similarity measure computed directly in the higher-dimensional space spanned by the modality embeddings. TRIANGLE improves the joint alignment of three modalities via a triangle-area similarity, avoiding additional fusion layers or pairwise similarities. When incorporated in contrastive losses in place of cosine similarity, TRIANGLE significantly boosts the performance of multimodal modeling, while yielding interpretable alignment rationales. Extensive evaluation in three-modal tasks such as video-text and audio-text retrieval or audio-video classification demonstrates that TRIANGLE achieves state-of-the-art results across different datasets, improving on cosine-based methods by up to 9 points of Recall@1.
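
One plausible reading of a triangle-area similarity is the area of the triangle whose vertices are the three modality embeddings. The sketch below illustrates that reading and is not the paper's definition: normalizing the embeddings first, treating a smaller area as better alignment, and all function names are assumptions.

```python
import math

def normalize(v):
    """Scale an embedding to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def triangle_area(a, b, c):
    """Area of the triangle with vertices at three unit-normalized embeddings.

    With edge vectors u = b - a and v = c - a, the area is
    0.5 * sqrt(|u|^2 |v|^2 - (u . v)^2), valid in any dimension.
    """
    a, b, c = normalize(a), normalize(b), normalize(c)
    u = [y - x for x, y in zip(a, b)]
    v = [y - x for x, y in zip(a, c)]
    uu = sum(x * x for x in u)
    vv = sum(x * x for x in v)
    uv = sum(x * y for x, y in zip(u, v))
    return 0.5 * math.sqrt(max(uu * vv - uv * uv, 0.0))

# Three embeddings pointing the same way: the triangle collapses to a point.
aligned = triangle_area([1.0, 2.0, 2.0], [2.0, 4.0, 4.0], [0.5, 1.0, 1.0])
# Three mutually orthogonal embeddings: area sqrt(3) / 2.
spread = triangle_area([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
```

A single scalar of this kind depends on all three embeddings at once, whereas cosine similarity has to be evaluated pair by pair.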


Correlation Dimension of Auto-Regressive Large Language Models

Neural Information Processing Systems

Large language models (LLMs) have achieved remarkable progress in natural language generation, yet they continue to display puzzling behaviors--such as repetition and incoherence--even when exhibiting low perplexity. This highlights a key limitation of conventional evaluation metrics, which emphasize local prediction accuracy while overlooking long-range structural complexity. We introduce correlation dimension, a fractal-geometric measure of self-similarity, to quantify the epistemological complexity of text as perceived by a language model.
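
For readers unfamiliar with the measure, correlation dimension is commonly estimated from the correlation integral C(r), the fraction of point pairs closer than r, as the slope of log C(r) against log r. The sketch below is that generic two-radius estimate on a set of points. The abstract does not say how points are derived from a language model, so the input is simply taken as given, and the function names and radii are illustrative assumptions.

```python
import math

def correlation_integral(points, r):
    """Fraction of distinct point pairs whose distance is at most r."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= r:
                close += 1
    return close / (n * (n - 1) / 2)

def correlation_dimension(points, r_small, r_large):
    """Two-radius slope estimate of log C(r) versus log r.

    Both radii must be large enough that at least one pair falls within them.
    """
    c_small = correlation_integral(points, r_small)
    c_large = correlation_integral(points, r_large)
    return ((math.log(c_large) - math.log(c_small))
            / (math.log(r_large) - math.log(r_small)))

# Sanity check: evenly spaced points on a line should give a value near 1.
line = [(float(i), 0.0) for i in range(200)]
dim_line = correlation_dimension(line, 10.5, 40.5)
```

In practice the slope is usually fitted over a range of radii rather than two; the two-radius version keeps the sketch short.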


FoGE: Fock Space inspired encoding for graph prompting

Neural Information Processing Systems

Recent results show that modern Large Language Models (LLMs) are capable of understanding and answering questions about structured data such as graphs. Existing proposals often use some description of the graph to create an "augmented" prompt fed to the LLM. For a chosen class of graphs, if a well-tailored graph encoder is deployed to play together with a pre-trained LLM, the model can answer graph-related questions well. Current solutions to graph-based prompts range from graph serialization to graph transformers. In this work, we show that the use of a parameter-free graph encoder based on Fock space representations, a concept borrowed from physics, is remarkably versatile in this problem setting. The simple construction, with a few small adjustments, can provide rich and informative graph encodings for a wide range of different graphs. We investigate the use of this idea for prefix-tuned prompts leveraging the capabilities of a pre-trained, frozen LLM. The modifications lead to a model that can answer graph-related questions - from simple graphs to proteins to hypergraphs - effectively and with minimal, if any, adjustments to the architecture.