On the Spectral Structure and Objective Equivalence of Orthogonal Multilabel Fisher Discriminants

arXiv.org Machine Learning

We provide a unified theoretical analysis of Linear Discriminant Analysis with simultaneous multilabel scatter matrix formulations and Stiefel orthogonality constraints. Our contributions span both algebraic structure and statistical guarantees. On the algebraic side, we characterize the rank of the multilabel between-class scatter matrix, showing that the effective discriminant dimensionality can strictly exceed the classical single-label bound of $C-1$; we establish a multilabel partition of variance and prove that all four Fisher objectives are equivalent under the $W^\top S_t^{ML} W = I_r$ constraint while characterizing their divergence under the Stiefel constraint; and we prove a two-sided label-distance preservation bound relating projected distances to Hamming distances in label space. On the statistical side, we establish a finite-sample $O(k_{\max}\sqrt{d\log d/n}/\mathrm{gap}_r)$ bound on the subspace estimation error under sub-Gaussian noise with a matching $\Omega(\sigma^2 d/(n\,\mathrm{gap}_r))$ minimax lower bound, establishing a near-minimax-optimal rate (matching up to logarithmic and $k_{\max}$ factors) for multilabel discriminant subspace estimation. We further provide high-probability distance concentration, robustness guarantees under label interactions, and a regularization analysis preserving the spectral structure when $d \gg n$. All results are verified numerically on synthetic data generated from the linear label-effect model, covering both the algebraic identities and the multilabel-specific quantities ($k_{\max}$, $\kappa(S_t^{ML})$, $\|\Gamma/n\|_2$, $\Delta_r$) that govern the statistical bounds. The numerical experiments are designed as a sanity check for the theorems rather than as an empirical benchmark; evaluation on real multilabel datasets is left to future work targeting application-oriented venues.


Training Details and Model

Neural Information Processing Systems

We set the patch size to 8. Our model is optimized with the AdamW optimizer [3] using a learning rate of 0.0004, 250k training steps, a linear warm-up of 5000 steps, and an exponentially weight-decaying schedule. The gradient norm is clipped at 1. We use PyTorch automatic mixed precision and data parallelism to accelerate training. All models are trained on 4 NVIDIA RTX A5000 GPUs with a total batch size of 128.
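
The schedule and clipping described above can be sketched in plain Python. This is a minimal illustration, not the authors' training code: it assumes the exponential schedule applies to the learning rate after warm-up, and the final decay ratio `FINAL_LR_RATIO` is a hypothetical value that the text does not state. The function names are illustrative as well.

```python
import math

BASE_LR = 4e-4          # learning rate from the text
WARMUP_STEPS = 5_000    # linear warm-up length from the text
TOTAL_STEPS = 250_000   # training steps from the text
MAX_GRAD_NORM = 1.0     # gradient-norm clip from the text
FINAL_LR_RATIO = 0.01   # ASSUMPTION: decay target, not given in the text


def learning_rate(step: int) -> float:
    """Linear warm-up to BASE_LR, then exponential decay (assumed form)."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * FINAL_LR_RATIO ** progress


def clip_by_global_norm(grads, max_norm=MAX_GRAD_NORM):
    """Rescale a flat list of gradient values so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

In a PyTorch setup, the same shape would typically be passed to a scheduler as a multiplicative factor on the base learning rate, with the clip applied between the backward pass and the optimizer step.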


Object centric Cyclic Walks between Parts and Whole

Neural Information Processing Systems

Learning object-centric representations from complex natural environments equips both humans and machines with the ability to reason from low-level perceptual features. To capture the compositional entities of a scene, we propose cyclic walks between perceptual features extracted from vision transformers and object entities. First, a slot-attention module interfaces with these perceptual features and produces a finite set of slot representations. These slots can bind to any object entities in the scene via inter-slot competition for attention. Next, we establish entity-feature correspondence with cyclic walks along paths of high transition probability, based on the pairwise similarity between perceptual features (aka "parts") and slot-bound object representations (aka "whole").
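
One way to picture a cyclic walk is as a round trip through two row-stochastic transition matrices. The sketch below is an assumption-laden toy, not the paper's implementation: it uses dot-product similarity, a softmax with a hypothetical `temperature`, and a whole-to-part-to-whole walk whose round-trip matrix should sit close to the identity when each slot has bound to a distinct group of parts.

```python
import math


def softmax(row, temperature=1.0):
    """Numerically stable softmax over one row of similarities."""
    m = max(row)
    exps = [math.exp((v - m) / temperature) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def transition(src, dst, temperature=1.0):
    """Row i holds the probability of walking from src[i] to each dst[j]."""
    return [softmax([dot(s, d) for d in dst], temperature) for s in src]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def whole_part_whole(slots, parts, temperature=1.0):
    """Round-trip probabilities slot -> part -> slot (slots x slots matrix)."""
    to_parts = transition(slots, parts, temperature)
    to_slots = transition(parts, slots, temperature)
    return matmul(to_parts, to_slots)


# Toy data (hypothetical): two slots, four part features in 2-D.
slots = [[1.0, 0.0], [0.0, 1.0]]
parts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
round_trip = whole_part_whole(slots, parts, temperature=0.1)
```

A training signal could then penalize the gap between `round_trip` and the identity matrix, for example with a cross-entropy term per row; that choice of loss is also an assumption here.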


FedAvg with Fine Tuning: Local Updates Lead to Representation Learning

Neural Information Processing Systems

Federated Learning (FL) [1] provides a communication-efficient and privacy-preserving means to learn from data distributed across clients such as cell phones, autonomous vehicles, and hospitals. FL aims for each client to benefit from collaborating in the learning process without sacrificing data privacy or paying a substantial communication cost. Federated Averaging (FedAvg) [1] is the predominant FL algorithm.
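
For readers unfamiliar with FedAvg, one communication round can be sketched as follows: each client runs a few local gradient steps from the current global model, and the server averages the results weighted by client sample counts. This is a toy under stated assumptions, not the paper's setup: the model is a single scalar weight fit by least squares, and the step size, step count, and client data are hypothetical.

```python
def local_update(w, data, lr=0.05, steps=5):
    """Run a few local gradient steps on one client's (x, y) pairs."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fedavg_round(w_global, clients, lr=0.05, steps=5):
    """Average the locally updated models, weighted by client sample counts."""
    total = sum(len(data) for data in clients)
    return sum(len(data) / total * local_update(w_global, data, lr, steps)
               for data in clients)


# Toy data (hypothetical): both clients observe y = 2x, with different sizes.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(20):
    w = fedavg_round(w, clients)
```

Only the model weight crosses the network in each round; the raw `(x, y)` pairs stay on the clients, which is the source of the communication and privacy benefits described above.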



Mesh-TensorFlow: Deep Learning for Supercomputers

Neural Information Processing Systems

However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and implement, particularly on large clusters.
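
The contrast between the two strategies can be illustrated on a single matrix multiply. The sketch below is plain Python and does not use the Mesh-TensorFlow API; the helper names and the simulated "devices" are assumptions. Batch-splitting shards the rows of the input while every device keeps the full weight matrix, whereas the model-parallel variant shards the weight columns, so no device ever holds all of the weights.

```python
def matmul(x, w):
    """Reference dense product of x (batch x d_in) and w (d_in x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)]
            for row in x]


def chunks(items, n):
    """Split a list into at most n contiguous shards."""
    size = -(-len(items) // n)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def batch_split_matmul(x, w, n_devices):
    """Each simulated device gets a slice of the batch and the FULL weights."""
    out = []
    for x_shard in chunks(x, n_devices):
        out.extend(matmul(x_shard, w))
    return out


def model_split_matmul(x, w, n_devices):
    """Each simulated device gets the full batch and a slice of weight columns."""
    columns = list(zip(*w))
    partials = []
    for col_shard in chunks(columns, n_devices):
        w_shard = [list(r) for r in zip(*col_shard)]
        partials.append(matmul(x, w_shard))
    # Concatenate the per-device output columns back together, row by row.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

Both functions return the same result as `matmul`; the difference is the per-device memory footprint, which is why batch-splitting alone cannot accommodate a weight matrix too large for one device.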


Sample Complexity of Interventional Causal Representation Learning

Neural Information Processing Systems

Consider a data-generation process that transforms low-dimensional latent causally-related variables to high-dimensional observed variables. Causal representation learning (CRL) is the process of using the observed data to recover the latent causal variables and the causal structure among them.