A Proofs

Neural Information Processing Systems

As mentioned in Sections 3 and 4, our dataset D contains the random perturbation vectors ξ and side information ψ.

B.1 Loss function

Mathematically, the conditional total variation loss function (11) can be written explicitly in terms of ψ. Autoencoders are trained to learn lower-dimensional data representations at the bottleneck of the network. They can learn representations in a fully unsupervised way, which makes them suitable for the task at hand. The decoder is a mirrored version of the encoder. In both the encoder and the decoder, the input layer is fully connected to the output layer through an intermediate ReLU activation layer.
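
A minimal sketch of the autoencoder just described, assuming one hidden layer per side; input_dim, hidden_dim, and bottleneck_dim are illustrative placeholders rather than values from the paper:

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        """Fully connected autoencoder with a mirrored decoder."""

        def __init__(self, input_dim=128, hidden_dim=64, bottleneck_dim=16):
            super().__init__()
            # Encoder: input layer fully connected to the bottleneck,
            # with an intermediate ReLU activation layer.
            self.encoder = nn.Sequential(
                nn.Linear(input_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, bottleneck_dim),
            )
            # Decoder: a mirrored version of the encoder.
            self.decoder = nn.Sequential(
                nn.Linear(bottleneck_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, input_dim),
            )

        def forward(self, x):
            z = self.encoder(x)      # low-dimensional representation
            return self.decoder(z)   # reconstruction

    model = Autoencoder()
    x = torch.randn(32, 128)
    loss = ((model(x) - x) ** 2).mean()  # reconstruction loss, no labels needed

Training against the reconstruction loss requires no labels, which matches the fully unsupervised setting described above.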


Optimal-state Dynamics Estimation for Physics-based Human Motion Capture from Videos

Neural Information Processing Systems

Human motion capture from monocular videos has made significant progress in recent years. However, modern approaches often produce temporal artifacts, e.g., in the form of jittery motion, and struggle to achieve smooth and physically plausible motions. Explicitly integrating physics, in the form of internal forces and exterior torques, helps alleviate these artifacts. Current state-of-the-art approaches use an automatic PD controller to predict torques and reaction forces in order to re-simulate the input kinematics, i.e., the joint angles of a predefined skeleton. However, due to imperfect physical models, these methods often require simplifying assumptions and extensive preprocessing of the input kinematics to achieve good performance.
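
To make the PD-control step concrete, here is a minimal sketch of how a proportional-derivative controller turns the gap between the target kinematics and the simulated state into joint torques; the gains kp and kd and all numeric values are hypothetical, not taken from the paper:

    import numpy as np

    def pd_torques(q_target, q, qdot_target, qdot, kp=300.0, kd=30.0):
        """Proportional-derivative control for joint torques.

        q_target, qdot_target: desired joint angles/velocities (input kinematics).
        q, qdot: current simulated joint angles/velocities.
        kp, kd: illustrative gains, not values from the paper.
        """
        return kp * (q_target - q) + kd * (qdot_target - qdot)

    # Example: drive a 3-joint simulated pose toward the estimated kinematics.
    tau = pd_torques(
        q_target=np.array([0.1, -0.4, 0.8]),
        q=np.array([0.0, -0.3, 0.7]),
        qdot_target=np.zeros(3),
        qdot=np.array([0.05, 0.0, -0.02]),
    )
    print(tau)  # torques fed back into the physics simulation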


DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity

Neural Information Processing Systems

Warm-starting neural network training by initializing networks with previously learned weights is appealing, as practical neural networks are often deployed under a continuous influx of new data. However, it often leads to loss of plasticity, where the network loses its ability to learn new information, resulting in worse generalization than training from scratch. This occurs even under stationary data distributions, and its underlying mechanism is poorly understood. We develop a framework emulating real-world neural network training and identify noise memorization as the primary cause of plasticity loss when warm-starting on stationary data. Motivated by this, we propose Direction-Aware SHrinking (DASH), a method aiming to mitigate plasticity loss by selectively forgetting memorized noise while preserving learned features.
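
As an illustration of the warm-starting setup being studied (not the DASH update itself), a small sketch under assumed names: the warm-started model continues from previously learned weights, while the cold-started baseline trains the same architecture from scratch on the combined data:

    import copy
    import torch
    import torch.nn as nn

    def train(model, data, epochs=5, lr=1e-2):
        """Hypothetical full-batch training loop (cross-entropy, SGD)."""
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        x, y = data
        for _ in range(epochs):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        return model

    def make_model():
        return nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

    # Stationary setting: new data arrives from the same distribution.
    old_x, old_y = torch.randn(100, 10), torch.randint(0, 2, (100,))
    new_x, new_y = torch.randn(100, 10), torch.randint(0, 2, (100,))
    all_data = (torch.cat([old_x, new_x]), torch.cat([old_y, new_y]))

    model_old = train(make_model(), (old_x, old_y))
    model_warm = train(copy.deepcopy(model_old), all_data)  # warm start
    model_cold = train(make_model(), all_data)              # from scratch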


Pessimism Meets Invariance: Provably Efficient Offline Mean-Field Multi-Agent RL
Minshuo Chen, Yan Li, Ethan Wang, Zhuoran Yang

Neural Information Processing Systems

Mean-Field Multi-Agent Reinforcement Learning (MF-MARL) is attractive in applications involving a large population of homogeneous agents, as it exploits the permutation invariance of agents and avoids the curse of many agents. Most existing results focus only on online settings, in which agents can interact with the environment during training. In some applications, such as social welfare optimization, however, interaction during training can be prohibitive or even unethical in societal systems. To bridge this gap, we propose SAFARI (peSsimistic meAn-Field vAlue iteRatIon), an algorithm for offline MF-MARL that requires only a small amount of pre-collected experience data.
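
For intuition, a tabular sketch of the pessimism principle such offline algorithms build on, not SAFARI itself (which operates on mean-field quantities): the Bellman backup is penalized by an uncertainty term that grows where the offline data is scarce. The penalty coefficient beta and all inputs are hypothetical:

    import numpy as np

    def pessimistic_value_iteration(P_hat, R_hat, counts, gamma=0.9,
                                    beta=1.0, iters=100):
        """Tabular value iteration with a count-based pessimism penalty.

        P_hat:  (S, A, S) empirical transition probabilities from offline data.
        R_hat:  (S, A)    empirical rewards.
        counts: (S, A)    number of offline samples per state-action pair.
        beta:   hypothetical penalty coefficient.
        """
        S, A = R_hat.shape
        V = np.zeros(S)
        for _ in range(iters):
            # Penalize poorly covered state-action pairs (pessimism).
            bonus = beta / np.sqrt(np.maximum(counts, 1))
            Q = R_hat - bonus + gamma * (P_hat @ V)  # shape (S, A)
            V = Q.max(axis=1)
        return Q

    # Synthetic offline estimates with 3 states and 2 actions.
    rng = np.random.default_rng(0)
    P_hat = rng.dirichlet(np.ones(3), size=(3, 2))
    Q = pessimistic_value_iteration(P_hat, rng.random((3, 2)),
                                    counts=rng.integers(1, 50, (3, 2)))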


TinyTTA: Efficient Test-Time Adaptation via Early-Exit Ensembles on Edge Devices

Neural Information Processing Systems

The increased adoption of Internet of Things (IoT) devices has led to the generation of large data streams with applications in healthcare, sustainability, and robotics. In some cases, deep neural networks have been deployed directly on these resource-constrained units to limit communication overhead, increase efficiency and privacy, and enable real-time applications. However, a common challenge in this setting is the continuous adaptation of models necessary to accommodate changing environments, i.e., data distribution shifts. Test-time adaptation (TTA) has emerged as one potential solution, but its validity has yet to be explored in resource-constrained hardware settings, such as those involving microcontroller units (MCUs). TTA on constrained devices generally suffers from i) memory overhead due to full backpropagation through a large pre-trained network, ii) lack of support for normalization layers on MCUs, and iii) either memory exhaustion with the large batch sizes required for updating or poor performance with small batch sizes. In this paper, we propose TinyTTA to enable, for the first time, efficient TTA on constrained devices with limited memory. To address these memory constraints, we introduce a novel self-ensemble and batch-agnostic early-exit strategy for TTA, which enables continuous adaptation with small batch sizes for reduced memory usage, handles distribution shifts, and improves latency efficiency. Moreover, we develop the TinyTTA Engine, a first-of-its-kind MCU library that enables on-device TTA.
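
A minimal sketch of the early-exit self-ensemble idea, as an illustration rather than the TinyTTA Engine itself: intermediate classifiers attached to a small backbone let an input stop early once a prediction is confident, and the exits that were reached are averaged. The layer sizes and the confidence threshold are hypothetical:

    import torch
    import torch.nn as nn

    class EarlyExitNet(nn.Module):
        """Backbone with intermediate exits forming a self-ensemble."""

        def __init__(self, dim=32, num_classes=10, num_blocks=3):
            super().__init__()
            self.blocks = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                 for _ in range(num_blocks)]
            )
            self.exits = nn.ModuleList(
                [nn.Linear(dim, num_classes) for _ in range(num_blocks)]
            )

        def forward(self, x, threshold=0.9):
            logits_all = []
            for block, exit_head in zip(self.blocks, self.exits):
                x = block(x)
                logits = exit_head(x)
                logits_all.append(logits)
                # Stop early once confident (check assumes batch size 1).
                if logits.softmax(-1).max() >= threshold:
                    break
            # Self-ensemble: average the predictions of the exits reached.
            return torch.stack(logits_all).mean(0)

    net = EarlyExitNet()
    out = net(torch.randn(1, 32))  # adaptation-friendly small batch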


A Proofs

Neural Information Processing Systems

The theoretical guarantees on the learning result, the sample complexity analysis, and the ability to recover individual-level mixture parameters constitute the unique contributions of this work. One limitation of our framework concerns data collection: for each decision maker, we require a series of historical choice data over a fixed set of options in order to obtain a reliable empirical cumulative distribution of their choice probability vector. However, this requirement is not unique to our framework; EM requires the same type of data.
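
To make the data requirement concrete, a small sketch under assumed names of how an empirical choice probability vector could be formed from one decision maker's historical choices over a fixed option set:

    import numpy as np

    def empirical_choice_probs(choices, num_options):
        """Empirical choice probability vector for one decision maker.

        choices: sequence of chosen option indices from historical data.
        """
        counts = np.bincount(choices, minlength=num_options)
        return counts / counts.sum()

    # A decision maker repeatedly choosing among 4 fixed options.
    history = [0, 2, 2, 1, 2, 0, 3, 2]
    print(empirical_choice_probs(history, num_options=4))
    # -> [0.25  0.125 0.5   0.125]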


Exploiting Activation Sparsity with Dense to Dynamic-k Mixture-of-Experts Conversion
Bartosz Wójcik, Mikołaj Piórczyński
IDEAS NCBR, Warsaw University of Technology, Jagiellonian University

Neural Information Processing Systems

Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this process remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance of the number of activated neurons for different inputs, we introduce a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wallclock speedup. The proposed method, Dense to Dynamic-k Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
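
A sketch in the spirit of the dynamic-k rule described above (illustrative, not the exact D2DMoE mechanism): rather than a fixed top-k, each token executes only the experts whose routing probability exceeds a threshold, always keeping at least the best one, so the number of executed experts varies per token. The threshold tau is hypothetical:

    import torch

    def dynamic_k_select(router_scores, tau=0.2):
        """Per-token dynamic-k expert selection.

        router_scores: (num_tokens, num_experts) gating scores.
        Returns a boolean mask of experts to execute for each token.
        """
        probs = router_scores.softmax(dim=-1)
        mask = probs > tau                        # experts above threshold
        best = probs.argmax(dim=-1, keepdim=True)
        mask.scatter_(1, best, True)              # keep at least one expert
        return mask

    scores = torch.randn(4, 8)  # 4 tokens, 8 experts
    mask = dynamic_k_select(scores)
    print(mask.sum(dim=-1))     # executed experts vary per token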


Rotation-Invariant Local-to-Global Representation Learning for 3D Point Cloud (Supplementary Document)

Neural Information Processing Systems

Our model is implemented in TensorFlow. For the 3D object classification experiments, the learning rate is 0.001 and the batch size is 32. In each hierarchy, the number of neighbors for the stochastic dilated k-NNs, denoted by k, is sampled from one of the following three uniform distributions, respectively: U(32, 96), U(16, 48), and U(8, 24). The dilation rate employed in the first descriptor extraction stage is sampled from U(2, 4). Meanwhile, the number of edges in the graph, denoted by k̂, is sampled from U(8, 24), U(4, 12), and U(2, 8), respectively.
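
The stochastic sampling described above can be sketched as follows, treating each U(a, b) as a discrete uniform distribution over integers (an assumption of this sketch, since k and k̂ are integer-valued):

    import random

    # Per-hierarchy ranges quoted in the text.
    K_RANGES    = [(32, 96), (16, 48), (8, 24)]  # neighbors k for dilated k-NNs
    KHAT_RANGES = [(8, 24), (4, 12), (2, 8)]     # edges k-hat in the graph
    DILATION    = (2, 4)                         # first descriptor stage only

    def sample_hyperparams():
        """Resample the stochastic k-NN hyperparameters for one pass."""
        ks       = [random.randint(lo, hi) for lo, hi in K_RANGES]
        khats    = [random.randint(lo, hi) for lo, hi in KHAT_RANGES]
        dilation = random.randint(*DILATION)
        return ks, khats, dilation

    print(sample_hyperparams())  # e.g. ([61, 30, 14], [17, 7, 5], 3)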


Rotation-Invariant Local-to-Global Representation Learning for 3D Point Cloud

Neural Information Processing Systems

We propose a local-to-global representation learning algorithm for 3D point cloud data, which is suitable for handling various geometric transformations, especially rotation, without explicit data augmentation with respect to those transformations. Our model takes advantage of multi-level abstraction based on graph convolutional neural networks, constructing a descriptor hierarchy that encodes rotation-invariant shape information of an input object in a bottom-up manner. The descriptors at each level are obtained from a graph-based neural network via stochastic sampling of 3D points, which is effective in making the learned representations robust to variations in the input data. The proposed algorithm achieves state-of-the-art performance on rotation-augmented 3D object recognition and segmentation benchmarks. We further analyze its characteristics through comprehensive ablation experiments.


Beyond Accuracy: Ensuring Correct Predictions With Correct Rationales

Neural Information Processing Systems

Large pretrained foundation models demonstrate exceptional performance and, in some high-stakes applications, even surpass human experts. However, most of these models are currently evaluated primarily on prediction accuracy, overlooking the validity of the rationales behind their accurate predictions. For the safe deployment of foundation models, there is a pressing need to ensure double-correct predictions, i.e., correct predictions backed by correct rationales. To achieve this, we propose a two-phase scheme: first, we curate a new dataset that offers structured rationales for visual recognition tasks; second, we propose a rationale-informed optimization method that guides the model in disentangling and localizing the visual evidence for each rationale, without requiring manual annotations. Extensive experiments and ablation studies demonstrate that our model outperforms state-of-the-art models by up to 10.1% in prediction accuracy across a wide range of tasks. Furthermore, our method significantly improves the model's rationale correctness, boosting localization by 7.5% and disentanglement by 36.5%.