Dynamic Local Regret for Non-convex Online Forecasting

Neural Information Processing Systems

We consider online forecasting problems for non-convex machine learning models. Forecasting introduces several challenges: (i) frequent updates are necessary to cope with concept drift, since the dynamics of the environment change over time, and (ii) state-of-the-art forecasting models are non-convex. We address these challenges with a novel regret framework; standard regret measures typically account for neither a dynamic environment nor non-convex models. We introduce a local regret for non-convex models in a dynamic environment and present an update rule, based on time-smoothed gradients, whose cost under the proposed local regret is sublinear in the time horizon T. Using a real-world dataset, we show that our time-smoothed approach yields several benefits over state-of-the-art competitors: results are more stable against new data, training is more robust to hyperparameter selection, and our approach is more computationally efficient than the alternatives.
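The update rule averages gradients from a window of recent losses instead of using only the most recent one. Below is a minimal sketch of a time-smoothed online gradient step under assumed choices of window length, exponential weighting, and a toy drifting quadratic loss; it illustrates the idea rather than the authors' exact algorithm.

    import numpy as np

    def time_smoothed_step(theta, recent_grads, lr=0.01, alpha=0.9):
        # Exponentially weight the stored gradients (newest last) and take
        # one descent step along the weighted average.
        w = len(recent_grads)
        weights = np.array([alpha ** (w - 1 - i) for i in range(w)])
        weights /= weights.sum()
        smoothed = sum(wi * g for wi, g in zip(weights, recent_grads))
        return theta - lr * smoothed

    # Toy usage: online quadratic losses f_t(theta) = 0.5 * ||theta - c_t||^2
    # with a slowly drifting target c_t; window length 20 is an arbitrary choice.
    rng = np.random.default_rng(0)
    theta, grads = np.zeros(3), []
    for t in range(100):
        c_t = rng.normal(size=3) + 0.01 * t      # drifting target
        grads.append(theta - c_t)                # gradient of the current loss
        grads = grads[-20:]                      # keep a sliding window
        theta = time_smoothed_step(theta, grads)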


50a074e6a8da4662ae0a29edde722179-AuthorFeedback.pdf

Neural Information Processing Systems

REVIEWER 2: Thank you for your encouraging comments.
REVIEWER 3: Thank you for your comments.
REVIEWER 4: Thank you for your comments. "Without some formal notion or even toy scenario for concept drift, it's not clear what theoretical basis there is to prefer ..." Call this the oracle policy. Call this the stale policy.


APPENDIX

Neural Information Processing Systems

Universal approximation for densities is a property often discussed in the context of autoregressive normalizing flows. It can be shown, based on the proof of existence and non-uniqueness of solutions to the nonlinear ICA problem [29], that any distribution can be mapped onto a factorized base distribution by an invertible function with triangular Jacobian, provided that the function class used for this mapping is large enough. Normalizing flows with triangular Jacobians and a high number of parameters therefore have this approximation capacity (see, e.g., ...).
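One reason triangular Jacobians are attractive computationally: the log-determinant in the change-of-variables formula reduces to a sum of logs of the diagonal entries. The snippet below evaluates the exact log-density under a toy elementwise (hence trivially triangular) transform with a standard normal base distribution; the transform itself is a hypothetical illustration, not a trained flow.

    import numpy as np

    def toy_flow_log_density(x, scales):
        # z = scales * x has a diagonal, hence triangular, Jacobian, so
        # log|det J| = sum_i log|scales_i| and the change-of-variables
        # log-density is cheap to evaluate exactly.
        z = scales * x
        log_det = np.sum(np.log(np.abs(scales)))
        log_base = -0.5 * np.sum(z ** 2) - 0.5 * x.size * np.log(2 * np.pi)
        return log_base + log_det

    print(toy_flow_log_density(np.array([0.3, -1.2, 0.7]),
                               scales=np.array([2.0, 0.5, 1.5])))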


Relative gradient optimization of the Jacobian term in unsupervised deep learning
Luigi Gresele

Neural Information Processing Systems

Learning expressive probabilistic models that correctly describe the data is a ubiquitous problem in machine learning. A popular approach is to map the observations into a representation space with a simple joint distribution, which can typically be written as a product of its marginals -- thus drawing a connection with the field of nonlinear independent component analysis. Deep density models have been widely used for this task, but their maximum-likelihood training requires estimating the log-determinant of the Jacobian and is computationally expensive, thus imposing a trade-off between computation and expressive power. In this work, we propose a new approach for exact training of such neural networks. Based on relative gradients, we exploit the matrix structure of neural network parameters to compute updates efficiently even in high-dimensional spaces; the computational cost of training is quadratic in the input size, in contrast with the cubic scaling of naive approaches. This allows fast training with objective functions involving the log-determinant of the Jacobian, without imposing constraints on its structure, in stark contrast to autoregressive normalizing flows.
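The computational trick can be seen already for a single square linear layer: right-multiplying the Euclidean gradient by W^T W (the relative gradient, familiar from ICA) turns the gradient of the log|det W| term, which is W^{-T}, into W itself, so neither a matrix inverse nor a log-determinant needs to be computed. The sketch below uses a toy ICA-style likelihood; the data, nonlinearity, and step size are assumptions for illustration, not the paper's full multi-layer algorithm.

    import numpy as np

    # Toy objective for z = W x over a batch of n samples:
    #   L(W) = log|det W| - (1/n) * sum log cosh(Z)
    # Euclidean gradient: dL/dW = W^{-T} - (1/n) tanh(Z)^T X
    # Relative gradient (right-multiplied by W^T W):
    #   (W^{-T} - (1/n) tanh(Z)^T X) W^T W = W - (1/n) tanh(Z)^T Z W
    rng = np.random.default_rng(0)
    d, n = 5, 200
    X = rng.laplace(size=(n, d))                   # toy super-Gaussian data
    W = np.eye(d) + 0.01 * rng.normal(size=(d, d))

    lr = 0.05
    for _ in range(500):
        Z = X @ W.T
        grad_rel = W - (np.tanh(Z).T @ Z / n) @ W  # no inverse, no log-det
        W += lr * grad_rel                         # ascend the log-likelihood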


c10f48884c9c7fdbd9a7959c59eebea8-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their comments and the largely positive feedback. Reviewers agree that "the paper clearly ...", that the improvement our approach provides "is demonstrated by experiments", and the contribution was praised as "elegant".
R6: Rigorous formulation and convergence properties of the relative gradient: we will add more details on this, and we will include these references in the paper. These architectures have several limitations, e.g. they ... The drawback of this approach is that the permutation matrix P cannot be learned. We will include this discussion and reference in the paper.



Transferable Adversarial Attacks on SAM and Its Downstream Models

Neural Information Processing Systems

The use of large foundation models poses a dilemma: while fine-tuning downstream models from them holds promise for exploiting their well-generalized knowledge in practical applications, their open accessibility also poses threats of adverse usage. This paper, for the first time, explores the feasibility of adversarially attacking various downstream models fine-tuned from the segment anything model (SAM), solely utilizing the information from the open-sourced SAM. In contrast to prevailing transfer-based adversarial attacks, we demonstrate the existence of adversarial dangers even without access to the downstream task and dataset needed to train a similar surrogate model. To enhance the effectiveness of the adversarial attack against models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm to extract the intrinsic vulnerability inherent in the foundation model, which is then utilized as prior knowledge to guide the generation of adversarial perturbations. Moreover, by formulating the gradient difference in the attacking process between the open-sourced SAM and its fine-tuned downstream models, we theoretically demonstrate that a deviation occurs in the adversarial update direction when directly maximizing the distance of encoded feature embeddings in the open-sourced SAM. Consequently, we propose a gradient robust loss that simulates the associated uncertainty with gradient-based noise augmentation to enhance the robustness of generated adversarial examples (AEs) to this deviation, thus improving transferability. Extensive experiments demonstrate the effectiveness of the proposed universal meta-initialized and gradient robust adversarial attack (UMI-GRAT) toward SAMs and their downstream models.
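For intuition, a transfer attack of this flavor perturbs the input so that the surrogate encoder's feature embedding moves away from the clean embedding. The sketch below is a generic feature-space PGD in PyTorch that averages gradients over noise-perturbed inputs as a simple robustness heuristic; the encoder, step sizes, and noise scale are illustrative assumptions, and the averaging stands in for, but is not, the authors' gradient robust loss or UMI initialization.

    import torch

    def feature_space_attack(encoder, x, steps=10, eps=8/255, alpha=2/255,
                             noise_std=0.05, n_noise=4):
        # Push the adversarial image's features away from the clean features
        # of a surrogate encoder, under an L_inf budget of eps.
        with torch.no_grad():
            feat_clean = encoder(x)
        x_adv = x.clone().detach()
        for _ in range(steps):
            grad = torch.zeros_like(x_adv)
            for _ in range(n_noise):
                x_noisy = (x_adv + noise_std * torch.randn_like(x_adv)).requires_grad_(True)
                loss = torch.norm(encoder(x_noisy) - feat_clean)
                grad = grad + torch.autograd.grad(loss, x_noisy)[0]
            x_adv = x_adv + alpha * (grad / n_noise).sign()   # ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)          # L_inf projection
            x_adv = x_adv.clamp(0, 1).detach()
        return x_adv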


LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment

Neural Information Processing Systems

Although large language models (LLMs) have demonstrated strong capabilities, their high demand for computation and storage hinders practical application. To this end, many model compression techniques have been proposed to increase the efficiency of LLMs. However, current research validates these methods only on a limited set of models, datasets, and metrics, and lacks a comprehensive evaluation under more general scenarios, so it remains unclear which compression approach should be used in a given case. To close this gap, we present the Large Language Model Compression Benchmark (LLMCBench), a rigorously designed benchmark with an in-depth analysis of LLM compression algorithms. We first analyze actual model production requirements and carefully design evaluation tracks and metrics. We then conduct extensive experiments and comparisons using multiple mainstream LLM compression approaches. Finally, we perform an in-depth analysis based on the evaluation and provide useful insights for LLM compression design. We hope LLMCBench can offer insightful suggestions for LLM compression algorithm design and serve as a foundation for future research.
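To make the track-and-metric design concrete, a benchmark harness of this kind can be organized as a grid over compression methods, evaluation tracks, and metrics, with each cell filled by a scoring routine. The skeleton below is hypothetical; the track names, method names, and scoring stub are assumptions for illustration and do not reflect LLMCBench's actual tracks or results.

    from itertools import product

    # Hypothetical tracks, methods, and metrics for illustration only.
    TRACKS = ["language_modeling", "reasoning", "inference_efficiency"]
    METHODS = ["quantization", "pruning", "knowledge_distillation"]
    METRICS = ["accuracy", "speedup", "memory_ratio"]

    def evaluate(method: str, track: str, metric: str) -> float:
        # Placeholder: a real harness would load the compressed model and
        # run the track's evaluation suite here.
        return 0.0

    results = {
        (m, t, k): evaluate(m, t, k)
        for m, t, k in product(METHODS, TRACKS, METRICS)
    }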


Figure 1: Phase transitions of v̂(m, 0.2, 0.01, s)
Figure 2: Phase transitions of v̂(m, 0.05, 0.05, s)

Neural Information Processing Systems

Thank you very much for your reviews. The trends match those in the submission, as expected. As mentioned in footnote 20, the design is based on Section 3.1 of [13] (for s = 1). How do I compute δ̂(s, m, ε, w)? When ε = 0.07 and m = 500, there is similarly a local maximum (somewhere in 24 ≤ s ≤ 32) followed by a ... I appreciate that you read through my supplementary material, and I will certainly address the typos you noted.


Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

Neural Information Processing Systems

This work studies the problem of learning episodic Markov Decision Processes with known transition and bandit feedback. We develop the first algorithm with a "best-of-both-worlds" guarantee: it achieves O(log T) regret when the losses are stochastic, and simultaneously enjoys worst-case robustness with Õ(√T) regret even when the losses are adversarial, where T is the number of episodes. More generally, it achieves Õ(√C) regret in an intermediate setting where the losses are corrupted by a total amount of C. Our algorithm is based on the Follow-the-Regularized-Leader method from Zimin and Neu [26], with a novel hybrid regularizer inspired by recent works of Zimmert et al. [27, 29] for the special case of multi-armed bandits. Crucially, our regularizer admits a non-diagonal Hessian with a highly complicated inverse. Analyzing such a regularizer and deriving a particular self-bounding regret guarantee is our key technical contribution and might be of independent interest.
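For readers unfamiliar with the method family, Follow-the-Regularized-Leader picks the next decision by minimizing the cumulative (estimated) loss plus a regularizer over the feasible set. The sketch below is the classic negative-entropy special case on the probability simplex, which has the closed form of exponential weights; the paper's hybrid regularizer over occupancy measures is considerably more involved and is not reproduced here.

    import numpy as np

    def ftrl_entropy(loss_rounds, eta=0.1):
        # FTRL with negative-entropy regularizer on the simplex:
        #   x_t = argmin_x <cum_loss, x> + (1/eta) * sum_i x_i log x_i,
        # whose solution is x_t proportional to exp(-eta * cum_loss).
        K = len(loss_rounds[0])
        cum_loss = np.zeros(K)
        plays = []
        for losses in loss_rounds:
            logits = -eta * cum_loss
            x = np.exp(logits - logits.max())
            x /= x.sum()
            plays.append(x)
            cum_loss += np.asarray(losses)
        return plays

    # Toy usage: 3 arms, arbitrary losses in [0, 1] over 100 rounds.
    rng = np.random.default_rng(0)
    plays = ftrl_entropy(rng.uniform(size=(100, 3)))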