
c1502ae5a4d514baec129f72948c266e-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for valuable feedback. Before addressing individual comments, we clarify common concerns. The distinction between "image-level" and "pixel-level" training has no bearing on the validity of our evaluation: any method that uses a CNN learns more than just "image-level" representations. Results are: ours 47.2 vs. MoCo 46.9 mIoU. As suggested by R4, we retrain our model on COCO+VOC with HED edges and achieve 49.9 mIoU in the above-mentioned setting. Our task is to learn pixel-wise semantic-aware embeddings from scratch. We will update the final version to reflect the full 200 training epochs.


NAOMI: Non-Autoregressive Multiresolution Sequence Imputation

Neural Information Processing Systems

Missing value imputation is a fundamental problem in spatiotemporal modeling, from motion tracking to the dynamics of physical systems. Deep autoregressive models suffer from error propagation, which becomes catastrophic when imputing long-range sequences. In this paper, we take a non-autoregressive approach and propose a novel deep generative model, Non-AutOregressive Multiresolution Imputation (NAOMI), to impute long-range sequences given arbitrary missing patterns. NAOMI exploits the multiresolution structure of spatiotemporal data and decodes recursively from coarse to fine-grained resolutions using a divide-and-conquer strategy. We further enhance our model with adversarial training. Evaluated extensively on benchmark datasets from systems with both deterministic and stochastic dynamics, NAOMI demonstrates significant improvement in imputation accuracy (reducing average error by 60% compared to autoregressive counterparts) and generalization to long-range sequences.
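A minimal sketch of the coarse-to-fine decoding idea, with a hypothetical impute_midpoint stand-in for the paper's learned decoders (this illustrates the recursion only, not the authors' implementation):

    import numpy as np

    def impute_midpoint(left, right):
        # Stand-in for NAOMI's learned multiresolution decoder: here we
        # simply interpolate; the model instead predicts this value from
        # forward and backward hidden states.
        return (left + right) / 2.0

    def naomi_fill(seq, lo, hi):
        # Recursively impute the gap between two known endpoints,
        # decoding the midpoint first (coarse), then each half (fine).
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        if np.isnan(seq[mid]):
            seq[mid] = impute_midpoint(seq[lo], seq[hi])
        naomi_fill(seq, lo, mid)
        naomi_fill(seq, mid, hi)

    # Example: a trajectory with a long gap between two observed values.
    seq = np.array([0.0] + [np.nan] * 6 + [7.0])
    naomi_fill(seq, 0, len(seq) - 1)

Because every imputed value conditions on information from both sides of the gap, errors do not accumulate along the sequence as they do in a left-to-right autoregressive decoder.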


states (h^f, h^b)

Neural Information Processing Systems

We thank the reviewers for their insightful comments. We first clarify our approach and then address specific concerns. Note that the encoder and decoder share weights. We encourage the reviewers to check the supplementary material, which includes code and visualizations of our decoding strategy. Evaluating generative models is an open problem; e.g., log-likelihood does not correlate with sample quality. In our case, neither L2 nor log-likelihood can capture how "realistic" samples are. Regarding L2 loss on the basketball dataset, note that NAOMI (0.013) still outperforms SingleRes (0.040).


Causal Imitation for Markov Decision Processes: a Partial Identification Approach

Neural Information Processing Systems

Imitation learning enables an agent to learn from expert demonstrations when the performance measure is unknown and the reward signal is not specified. Standard imitation methods do not generally apply when the learner's and the expert's sensory capabilities mismatch and demonstrations are contaminated with unobserved confounding bias. To address these challenges, recent advancements in causal imitation learning have been pursued. However, these methods often require access to underlying causal structures that might not always be available, posing practical challenges. In this paper, we investigate robust imitation learning within the framework of canonical Markov Decision Processes (MDPs) using partial identification, allowing the agent to achieve expert performance even when the system dynamics are not uniquely determined from the confounded expert demonstrations. First, we theoretically demonstrate that when unobserved confounders (UCs) exist in an MDP, the learner is generally unable to imitate expert performance. We then explore imitation learning in partially identifiable settings, where either the transition distribution or the reward function is non-identifiable from the available data and knowledge. Augmenting the celebrated GAIL method (Ho & Ermon, 2016), our analysis leads to two novel causal imitation algorithms that obtain effective policies guaranteed to achieve expert performance.
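For reference, the GAIL objective being augmented (Ho & Ermon, 2016) is the minimax game

    \min_{\pi} \max_{D} \; \mathbb{E}_{\pi}\left[\log D(s, a)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D(s, a)\right)\right] - \lambda H(\pi),

where \pi_E is the expert policy, D is a discriminator over state-action pairs, and H(\pi) is a causal-entropy regularizer. The two causal imitation algorithms in the paper build on this objective to handle settings where the transition distribution or the reward function is not identifiable from the confounded demonstrations.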


MoGU: A Framework for Enhancing Safety of LLMs While Preserving Their Usability

Neural Information Processing Systems

Large Language Models (LLMs) are increasingly deployed in various applications. As their usage grows, concerns regarding their safety are rising, especially in maintaining harmless responses when faced with malicious instructions. Many defense strategies have been developed to enhance the safety of LLMs. However, our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance, thereby diminishing the usability of their responses to benign instructions. To solve this problem, we introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability. Our MoGU framework transforms the base LLM into two variants: the usable LLM and the safe LLM, and further employs dynamic routing to balance their contribution. When encountering malicious instructions, the router will assign a higher weight to the safe LLM to ensure that responses are harmless.
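A minimal sketch of the dynamic-routing idea, with hypothetical names (the actual MoGU router is learned jointly with the two variants and operates on internal representations of the base LLM):

    import torch

    def mogu_combine(hidden, router, usable_llm, safe_llm):
        # router: a small network emitting two logits; softmax gives the
        # mixing weights w = (w_usable, w_safe), which sum to 1.
        w = torch.softmax(router(hidden), dim=-1)
        out_usable = usable_llm(hidden)
        out_safe = safe_llm(hidden)
        # Malicious instructions should drive w_safe toward 1 (harmless
        # responses); benign ones should favor the usable variant.
        return w[..., 0:1] * out_usable + w[..., 1:2] * out_safe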


Dynamic Local Regret for Non-convex Online Forecasting

Neural Information Processing Systems

We consider online forecasting problems for non-convex machine learning models. Forecasting introduces several challenges: (i) frequent updates are necessary to deal with concept drift, since the dynamics of the environment change over time, and (ii) state-of-the-art models are non-convex. We address these challenges with a novel regret framework, since standard regret measures do not account for both a dynamic environment and non-convex models. We introduce a local regret for non-convex models in a dynamic environment, and present an update rule, based on time-smoothed gradients, whose cost under this local regret is sublinear in time T. Using a real-world dataset, we show that our time-smoothed approach yields several benefits over state-of-the-art competitors: results are more stable against new data, training is more robust to hyperparameter selection, and our approach is more computationally efficient than the alternatives.
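A minimal sketch of the time-smoothed update, assuming a hypothetical loss_grad(t, x) that returns the gradient of the round-t loss at x (a simplified illustration, not the paper's exact algorithm):

    from collections import deque
    import numpy as np

    def time_smoothed_updates(loss_grad, x0, T, w=5, lr=0.1):
        # At each round, step along the average gradient of the last w
        # losses, all evaluated at the current iterate; the local regret
        # is measured with respect to this time-smoothed gradient.
        x = x0
        window = deque(maxlen=w)
        for t in range(T):
            window.append(t)
            g = np.mean([loss_grad(s, x) for s in window], axis=0)
            x = x - lr * g
        return x

Averaging over a window damps the reaction to any single round's loss, which is the intuition behind the stability and robustness benefits reported in the experiments.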


50a074e6a8da4662ae0a29edde722179-AuthorFeedback.pdf

Neural Information Processing Systems

REVIEWER 2: Thank you for your encouraging comments. REVIEWER 3: Thank you for your comments. REVIEWER 4: Thank you for your comments. Without some formal notion, or even a toy scenario, for concept drift, it is not clear what theoretical basis there is to prefer one approach over another. Call this the oracle policy. Call this the stale policy.


APPENDIX

Neural Information Processing Systems

Universal approximation for densities is a property often discussed in the context of autoregressive normalizing flows. It can be shown, based on the proof of existence and non-uniqueness of solutions to the nonlinear ICA problem [29], that any distribution can be mapped onto a factorized base distribution by an invertible function with triangular Jacobian, provided that the function class used for this mapping is large enough. Normalizing flows with triangular Jacobians and a high number of parameters therefore have this approximation capacity.
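A standard way to make this concrete is the Knothe-Rosenblatt rearrangement: map x component-wise through conditional CDFs,

    z_i = F_i\left(x_i \mid x_1, \dots, x_{i-1}\right), \qquad i = 1, \dots, d,

where F_i denotes the conditional CDF of x_i given its predecessors. Each z_i is then uniformly distributed and independent of the others, the map is invertible, and its Jacobian is triangular because z_i depends only on x_1, ..., x_i.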


Relative gradient optimization of the Jacobian term in unsupervised deep learning

Neural Information Processing Systems

Learning expressive probabilistic models that correctly describe the data is a ubiquitous problem in machine learning. A popular approach is to map the observations into a representation space with a simple joint distribution, which can typically be written as a product of its marginals, thus drawing a connection with the field of nonlinear independent component analysis. Deep density models have been widely used for this task, but their maximum-likelihood-based training requires estimating the log-determinant of the Jacobian and is computationally expensive, imposing a trade-off between computation and expressive power. In this work, we propose a new approach for exact training of such neural networks. Based on relative gradients, we exploit the matrix structure of neural network parameters to compute updates efficiently even in high-dimensional spaces; the computational cost of training is quadratic in the input size, in contrast with the cubic scaling of naive approaches. This allows fast training with objective functions involving the log-determinant of the Jacobian, without imposing constraints on its structure, in stark contrast to autoregressive normalizing flows.
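A minimal sketch of the idea for a single linear layer W, in one common convention (an illustration of why no matrix inversion is needed, not the paper's full algorithm, which further exploits gradient structure for efficiency):

    import numpy as np

    def relative_gradient_step(W, G_data, lr=1e-3):
        # Euclidean gradient of the log-likelihood: G_data + W^{-T},
        # where the inverse-transpose comes from the log|det W| term.
        # Right-multiplying by W^T W turns that term into W itself:
        #     (G_data + W^{-T}) @ W.T @ W = G_data @ W.T @ W + W
        # so the update never inverts a matrix.
        return W + lr * (G_data @ W.T @ W + W)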


c10f48884c9c7fdbd9a7959c59eebea8-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their comments and the largely positive feedback. Reviewers agree that the paper is clearly written, that the improvement our approach provides "is demonstrated by experiments", and the contribution was praised as "elegant". R6, on the rigorous formulation and convergence properties of the relative gradient: we will add more details on this, and we will include these references in the paper. These architectures have several limitations; for example, the drawback of this approach is that the permutation matrix P cannot be learned. We will include this discussion and reference in the paper.