Vision Mamba Mender
Mamba, a state-space model with selective mechanisms and a hardware-aware architecture, has demonstrated outstanding performance in long-sequence modeling tasks and has attracted widespread exploration and application in computer vision. While existing works hold mixed opinions on its suitability for visual tasks, understanding its internal workings and optimizing its performance remain urgent and worthwhile research questions given its status as a novel model. Existing optimizations of the Mamba model, especially in the visual domain, have primarily relied on predefined methods such as improved scanning mechanisms or the integration of other architectures, which often require strong priors and extensive trial and error. In contrast, this paper proposes the Vision Mamba Mender, a systematic approach for understanding the workings of Mamba, identifying flaws within it, and subsequently optimizing model performance. Specifically, we present methods for analyzing the predictive correlation of Mamba's hidden states from both internal and external perspectives, along with corresponding definitions of correlation scores, aimed at understanding Mamba's workings in visual recognition tasks and identifying the flaws therein. Additionally, tailored repair methods are proposed for the identified external and internal state flaws to eliminate them and optimize model performance. Extensive experiments on prevalent Mamba architectures validate the efficacy of the proposed methods, significantly enhancing Mamba's performance.
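The paper's internal/external correlation analysis is not spelled out in the abstract, but the general idea of scoring hidden states by how strongly they track the model's predictions can be sketched in a few lines. Everything below (the function name, the use of prediction confidence as the target, the Pearson-correlation score) is an illustrative assumption, not the paper's actual definition:

```python
import numpy as np

def state_correlation_scores(hidden_states, confidences):
    """Toy score: absolute Pearson correlation between each hidden-state
    dimension and the per-sample prediction confidence (a hypothetical
    stand-in for the paper's correlation scores)."""
    h = hidden_states - hidden_states.mean(axis=0)
    c = confidences - confidences.mean()
    cov = h.T @ c / len(c)
    denom = h.std(axis=0) * c.std() + 1e-12
    return np.abs(cov / denom)

rng = np.random.default_rng(0)
conf = rng.random(200)                              # pretend confidences
states = rng.normal(size=(200, 4))                  # pretend hidden states
states[:, 0] = conf + 0.05 * rng.normal(size=200)   # one predictive dim
scores = state_correlation_scores(states, conf)
```

In this toy setup, dimensions with near-zero scores would be flagged as candidate "flawed" states and handed to a repair step.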
The Many Faces of Optimal Weak-to-Strong Learning
Boosting is an extremely successful idea, allowing one to combine multiple low-accuracy classifiers into a much more accurate voting classifier. In this work, we present a new and surprisingly simple Boosting algorithm that obtains a provably optimal sample complexity. Sample-optimal Boosting algorithms have only recently been developed, and our new algorithm has the fastest runtime among all such algorithms and is the simplest to describe: partition your training data into 5 disjoint pieces of equal size, run AdaBoost on each, and combine the resulting classifiers via a majority vote. In addition to this theoretical contribution, we also perform the first empirical comparison of the proposed sample-optimal Boosting algorithms. Our pilot empirical study suggests that our new algorithm may outperform previous algorithms on large data sets.
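The described recipe is concrete enough to sketch end to end. The weak learner below is a simple exhaustive decision stump and all function names are my own; only the "5 disjoint pieces + majority vote" structure comes from the abstract:

```python
import numpy as np

def adaboost(X, y, rounds=20):
    """Minimal AdaBoost with exhaustive decision stumps; y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    stumps = []
    for _ in range(rounds):
        best = None
        for j in range(X.shape[1]):                 # feature
            for t in np.unique(X[:, j]):            # threshold
                for s in (1, -1):                   # polarity
                    pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        err, j, t, s = best
        err = min(max(err, 1e-10), 1 - 1e-10)       # numerical safety
        alpha = 0.5 * np.log((1 - err) / err)
        pred = s * np.where(X[:, j] <= t, 1.0, -1.0)
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append((alpha, j, t, s))
    return stumps

def boost_predict(stumps, X):
    score = sum(a * s * np.where(X[:, j] <= t, 1.0, -1.0)
                for a, j, t, s in stumps)
    return np.sign(score)

def majority_of_five(X, y, X_test):
    """The abstract's recipe: 5 equal disjoint pieces, AdaBoost on each,
    unweighted majority vote of the 5 resulting classifiers."""
    parts = np.array_split(np.arange(len(y)), 5)
    models = [adaboost(X[idx], y[idx]) for idx in parts]
    votes = sum(boost_predict(m, X_test) for m in models)
    return np.sign(votes)  # 5 voters, so ties are impossible

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
pred = majority_of_five(X, y, X)
```

Because the vote is over an odd number of classifiers, the combined prediction is always in {-1, +1}.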
SKFlow: Learning Optical Flow with Super Kernels
Shangkun Sun, Yuanqi Chen
Optical flow estimation is a classical yet challenging task in computer vision. One of the essential factors in accurately predicting optical flow is to alleviate occlusions between frames. However, this remains a thorny problem for current top-performing optical flow estimation methods due to insufficient local evidence for modeling occluded areas. In this paper, we propose the Super Kernel Flow Network (SKFlow), a CNN architecture that ameliorates the impact of occlusions on optical flow estimation. SKFlow benefits from super kernels, which enlarge receptive fields to complement the absent matching information and recover occluded motions.
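The abstract does not give concrete kernel sizes, but the motivation for large "super" kernels can be made precise with standard receptive-field arithmetic: a single large kernel sees as much context as a deep stack of small ones. The function below is a generic sketch (stride 1, no dilation), not SKFlow's actual architecture:

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, dilation-1 convolutions:
    each k x k layer adds k - 1 pixels of context."""
    return 1 + sum(k - 1 for k in kernel_sizes)

# A single (hypothetical) 31 x 31 super kernel matches the context of
# fifteen stacked 3 x 3 layers in one shot:
single_large = receptive_field([31])
stacked_small = receptive_field([3] * 15)
```

This is why a single large-kernel layer can supply the long-range evidence needed around occluded regions without a very deep stack.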
Stealth edits to large language models
We reveal the theoretical foundations of techniques for editing large language models, and present new methods which can do so without requiring retraining. Our theoretical insights show that a single metric (a measure of the intrinsic dimension of the model's features) can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. This metric is fundamental to predicting the success of a variety of editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these as stealth editing methods, because they directly update a model's weights to specify its response to specific known hallucinating prompts without affecting other model behaviour. By carefully applying our theoretical insights, we are able to introduce a new jet-pack network block which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. We also reveal the vulnerability of language models to stealth attacks: a small change to a model's weights which fixes its response to a single attacker-chosen prompt. Stealth attacks are computationally simple, do not require access to or knowledge of the model's training data, and therefore represent a potent yet previously unrecognised threat to redistributed foundation models. Extensive experimental results illustrate and support our methods and their theoretical underpinnings.
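The paper's specific metric is not reproduced here, but one widely used intrinsic-dimension estimator, the TwoNN estimator (based on ratios of nearest-neighbor distances), gives a feel for how such a quantity is computed from a set of feature vectors. The implementation below is a generic sketch under that assumption, not the paper's method:

```python
import numpy as np

def twonn_dimension(X):
    """TwoNN-style MLE of intrinsic dimension, from the ratio of each
    point's second- to first-nearest-neighbor distance."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self-distances
    two_smallest = np.partition(d2, 1, axis=1)[:, :2]
    r1 = np.sqrt(two_smallest[:, 0])
    r2 = np.sqrt(two_smallest[:, 1])
    return len(X) / np.sum(np.log(r2 / r1))

# Sanity check: points on a 2-D sheet embedded in 5-D should score near 2.
rng = np.random.default_rng(0)
sheet = np.zeros((500, 5))
sheet[:, :2] = rng.random((500, 2))
dim_est = twonn_dimension(sheet)
```

The estimate depends only on local neighbor ratios, so it captures the dimension of the manifold the features lie on rather than the ambient dimension.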
Stochastic Concept Bottleneck Models
Sonia Laguna
Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on intermediate, human-understandable concepts rather than the raw input. Through time-consuming manual interventions, a user can correct wrongly predicted concept values to enhance the model's downstream performance. We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that models concept dependencies. In SCBMs, a single-concept intervention affects all correlated concepts, thereby improving intervention effectiveness. Unlike previous approaches that model the concept relations via an autoregressive structure, we introduce an explicit, distributional parameterization that allows SCBMs to retain the CBMs' efficient training and inference procedure. Additionally, we leverage the parameterization to derive an effective intervention strategy based on the confidence region. We show empirically on synthetic tabular and natural image datasets that our approach improves intervention effectiveness significantly.
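One simple way to see how a distributional parameterization propagates a single-concept intervention is to model concept logits as a multivariate Gaussian and condition on the intervened concept; correlated concepts then shift via the standard conditional-mean formula. This is an illustrative sketch of the idea, not the SCBMs implementation:

```python
import numpy as np

def intervene(mu, Sigma, idx, value):
    """Set concept `idx` to `value` and shift all other concept means via
    the Gaussian conditional-mean formula (illustrative sketch)."""
    mask = np.arange(len(mu)) != idx
    new_mu = mu.astype(float).copy()
    new_mu[mask] = mu[mask] + Sigma[mask, idx] * (value - mu[idx]) / Sigma[idx, idx]
    new_mu[idx] = value
    return new_mu

# Concepts 0 and 1 are strongly correlated; concept 2 is independent.
Sigma = np.array([[1.0, 0.8, 0.0],
                  [0.8, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
after = intervene(np.zeros(3), Sigma, idx=0, value=2.0)
```

Intervening on concept 0 drags the correlated concept 1 along with it, while the independent concept 2 is untouched, which is exactly the behavior that makes a single intervention more effective.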
A Appendix
A.1 Theorems: Preliminaries

A.1.1 Transforming non-autonomous into autonomous discrete-time DS. Following [97], and based on similar reasoning as for continuous-time (ODE-based) DS [3, 71], let us consider the non-autonomous discrete-time DS $x_{t+1} = F(x_t, t)$. Augmenting the state with time, $\tilde{x}_t := (x_t, t)$, yields the equivalent autonomous system $\tilde{x}_{t+1} = (F(x_t, t),\, t+1)$.

Let the LSTM given by (31) have a chaotic attractor $A$. Based on Oseledec's multiplicative ergodic theorem, (39) holds for every $z \in A$.

A.1.5 Gated Recurrent Unit (GRU). A GRU network is defined by the equations
$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r), \qquad u_t = \sigma(W_u x_t + U_u h_{t-1} + b_u),$$
$$h_t = (1 - u_t) \odot h_{t-1} + u_t \odot \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big).$$

Proposition 2. The uRNN given by (41) cannot have any chaotic orbit. Therefore (14) holds and, according to Oseledec's multiplicative ergodic theorem, does so for every $z$. Trivially, for orbits that diverge to infinity (unbounded latent states), gradients of the loss function will explode as $T \to \infty$.

For RNNs with ReLU activation functions there are finitely many compartments in the phase space, each with a different (piecewise-linear) functional form. Based on Theorem 2, we can also formulate the necessary conditions for chaos and diverging gradients in standard RNNs with particular activation functions by considering the norms of their recurrence matrices, for which the following corollary provides the basis:

Corollary 1. Assume, for the sake of contradiction, that $\|W\| \le 1$.
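For reference, a single step of a standard GRU cell (one common gate convention; the appendix's own notation may differ) can be written directly in NumPy. Note that the updated state is a convex combination of the previous state and a tanh candidate, so latent states started at zero stay bounded in (-1, 1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step: update gate z, reset gate r, tanh candidate state,
    then a convex combination of old state and candidate."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ x + Uz @ h + bz)
    r = sigmoid(Wr @ x + Ur @ h + br)
    h_cand = np.tanh(Wh @ x + Uh @ (r * h) + bh)
    return (1.0 - z) * h + z * h_cand

# Roll the cell forward on random inputs; the state stays inside (-1, 1).
rng = np.random.default_rng(0)
dim, din = 6, 3
params = [rng.normal(scale=0.5, size=s)
          for s in [(dim, din), (dim, dim), (dim,)] * 3]
h = np.zeros(dim)
for _ in range(20):
    h = gru_step(rng.normal(size=din), h, params)
```

The boundedness of the latent state is what rules out the "orbits diverging to infinity" failure mode discussed above for gated architectures.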
On the difficulty of learning chaotic dynamics with RNNs
Jonas M. Mikhaeil and Daniel Durstewitz
Recurrent neural networks (RNNs) are widespread machine learning tools for modeling sequential and time series data. They are notoriously hard to train because their loss gradients, backpropagated in time, tend to saturate or diverge during training. This is known as the exploding and vanishing gradient problem. Previous solutions to this issue either built on rather complicated, purpose-engineered architectures with gated memory buffers, or, more recently, imposed constraints that ensure convergence to a fixed point or restrict (the eigenspectrum of) the recurrence matrix. Such constraints, however, impose severe limitations on the expressivity of the RNN.
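The exploding/vanishing mechanism can be demonstrated numerically: backpropagated gradients are repeatedly multiplied by the recurrence Jacobian, so their norm grows or shrinks roughly like the spectral radius raised to the number of time steps. The toy below uses a fixed linear recurrence matrix as a stand-in for the Jacobian product (function name and setup are my own):

```python
import numpy as np

def backprop_norm(spectral_radius, steps=50, dim=8, seed=0):
    """Norm of a vector after `steps` multiplications by the transpose of
    a recurrence matrix rescaled to the given spectral radius (a toy
    stand-in for the product of RNN Jacobians in BPTT)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(dim, dim))
    W *= spectral_radius / np.abs(np.linalg.eigvals(W)).max()
    g = np.ones(dim) / np.sqrt(dim)
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

exploding = backprop_norm(1.3)   # grows roughly like 1.3**50
vanishing = backprop_norm(0.7)   # shrinks roughly like 0.7**50
```

Constraining the recurrence matrix to spectral radius near 1 avoids both failure modes, which is exactly the expressivity trade-off the abstract criticizes.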
Junlei Zhou
Text-to-Image (T2I) generation has witnessed significant advancements, demonstrating superior performance across various generative tasks. However, the presence of stereotypes in T2I introduces harmful biases that require urgent attention as the technology becomes more prominent. Previous work on stereotype mitigation mainly concentrated on stereotypes engendered by individual objects within images, and failed to address stereotypes engendered by the association of multiple objects, which we refer to as Association-Engendered Stereotypes. For example, mentioning "black people" and "houses" separately in prompts may not exhibit stereotypes. Nevertheless, when these two objects are associated in prompts, the association of "black people" with "poorer houses" becomes more pronounced.
Appendix of DynaBERT: Dynamic BERT with Adaptive Width and Depth
B.1 Description of Data sets in the GLUE benchmark. The GLUE benchmark [11] is a collection of diverse natural language understanding tasks, including textual entailment (RTE and MNLI), question answering (QNLI), similarity and paraphrase (MRPC, QQP, STS-B), sentiment analysis (SST-2) and linguistic acceptability (CoLA). For MNLI, we use both the matched (MNLI-m) and mismatched (MNLI-mm) sections. We do not experiment on Winograd Schema (WNLI) because even a majority baseline outperforms many methods on it.

The same hyperparameters as in Table 1 are used for DynaRoBERTa. The batch size is 12 throughout the training process.
Author Feedback
General response: We thank all reviewers for their constructive comments. Below is our response to common questions. Q2. Broader impact (R2 & R3): On the positive side, as detailed in the Broader Impact section, DynaBERT (i) BERT models; and (iii) is more environmentally friendly due to weight sharing. Reviewer 1, Q1. "whether this approach can be adapted to work during the pre-training phase": Below, we compare with separately pre-trained small models in the Google BERT repository (https://github.com/ For depth, we adjust the number of layers to be L = 4, 6.