Object Reprojection Error (ORE): Camera pose benchmarks from lightweight tracking annotations
Semantic modeling approaches employed on monocular video often ingest outputs from off-the-shelf SLAM/SfM pipelines, which are anecdotally observed to perform poorly or fail completely on some fraction of the videos of interest. These target videos may vary widely in complexity of scenes, activities, camera trajectory, etc. Unfortunately, such semantically-rich video data often comes with no ground-truth 3D information, and in practice it is prohibitively costly or impossible to obtain ground-truth reconstructions or camera pose post-hoc. This paper proposes a novel evaluation protocol, Object Reprojection Error (ORE), to benchmark camera trajectories; ORE computes reprojection error for static objects within the video and requires only lightweight object tracklet annotations. These annotations are easy to gather on new or existing video, enabling ORE to be calculated on essentially arbitrary datasets. We show that ORE maintains high rank correlation with standard metrics based on ground truth.
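As a rough illustration of the quantity involved, the sketch below computes a mean reprojection error for one static point under a pinhole camera model. It is a generic example, not the ORE protocol itself: the abstract does not say how the 3D object position is obtained, so the point `X`, the intrinsics `K`, the poses, and the function names are all assumptions made for illustration.

```python
import numpy as np

def project(K, R, t, X):
    """Project world point X to pixels with a pinhole camera (x_cam = R @ X + t)."""
    x_cam = R @ X + t
    uvw = K @ x_cam
    return uvw[:2] / uvw[2]

def mean_reprojection_error(K, poses, X, track):
    """Mean pixel distance between annotated 2D positions and reprojections of X.

    poses: list of (R, t) per frame; track: annotated (u, v) per frame.
    """
    errors = [np.linalg.norm(project(K, R, t, X) - np.asarray(uv))
              for (R, t), uv in zip(poses, track)]
    return float(np.mean(errors))

# Hypothetical intrinsics, static point, and a two-frame annotated track.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
X = np.array([0.0, 0.0, 5.0])
track = [(320.0, 240.0), (420.0, 240.0)]

# Poses consistent with the track, and a second set with a perturbed translation.
good_poses = [(np.eye(3), np.zeros(3)), (np.eye(3), np.array([1.0, 0.0, 0.0]))]
bad_poses = [(np.eye(3), np.zeros(3)), (np.eye(3), np.array([1.1, 0.0, 0.0]))]

err_good = mean_reprojection_error(K, good_poses, X, track)
err_bad = mean_reprojection_error(K, bad_poses, X, track)
```

In this toy setup the error grows when the estimated pose drifts away from the one consistent with the annotated track, which is the kind of signal a reprojection-based benchmark relies on.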
KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
Jiang, Kailin, Jiang, Hongbo, Jiang, Ning, Gao, Zhi, Bi, Jinhe, Ren, Yuchen, Li, Bin, Du, Yuntao, Liu, Lei, Li, Qing
Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of the LMM's linear layer activations and initializes the adapter by projecting the original weights into the matrix's null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
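The null-space idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not KORE's implementation: the activations are synthetic, the eigenvalue threshold is arbitrary, and names such as `X_old`, `U_null`, and `delta_W` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n = 8, 4, 100

# Hypothetical "old knowledge" activations confined to a 5-dimensional subspace.
basis = rng.standard_normal((d_in, 5))
X_old = basis @ rng.standard_normal((5, n))   # shape (d_in, n)
C = X_old @ X_old.T                           # unnormalized activation covariance

# Eigenvectors with near-zero eigenvalues span the (approximate) null space of C.
eigvals, eigvecs = np.linalg.eigh(C)
U_null = eigvecs[:, eigvals < 1e-8 * eigvals.max()]
P = U_null @ U_null.T                         # orthogonal projector onto that null space

W = rng.standard_normal((d_out, d_in))        # stand-in for pre-trained weights
delta_W = W @ P                               # weights projected into the null space

# Effect of the projected direction on the old activations.
change_on_old = np.abs(delta_W @ X_old).max()
```

Because `delta_W` only acts on directions the old activations never occupy, its effect on `X_old` is numerically zero in this toy setting, which is the intuition behind an update direction that minimizes interference with previous knowledge.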
Transformer learns the cross-task prior and regularization for in-context learning
Transformers have shown a remarkable ability for in-context learning (ICL), making predictions based on contextual examples. However, while theoretical analyses have explored this prediction capability, the nature of the inferred context and its utility for downstream predictions remain open questions. This paper aims to address these questions by examining ICL for inverse linear regression (ILR), where context inference can be characterized by unsupervised learning of underlying weight vectors. Focusing on the challenging scenario of rank-deficient inverse problems, where context length is smaller than the number of unknowns in the weight vectors and regularization is necessary, we introduce a linear transformer to learn the inverse mapping from contextual examples to the underlying weight vector. Our findings reveal that the transformer implicitly learns both a prior distribution and an effective regularization strategy, outperforming traditional ridge regression and regularization methods. A key insight is the necessity of low task dimensionality relative to the context length for successful learning. Furthermore, we numerically verify that the error of the transformer estimator scales linearly with the noise level, the ratio of task dimension to context length, and the condition number of the input data. These results not only demonstrate the potential of transformers for solving ill-posed inverse problems, but also provide a new perspective towards understanding the knowledge extraction mechanism within transformers.
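For reference, the ridge regression baseline mentioned above can be sketched on a rank-deficient problem where the context length is smaller than the number of unknowns. The dimensions, noise level, and regularization strength below are arbitrary assumptions, and the example covers only the classical baseline, not the transformer estimator.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 5, 20                                  # context length n < number of unknowns d
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.01 * rng.standard_normal(n)

# X^T X is d x d but has rank at most n, so it cannot be inverted without regularization.
rank_gram = np.linalg.matrix_rank(X.T @ X)
w_hat = ridge(X, y, lam=0.1)
```

The `lam * I` term is what makes the linear system solvable here; without it, infinitely many weight vectors fit the five observations equally well.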
Architecture for Simulating Behavior Mode Changes in Norm-Aware Autonomous Agents
Glaze, Sean, Inclezan, Daniela
This paper presents an architecture for simulating the actions of a norm-aware intelligent agent whose behavior with respect to norm compliance is set, and can later be changed, by a human controller. Updating an agent's behavior mode from a norm-abiding to a riskier one may be relevant when the agent is involved in time-sensitive rescue operations, for example. We base our work on the Authorization and Obligation Policy Language AOPL designed by Gelfond and Lobo for the specification of norms. We introduce an architecture and a prototype software system that can be used to simulate an agent's plans under different behavior modes that can later be changed by the controller. We envision such software to be useful to policy makers, as they can more readily understand how agents may act in certain situations based on the agents' attitudes towards norm-compliance. Policy makers may then refine their policies if simulations show unwanted consequences.
Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Muhamed, Aashiq, Li, Oscar, Woodruff, David, Diab, Mona, Smith, Virginia
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves performance competitive with full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU--a feat infeasible for previous methods--and yields up to a $2\times$ throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS.
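One way to picture a sparse projection is as row selection, sketched below. This is an assumption-laden illustration rather than the Grass algorithm: the abstract does not specify how the projection is constructed, so the uniform random choice of rows and the names `rows`, `G_low`, and `P` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 4, 2
G = rng.standard_normal((m, n))               # full gradient of an m x n weight matrix

# Hypothetical selection rule: keep r rows chosen uniformly at random.
rows = np.sort(rng.choice(m, size=r, replace=False))

# Projected gradient: optimizer state would only need this (r, n) array.
G_low = G[rows]

# Map back to the full shape: a structured sparse update with r nonzero rows.
update = np.zeros_like(G)
update[rows] = G_low

# Equivalent dense formulation: P has exactly one nonzero entry per column.
P = np.zeros((m, r))
P[rows, np.arange(r)] = 1.0
```

Indexing with `rows` gives the same result as multiplying by `P.T`, but never materializes an `m x r` matrix, which hints at why sparse projections can be cheaper than dense ones.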
Do Parameters Reveal More than Loss for Membership Inference?
Suri, Anshuman, Zhang, Xiao, Evans, David
Membership inference attacks aim to infer whether an individual record was used to train a model, serving as a key tool for disclosure auditing. While such evaluations are useful to demonstrate risk, they are computationally expensive and often make strong assumptions about potential adversaries' access to models and training environments, and thus do not provide very tight bounds on leakage from potential attacks. We show how prior claims around black-box access being sufficient for optimal membership inference do not hold for most useful settings such as stochastic gradient descent, and that optimal membership inference indeed requires white-box access. We validate our findings with a new white-box inference attack IHA (Inverse Hessian Attack) that explicitly uses model parameters by taking advantage of computing inverse-Hessian vector products. Our results show that both audits and adversaries may be able to benefit from access to model parameters, and we advocate for further research into white-box methods for membership privacy auditing.
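An inverse-Hessian vector product can be computed without forming the inverse, for example with conjugate gradient. The sketch below shows that standard technique on a small synthetic matrix; the abstract does not state how IHA computes these products, so the solver choice and the toy matrix `H` are assumptions for illustration only.

```python
import numpy as np

def conjugate_gradient(hvp, b, iters=50, tol=1e-10):
    """Solve H x = b for symmetric positive-definite H, given only u -> H @ u."""
    x = np.zeros_like(b)
    r = b - hvp(x)                            # residual
    p = r.copy()                              # search direction
    rs = r @ r
    for _ in range(iters):
        if np.sqrt(rs) < tol:
            break
        Hp = hvp(p)
        alpha = rs / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = A @ A.T + 6.0 * np.eye(6)                 # synthetic symmetric positive-definite "Hessian"
v = rng.standard_normal(6)

# x approximates H^{-1} v using only Hessian-vector products.
x = conjugate_gradient(lambda u: H @ u, v)
```

For a real model, the lambda would be replaced by a Hessian-vector product obtained through automatic differentiation, so the full Hessian is never stored.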
Persian Homograph Disambiguation: Leveraging ParsBERT for Enhanced Sentence Understanding with a Novel Word Disambiguation Dataset
Homograph disambiguation, the task of distinguishing words with identical spellings but different meanings, poses a substantial challenge in natural language processing. In this study, we introduce a novel dataset tailored for Persian homograph disambiguation. Our work encompasses a thorough exploration of various embeddings, evaluated through the cosine similarity method and their efficacy in downstream tasks like classification. Our investigation entails training a diverse array of lightweight machine learning and deep learning models for homograph disambiguation. We scrutinize the models' performance in terms of Accuracy, Recall, and F1 Score, thereby gaining insights into their respective strengths and limitations. The outcomes of our research underscore three key contributions. First, we present a newly curated Persian dataset, providing a solid foundation for future research in homograph disambiguation. Second, our comparative analysis of embeddings highlights their utility in different contexts, enriching the understanding of their capabilities. Third, by training and evaluating a spectrum of models, we extend valuable guidance for practitioners in selecting suitable strategies for homograph disambiguation tasks. In summary, our study unveils a new dataset, scrutinizes embeddings through diverse perspectives, and benchmarks various models for homograph disambiguation. These findings empower researchers and practitioners to navigate the intricate landscape of homograph-related challenges effectively.