Make Your LLM Fully Utilize the Context Shengnan An
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information.
CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances Jihoon Tack
Novelty detection, i.e., identifying whether a given sample is drawn from outside the training distribution, is essential for reliable machine learning. To this end, there have been many attempts at learning a representation well-suited for novelty detection and designing a score based on such representation. In this paper, we propose a simple, yet effective method named contrasting shifted instances (CSI), inspired by the recent success on contrastive learning of visual representations. Specifically, in addition to contrasting a given sample with other instances as in conventional contrastive learning methods, our training scheme contrasts the sample with distributionally-shifted augmentations of itself. Based on this, we propose a new detection score that is specific to the proposed training scheme. Our experiments demonstrate the superiority of our method under various novelty detection scenarios, including unlabeled one-class, unlabeled multi-class and labeled multi-class settings, with various image benchmark datasets. Code and pre-trained models are available at https://github.com/alinlab/CSI.
Periodic agent-state based Q-learning for POMDPs, and Aditya Mahajan McGill University, Mila
The standard approach for Partially Observable Markov Decision Processes (POMDPs) is to convert them to a fully observed belief-state MDP. However, the belief state depends on the system model and is therefore not viable in reinforcement learning (RL) settings. A widely used alternative is to use an agent state, which is a model-free, recursively updateable function of the observation history. Examples include frame stacking and recurrent neural networks. Since the agent state is model-free, it is used to adapt standard RL algorithms to POMDPs. However, standard RL algorithms like Q-learning learn a stationary policy.
A Algorithms. Algorithm 1 Training DHRL sample D
Then, there exists a constant L > 0 such that x and y, max(Dist(x y), Dist(y x)) L||x y||, where || || is the Euclidean norm, since Dist(x x) = 0. Then, any ฯต/L resolution graph w.r.t the Euclidean norm, whose existence is trivial, is an ฯต resolution graph w.r.t Dist(). The completely failed baselines are occluded by others. As shown in the table above, the wider the initial distribution, the easier it is for the agent to explore the map. In other words, the'fixed initial state distribution' condition we experimented with in this paper is a more difficult condition than the'uniform initial state distribution' that previous graph-guided RL algorithms utilize.
DHRL: A Graph-Based Approach for Long-Horizon and Sparse Hierarchical Reinforcement Learning Seungjae Lee 1,2, H. Jin Kim
Hierarchical Reinforcement Learning (HRL) has made notable progress in complex control tasks by leveraging temporal abstraction. However, previous HRL algorithms often suffer from serious data inefficiency as environments get large. The extended components, i.e., goal space and length of episodes, impose a burden on either one or both high-level and low-level policies since both levels share the total horizon of the episode. In this paper, we present a method of Decoupling Horizons Using a Graph in Hierarchical Reinforcement Learning (DHRL) which can alleviate this problem by decoupling the horizons of high-level and low-level policies and bridging the gap between the length of both horizons using a graph. DHRL provides a freely stretchable high-level action interval, which facilitates longer temporal abstraction and faster training in complex tasks. Our method outperforms state-of-the-art HRL algorithms in typical HRL environments. Moreover, DHRL achieves long and complex locomotion and manipulation tasks.
Facilitating Multimodal Classification via Dynamically Learning Modality Gap Yang Yang, Qing-Yuan Jiang
Multimodal learning falls into the trap of the optimization dilemma due to the modality imbalance phenomenon, leading to unsatisfactory performance in real applications. A core reason for modality imbalance is that the models of each modality converge at different rates. Many attempts naturally focus on adjusting learning procedures adaptively. Essentially, the reason why models converge at different rates is because the difficulty of fitting category labels is inconsistent for each modality during learning. From the perspective of fitting labels, we find that appropriate positive intervention label fitting can correct this difference in learning ability. By exploiting the ability of contrastive learning to intervene in the learning of category label fitting, we propose a novel multimodal learning approach that dynamically integrates unsupervised contrastive learning and supervised multimodal learning to address the modality imbalance problem. We find that a simple yet heuristic integration strategy can significantly alleviate the modality imbalance phenomenon. Moreover, we design a learning-based integration strategy to integrate two losses dynamically, further improving the performance. Experiments on widely used datasets demonstrate the superiority of our method compared with state-of-the-art (SOTA) multimodal learning approaches.
Rethinking The Training And Evaluation of Rich-Context Layout-to-Image Generation Tong He
Recent advancements in generative models have significantly enhanced their capacity for image generation, enabling a wide range of applications such as image editing, completion and video editing. A specialized area within generative modeling is layout-to-image (L2I) generation, where predefined layouts of objects guide the generative process. In this study, we introduce a novel regional cross-attention module tailored to enrich layout-to-image generation. This module notably improves the representation of layout regions, particularly in scenarios where existing methods struggle with highly complex and detailed textual descriptions. Moreover, while current open-vocabulary L2I methods are trained in an open-set setting, their evaluations often occur in closed-set environments. To bridge this gap, we propose two metrics to assess L2I performance in open-vocabulary scenarios. Additionally, we conduct a comprehensive user study to validate the consistency of these metrics with human preferences.
TotalSelfScan: Learning Full-body Avatars from Self-Portrait Videos of Faces, Hands, and Bodies
Recent advances in implicit neural representations make it possible to reconstruct a human-body model from a monocular self-rotation video. While previous works present impressive results of human body reconstruction, the quality of reconstructed face and hands are relatively low. The main reason is that the image region occupied by these parts is very small compared to the body. To solve this problem, we propose a new approach named TotalSelfScan, which reconstructs the full-body model from several monocular self-rotation videos that focus on the face, hands, and body, respectively. Compared to recording a single video, this setting has almost no additional cost but provides more details of essential parts. To learn the full-body model, instead of encoding the whole body in a single network, we propose a multi-part representation to model separate parts and then fuse the part-specific observations into a single unified human model. Once learned, the full-body model enables rendering photorealistic free-viewpoint videos under novel human poses. Experiments show that TotalSelfScan can significantly improve the reconstruction and rendering quality on the face and hands compared to the existing methods.
Appendices A Code and Data
B.1 The Genome and Non-Coding Regulation DNA, present in every cell, stores the complete set of instructions essential for life. It consists of a chain of nucleotides--adenine (A), thymine (T), guanine (G), and cytosine (C)--whose specific sequences encode functional elements. While genes, which code for proteins, are the most recognized of these elements, they constitute only a fraction of the genome. The complete DNA sequence of an organism is referred to as its genome. Within genes, coding sequences specify the amino acid composition of proteins.
DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA Aman Patel 1 Austin Wang 1
Recent advances in self-supervised models for natural language, vision, and protein sequences have catalyzed the development of genomic DNA language models (DNALMs). These models aim to learn generalizable representations of diverse DNA elements, potentially enabling various downstream genomic prediction, interpretation and design tasks. However, existing benchmarks do not adequately assess the capabilities of DNALMs on an important class of non-coding DNA elements critical for regulating gene activity. Here, we introduce DART-Eval, a suite of representative benchmarks focused on regulatory DNA to evaluate performance of DNALMs across zero-shot, probed, and fine-tuned settings against contemporary ab initio models as baselines.