
Path Channels and Plan Extension Kernels: a Mechanistic Description of Planning in a Sokoban RNN

Taufeeque, Mohammad, Tucker, Aaron David, Gleave, Adam, Garriga-Alonso, Adrià

arXiv.org Artificial Intelligence

We partially reverse-engineer a convolutional recurrent neural network (RNN) trained with model-free reinforcement learning to play the box-pushing game Sokoban. We find that the RNN stores future moves (plans) as activations in particular channels of the hidden state, which we call path channels. A high activation at a particular location means that, when a box reaches that location, it will be pushed in the channel's assigned direction. We examine the convolutional kernels between path channels and find that they encode the change in position resulting from each possible action, thus representing part of a learned transition model. The RNN constructs plans by starting at the boxes and goals: these kernels extend activations in path channels forwards from boxes and backwards from goals. Negative values are placed in path channels at obstacles, and the extension kernels propagate these negative values in reverse, pruning the last few steps of a blocked plan and letting an alternative emerge; a form of backtracking. Our work shows that a precise understanding of the plan representation allows us to describe, in more familiar terms, the bidirectional planning-like algorithm learned by model-free training.
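
To make the mechanism concrete, here is a minimal sketch (ours, not the authors' code) of how a single plan extension kernel could extend a "push right" path channel one cell per recurrent tick. The grid size, kernel weights, and seed position are illustrative assumptions.

```python
# Minimal sketch of a "plan extension kernel" acting on a path channel.
# In the real DRC these are learned hidden-state slices and learned weights;
# everything here is a hypothetical stand-in.
import numpy as np
from scipy.signal import correlate2d  # conv layers compute cross-correlation

H = W = 5
right_channel = np.zeros((H, W))  # path channel for the action "push right"
right_channel[2, 0] = 1.0         # plan seed: a box at row 2, column 0

# A single off-center weight encodes the transition "a box pushed right
# from (i, j) arrives at (i, j+1)": under cross-correlation, a weight one
# cell left of center copies activation one cell to the right.
extend_right = np.zeros((3, 3))
extend_right[1, 0] = 1.0

for _ in range(4):  # one plan-extension step per recurrent tick
    shifted = correlate2d(right_channel, extend_right, mode="same")
    right_channel = np.maximum(right_channel, shifted)

print(right_channel[2])  # -> [1. 1. 1. 1. 1.]: a rightward path has grown
# The pruning/backtracking behaviour described above would use the mirrored
# kernel (weight at [1, 2]) to propagate a negative obstacle value in reverse.
```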


On the Role of Context for Discourse Relation Classification in Scientific Writing

Wan, Stephen, Liu, Wei, Strube, Michael

arXiv.org Artificial Intelligence

With the increasing use of generative Artificial Intelligence (AI) methods to support science workflows, we are interested in the use of discourse-level information to find supporting evidence for AI-generated scientific claims. A first step towards this objective is to examine the task of inferring discourse structure in scientific writing. In this work, we present a preliminary investigation of pretrained language model (PLM) and Large Language Model (LLM) approaches for Discourse Relation Classification (DRC), focusing on scientific publications, an under-studied genre for this task. We examine how context can help with the DRC task, with our experiments showing that context, as defined by discourse structure, is generally helpful. We also present an analysis of which scientific discourse relation types might benefit most from context.
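
As an illustration of what a context-augmented DRC classifier might look like, the following hedged sketch prepends preceding discourse context to an argument pair before encoding. The model name, label set, and input layout are our assumptions, not the paper's setup, and the classification head here is untrained.

```python
# Hypothetical context-augmented discourse relation classifier.
# Labels and model choice are illustrative; a real system would fine-tune.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["elaboration", "contrast", "cause", "background"]  # assumed set
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

context = "Prior work reports mixed results on small corpora."  # discourse context
arg1 = "We therefore re-annotate the benchmark."
arg2 = "Performance improves across all relation types."

# Context is prepended to the first argument; the tokenizer inserts [SEP]
# between the two text segments.
enc = tok(context + " " + arg1, arg2, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits
print(LABELS[logits.argmax(-1).item()])
```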



Online Imitation Learning for Manipulation via Decaying Relative Correction through Teleoperation

Pan, Cheng, Cheng, Hung Hon, Hughes, Josie

arXiv.org Artificial Intelligence

Teleoperated robotic manipulators enable the collection of demonstration data, which can be used to train control policies through imitation learning. However, such methods can require significant amounts of training data to develop robust policies or to adapt them to new and unseen tasks. While expert feedback can significantly enhance policy performance, providing continuous feedback is cognitively demanding and time-consuming for experts. To address this challenge, we propose a cable-driven teleoperation system that can provide spatial corrections with six degrees of freedom to the trajectories generated by a policy model. Specifically, we propose a correction method termed Decaying Relative Correction (DRC), which applies the spatial offset vector provided by the expert as a temporary, decaying correction, thereby reducing the number of intervention steps required from the expert. Our results demonstrate that DRC reduces the required expert intervention rate by 30% compared to a standard absolute corrective method. Furthermore, we show that integrating DRC within an online imitation learning framework rapidly increases the success rate of manipulation tasks such as raspberry harvesting and cloth wiping.
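
The sketch below illustrates one plausible reading of a decaying relative correction: the expert's 6-DoF offset is added to the policy's action and decays exponentially over subsequent steps. The decay schedule and rate are our assumptions; the paper's exact formulation may differ.

```python
# Hypothetical decaying relative correction for a 6-DoF action.
import numpy as np

def corrected_action(policy_action, expert_offset, steps_since_correction,
                     decay_rate=0.9):
    """Blend the policy's action with a temporarily applied expert offset.

    The offset's weight is 1.0 at the step the expert intervenes and decays
    geometrically afterwards (decay_rate is an assumed hyperparameter).
    """
    weight = decay_rate ** steps_since_correction
    return policy_action + weight * expert_offset

policy_action = np.array([0.10, 0.00, -0.05, 0.0, 0.0, 0.02])  # x,y,z,r,p,y
expert_offset = np.array([0.00, 0.03, 0.00, 0.0, 0.0, 0.00])   # nudge in y

for t in range(4):  # correction fades instead of requiring constant input
    print(t, corrected_action(policy_action, expert_offset, t))
```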


Distilling Desired Comments for Enhanced Code Review with Large Language Models

Yu, Yongda, Zhang, Lei, Rong, Guoping, Shen, Haifeng, Zhang, Jiahao, Yan, Haoxiang, Shi, Guohao, Shao, Dong, Pan, Ruiqi, Li, Yuan, Wang, Qiushi, Tian, Zhao

arXiv.org Artificial Intelligence

There has been growing interest in using Large Language Models (LLMs) for code review, thanks to their proven proficiency in code comprehension. The primary objective of most review scenarios is to generate desired review comments (DRCs) that explicitly identify issues and trigger code fixes. However, existing LLM-based solutions are often ineffective at generating DRCs, for reasons such as hallucination. To enhance their code review ability, LLMs need to be fine-tuned on a customized dataset that is ideally rich in DRCs. Such a dataset is not yet available, however, and manual annotation of DRCs is too laborious to be practical. In this paper, we propose a dataset distillation method, Desiview, which automatically constructs a distilled dataset by identifying DRCs in a code review dataset. Experiments on the CodeReviewer dataset, comprising more than 150K review entries, show that Desiview achieves 88.93% Precision, 80.37% Recall, 86.67% Accuracy, and 84.44% F1, surpassing state-of-the-art methods. To validate the effect of such a distilled dataset on enhancing LLMs' code review ability, we first fine-tune the latest LLaMA series (i.e., LLaMA 3 and LLaMA 3.1) to build model Desiview4FT. We then enhance the training effect through KTO alignment by feeding the review comments identified as non-DRCs to the LLMs, resulting in model Desiview4FA. Verification results indicate that Desiview4FA slightly outperforms Desiview4FT, and both models improve significantly over the base models in generating DRCs. Human evaluation confirms that both models identify issues more accurately and tend to generate review comments that better describe the issues contained in the code than the base LLMs do.
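
To illustrate the shape of the pipeline, here is a hedged sketch of the distill-and-route step: comments classified as DRCs feed fine-tuning data, while non-DRCs are kept for KTO alignment. The is_drc heuristic is a hypothetical stand-in for Desiview's actual identification method.

```python
# Hypothetical routing of review comments into SFT data vs. KTO negatives.
def is_drc(comment: str) -> bool:
    # Stand-in proxy: a desired review comment explicitly flags an issue.
    issue_cues = ("should", "bug", "missing", "incorrect", "leak")
    return any(cue in comment.lower() for cue in issue_cues)

reviews = [  # toy entries; a real corpus would be CodeReviewer-scale
    {"diff": "...", "comment": "This loop leaks the file handle."},
    {"diff": "...", "comment": "Nice refactor, thanks!"},
]

sft_data, kto_negatives = [], []
for r in reviews:
    (sft_data if is_drc(r["comment"]) else kto_negatives).append(r)

print(len(sft_data), "DRCs for fine-tuning;",
      len(kto_negatives), "non-DRCs for KTO alignment")
```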


Planning behavior in a recurrent neural network that plays Sokoban

Garriga-Alonso, Adrià, Taufeeque, Mohammad, Gleave, Adam

arXiv.org Artificial Intelligence

In many tasks, the performance of both humans and some neural networks (NNs) improves with more reasoning: whether by giving a human time to think before making a chess move, or by prompting or training a large language model (LLM) to reason step by step [Kojima et al., 2022, OpenAI, 2024]. Among other reasoning capabilities, goal-oriented reasoning is particularly relevant to AI alignment. So-called "mesa-optimizers" - AIs that have learned to pursue goals through internal reasoning [Hubinger et al., 2019] - may internalize goals different from the training objective, leading to goal misgeneralization [Di Langosco et al., 2022, Shah et al., 2022]. Understanding how NNs learn to plan and represent the objective could be key to detecting, preventing, or correcting goal misgeneralization. In this work, we focus on interpreting a Deep Repeated ConvLSTM [Guez et al., 2019, DRC] trained on Sokoban, a puzzle game often used as a planning benchmark [Peters et al., 2023]. We interpret the best network from Guez et al. [2019], DRC(3, 3), with 3 recurrent layers that are applied 3 times per environment step. Further details of the network are provided in Section 2. We find that its internal plan representation [Bush et al., 2025] is causal, improves with more computation, and that the DRC learns to take advantage of that by often "pacing" to get enough time to refine its internal plan. We show similar results in Appendix B for another DRC network and causal plan representation in a ResNet model.
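
To clarify the DRC(3, 3) control flow, the following sketch applies 3 stacked ConvLSTM layers 3 times ("ticks") per environment step. The cell, channel sizes, and input encoding are simplified stand-ins for the architecture in Guez et al. [2019], not a faithful reimplementation.

```python
# Simplified DRC(3, 3)-style loop: layers x ticks of ConvLSTM updates.
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # One conv produces all four LSTM gates from [input, hidden].
        self.gates = nn.Conv2d(2 * ch, 4 * ch, kernel_size=3, padding=1)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

layers, ticks, ch = 3, 3, 32                      # the "(3, 3)" in DRC(3, 3)
cells = nn.ModuleList(ConvLSTMCell(ch) for _ in range(layers))
obs = torch.randn(1, ch, 10, 10)                  # encoded observation (toy)
states = [(torch.zeros_like(obs), torch.zeros_like(obs))
          for _ in range(layers)]

for _ in range(ticks):   # extra ticks per env step = extra planning compute
    x = obs
    for l, cell in enumerate(cells):
        x, states[l] = cell(x, states[l])
# x now holds the top-layer hidden state fed to the policy/value heads.
```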


US 'strongly condemns' violence in DR Congo after alleged drone attack

Al Jazeera

The United States has condemned growing violence in the Democratic Republic of the Congo (DRC), blaming an armed group it says is backed by neighbouring Rwanda. Fighting has flared in recent days in the eastern part of the DRC between the M23 rebel group and government forces, resulting in dozens of soldiers and civilians being killed or wounded. The fighting has also pushed tens of thousands of civilians to flee towards the eastern city of Goma, which is located between Lake Kivu and the border with Rwanda. "This escalation has increased the risk to millions of people already exposed to human rights abuses including displacement, deprivation, and attacks," US State Department spokesman Matthew Miller said in a statement. "The United States condemns Rwanda's support for the M23 armed group and calls on Rwanda to immediately withdraw all Rwanda Defense Force personnel from the DRC and remove its surface-to-air missile systems, which threaten the lives of civilians, UN and other regional peacekeepers, humanitarian actors, and commercial flights in eastern DRC," Miller added.


DR Congo accuses Rwanda of airport 'drone attack' in restive east

Al Jazeera

The Democratic Republic of the Congo has accused Rwanda of carrying out a drone attack that damaged a civilian aircraft at the airport in the strategic eastern city of Goma, the capital of North Kivu province. Fighting has flared in recent days around the town of Sake, 20km (12 miles) from Goma, between M23 rebels – which Kinshasa says are backed by Kigali – and Congolese government forces. "On the night of Friday to Saturday, at 2 o'clock in the morning local time, there was a drone attack by the Rwandan army," said Lieutenant-Colonel Guillaume Ndjike Kaito, army spokesperson for North Kivu province. "It had obviously come from the Rwandan territory, violating the territorial integrity of the Democratic Republic of the Congo," he added in a video broadcast by the governorate. The drones "targeted aircraft of DRC armed forces".


The Distributional Reward Critic Architecture for Perturbed-Reward Reinforcement Learning

Chen, Xi, Zhu, Zhihui, Perrault, Andrew

arXiv.org Artificial Intelligence

We study reinforcement learning in the presence of an unknown reward perturbation. Existing methodologies for this problem make strong assumptions including reward smoothness, known perturbations, and/or perturbations that do not modify the optimal policy. We study the case of unknown arbitrary perturbations that discretize and shuffle reward space, but have the property that the true reward belongs to the most frequently observed class after perturbation. This class of perturbations generalizes existing classes (and, in the limit, all continuous bounded perturbations) and defeats existing methods. We introduce an adaptive distributional reward critic and show theoretically that it can recover the true rewards under technical conditions. Under the targeted perturbation in discrete and continuous control tasks, we win/tie the highest return in 40/57 settings (compared to 16/57 for the best baseline). Even under the untargeted perturbation, we retain an edge over the baseline designed specifically for that setting.

The use of reward as an objective is a central feature of reinforcement learning (RL) that has been hypothesized to constitute a path to general intelligence Silver et al. (2021). The reward is also the cause of a substantial amount of human effort associated with RL, from engineering to reduce difficulties caused by sparse, delayed, or misspecified rewards Ng et al. (1999); Hadfield-Menell et al. (2017); Qian et al. (2023) to gathering large volumes of human-labeled rewards used for tuning large language models (LLMs) Ouyang et al. (2022); Bai et al. (2022).
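
A toy illustration of the perturbation class: if the true reward remains the most frequently observed value after discretization and shuffling, taking the mode of the observed rewards recovers it. The paper's distributional critic learns this mode; here we simply compute it directly, with made-up reward values and frequencies.

```python
# Toy demonstration that the mode of perturbed rewards recovers the truth.
import numpy as np

rng = np.random.default_rng(0)
true_reward = 1.0
# Assumed setup: 60% of observations keep the true (discretized) reward;
# the rest are shuffled into other reward classes.
observed = rng.choice([1.0, -2.0, 0.5], size=1000, p=[0.6, 0.25, 0.15])

values, counts = np.unique(observed, return_counts=True)
recovered = values[np.argmax(counts)]
print(recovered == true_reward)  # True: the most frequent class is the truth
```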


Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot Document-Level Question Answering

McDonald, Tavish, Tsan, Brian, Saini, Amar, Ordonez, Juanita, Gutierrez, Luis, Nguyen, Phan, Mason, Blake, Ng, Brenda

arXiv.org Artificial Intelligence

Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question and answer). However, data curation for document QA is uniquely challenging because the context (i.e. answer evidence passage) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers -- extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
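
Below is a hedged sketch of what a three-stage detect-retrieve-comprehend pipeline could look like end to end. The specific libraries (pypdf for extraction, TF-IDF retrieval, a Hugging Face QA pipeline) and the file name are our substitutions for illustration, not the components evaluated in the paper.

```python
# Illustrative three-stage document QA pipeline.
from pypdf import PdfReader
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

# 1) Detect/extract: pull text passages out of a (hypothetical) PDF.
pages = [p.extract_text() or "" for p in PdfReader("paper.pdf").pages]
passages = [s for page in pages for s in page.split("\n\n") if s.strip()]

# 2) Retrieve: rank passages against the question to form a well-posed context.
question = "What dataset is used for evaluation?"
vec = TfidfVectorizer().fit(passages + [question])
scores = cosine_similarity(vec.transform([question]), vec.transform(passages))
context = passages[scores.argmax()]

# 3) Comprehend: extractive QA over the retrieved evidence passage.
qa = pipeline("question-answering")
print(qa(question=question, context=context)["answer"])
```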