MARG: Multi-Agent Review Generation for Scientific Papers
D'Arcy, Mike, Hope, Tom, Birnbaum, Larry, Downey, Doug
We study the ability of LLMs to generate feedback for scientific papers and develop MARG, a feedback generation approach that uses multiple LLM instances engaging in internal discussion. By distributing paper text across agents, MARG can consume the full text of papers beyond the input length limitations of the base LLM, and by specializing agents and incorporating sub-tasks tailored to different comment types (experiments, clarity, impact), it improves the helpfulness and specificity of feedback. In a user study, baseline methods using GPT-4 were rated as producing generic or very generic comments more than half the time, and even the best baseline yielded only 1.7 comments per paper that were rated as good overall. Our system substantially improves GPT-4's ability to generate specific and helpful feedback, reducing the rate of generic comments from 60% to 29% and producing 3.7 good comments per paper (a 2.2x improvement).
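As a rough illustration of the approach (a one-round simplification with hypothetical prompts and helper names, not the released MARG code, whose agents hold a multi-turn discussion), the core idea can be sketched against an OpenAI-style chat API:

    # Sketch: split the paper across worker agents, then aggregate.
    from openai import OpenAI

    client = OpenAI()

    def chat(system: str, user: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content

    def generate_feedback(paper_text: str, chunk_size: int = 6000) -> str:
        # Distribute the paper across worker agents so the full text is
        # covered even when it exceeds the base LLM's context window.
        chunks = [paper_text[i:i + chunk_size]
                  for i in range(0, len(paper_text), chunk_size)]
        notes = [chat("You hold one part of a paper under review. Note issues "
                      "with experiments, clarity, and impact in your part.",
                      chunk)
                 for chunk in chunks]
        # A leader agent merges the workers' notes into review comments.
        return chat("You are the lead reviewer. Merge the agents' notes into "
                    "specific, actionable comments; avoid generic remarks.",
                    "\n\n".join(notes))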
SciRepEval: A Multi-Format Benchmark for Scientific Document Representations
Singh, Amanpreet, D'Arcy, Mike, Cohan, Arman, Downey, Doug, Feldman, Sergey
Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 24 challenging and realistic tasks, 8 of which are new, across four formats: classification, regression, ranking and search. We then use this benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters and find they outperform the existing single-embedding state-of-the-art by over 2 points absolute. We release the resulting family of multi-format models, called SPECTER2, for the community to use and build on.
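The control-code idea can be illustrated in a few lines: prepend a task-format token so a single encoder yields a different embedding per format. The token strings below are hypothetical and the snippet skips the fine-tuning that makes the codes meaningful; it is not the exact SPECTER2 interface:

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")
    model = AutoModel.from_pretrained("allenai/specter2_base")

    # One (hypothetical) control code per task format.
    CODES = {"classification": "[CLF]", "regression": "[RGN]",
             "ranking": "[PRX]", "search": "[SRCH]"}

    def embed(title: str, abstract: str, task_format: str) -> torch.Tensor:
        # Prepending a format code lets one model emit multiple
        # format-tailored embeddings per document.
        text = f"{CODES[task_format]} {title}{tokenizer.sep_token}{abstract}"
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=512)
        with torch.no_grad():
            out = model(**inputs)
        return out.last_hidden_state[:, 0, :]  # CLS-token pooling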
ARIES: A Corpus of Scientific Paper Edits Made in Response to Peer Reviews
D'Arcy, Mike, Ross, Alexis, Bransom, Erin, Kuehl, Bailey, Bragg, Jonathan, Hope, Tom, Downey, Doug
Revising scientific papers based on peer feedback is a challenging task that requires not only deep scientific knowledge and reasoning, but also the ability to recognize the implicit requests in high-level feedback and to choose the best of many possible ways to update the manuscript in response. We introduce this task for large language models and release ARIES, a dataset of review comments and their corresponding paper edits, to enable training and evaluating models. We study two versions of the task: comment-edit alignment and edit generation, and evaluate several baselines, including GPT-4. We find that models struggle even to identify the edits that correspond to a comment, especially in cases where the comment is phrased in an indirect way or where the edit addresses the spirit of a comment but not the precise request. When tasked with generating edits, GPT-4 often succeeds in addressing comments on a surface level, but it rigidly follows the wording of the feedback rather than the underlying intent, and includes fewer technical details than human-written edits. We hope that our formalization, dataset, and analysis will form a foundation for future work in this area.
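A small sketch of how the comment-edit alignment task can be represented and scored (field names are illustrative, not the exact ARIES schema):

    from dataclasses import dataclass

    @dataclass
    class AlignmentExample:
        comment: str       # a reviewer comment
        edits: list[str]   # candidate edits (diffs between paper versions)
        gold: set[int]     # indices of edits that address the comment

    def alignment_f1(predicted: set[int], gold: set[int]) -> float:
        # Models are scored on picking out the edits that respond to a
        # comment; indirectly phrased comments make this difficult.
        tp = len(predicted & gold)
        if tp == 0:
            return 0.0
        precision = tp / len(predicted)
        recall = tp / len(gold)
        return 2 * precision * recall / (precision + recall)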
Embedding Recycling for Language Models
Saad-Falcon, Jon, Singh, Amanpreet, Soldaini, Luca, D'Arcy, Mike, Cohan, Arman, Downey, Doug
Real-world applications of neural language models often involve running many different models over the same corpus. The high computational cost of these runs has led to interest in techniques that can reuse the contextualized embeddings produced in previous runs to speed training and inference of future ones. We refer to this approach as embedding recycling (ER). While multiple ER techniques have been proposed, their practical effectiveness is still unknown because existing evaluations consider very few models and do not adequately account for overhead costs. We perform an extensive evaluation of ER across eight different models (17 to 900 million parameters) and fourteen tasks in English. We show that a simple ER technique that caches activations from an intermediate layer of a pretrained model and learns task-specific adapters on the later layers is broadly effective. For the best-performing baseline in our experiments (DeBERTa-v2 XL), adding a precomputed cache results in a >90% speedup during training and an 87-91% speedup for inference, with negligible impact on accuracy. Our analysis reveals important areas for future work.
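The caching technique is easy to sketch with a BERT-style encoder: run the frozen lower layers once per corpus, store the layer-K activations, and re-enter the encoder at layer K for each downstream task. This is a minimal sketch of the idea only; the paper's adapter variants additionally attach task-specific parameters to the upper layers:

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")
    K = 6  # cache boundary: layers below K are frozen and run once

    @torch.no_grad()
    def cache_activations(texts: list[str]):
        inputs = tokenizer(texts, return_tensors="pt",
                           padding=True, truncation=True)
        out = model(**inputs, output_hidden_states=True)
        # hidden_states[K] is the output of encoder layer K
        return out.hidden_states[K], inputs["attention_mask"]

    def run_upper_layers(cached: torch.Tensor, mask: torch.Tensor):
        # Only these layers need gradients, which is where the
        # reported training speedup comes from.
        ext_mask = model.get_extended_attention_mask(mask, cached.shape[:2])
        hidden = cached
        for layer in model.encoder.layer[K:]:
            hidden = layer(hidden, attention_mask=ext_mask)[0]
        return hidden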
DeepMoTIon: Learning to Navigate Like Humans
Hamandi, Mahmoud, D'Arcy, Mike, Fazli, Pooyan
We present a novel human-aware navigation approach in which the robot learns to mimic humans in order to navigate safely in crowds. The presented model, referred to as DeepMoTIon, is trained on pedestrian surveillance data to predict human velocity. The robot processes LiDAR scans with the trained network to navigate to the target location. We conduct extensive experiments to assess the different components of our network and show that each is necessary for imitating humans. Our experiments show that DeepMoTIon outperforms the state of the art in human imitation and reaches the target in 100% of the test cases without breaching humans' safe distance.
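An illustrative architecture for the prediction step (not the published network): a small model that maps a LiDAR scan and a goal direction to a human-like velocity, trained by regressing the velocities observed in the pedestrian data:

    import torch
    import torch.nn as nn

    class VelocityNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.scan_encoder = nn.Sequential(
                nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
                nn.AdaptiveAvgPool1d(16), nn.Flatten(),
            )
            self.head = nn.Sequential(
                nn.Linear(32 * 16 + 2, 128), nn.ReLU(),
                nn.Linear(128, 2),  # predicted (vx, vy)
            )

        def forward(self, scan: torch.Tensor, goal: torch.Tensor):
            # scan: (batch, num_beams) LiDAR ranges; goal: (batch, 2)
            feats = self.scan_encoder(scan.unsqueeze(1))
            return self.head(torch.cat([feats, goal], dim=-1))

    # Imitation objective: match the velocity a human took, e.g.
    # loss = torch.nn.functional.mse_loss(net(scan, goal), human_velocity)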