longformer




Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

Broy, Frederik, Züfle, Maike, Niehues, Jan

arXiv.org Artificial Intelligence

Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
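As a rough illustration of the retrieval step in such a dual-encoder setup: once a talk encoder and a paper encoder have produced fixed-size embeddings, candidate papers can be ranked by cosine similarity. The function names and toy vectors below are illustrative sketches, not the paper's implementation.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_papers(talk_vec, paper_vecs):
    """Rank candidate paper ids by similarity to the talk embedding."""
    scored = [(pid, cosine(talk_vec, vec)) for pid, vec in paper_vecs.items()]
    return [pid for pid, _ in sorted(scored, key=lambda x: -x[1])]
```

In a trained dual encoder, `talk_vec` and the entries of `paper_vecs` would come from the two learned encoders; the ranking step itself stays this simple.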


REOrdering Patches Improves Vision Models

Kutscher, Declan, Chan, David M., Bai, Yutong, Darrell, Trevor, Gupta, Ritwik

arXiv.org Artificial Intelligence

Sequence models such as transformers require inputs to be represented as one-dimensional sequences. In vision, this typically involves flattening images using a fixed row-major (raster-scan) order. While full self-attention is permutation-equivariant, modern long-sequence transformers increasingly rely on architectural approximations that break this invariance and introduce sensitivity to patch ordering. We show that patch order significantly affects model performance in such settings, with simple alternatives like column-major or Hilbert curves yielding notable accuracy shifts. Motivated by this, we propose REOrder, a two-stage framework for discovering task-optimal patch orderings. First, we derive an information-theoretic prior by evaluating the compressibility of various patch sequences. Then, we learn a policy over permutations by optimizing a Plackett-Luce policy using REINFORCE. This approach enables efficient learning in a combinatorial permutation space. REOrder improves top-1 accuracy over row-major ordering on ImageNet-1K by up to 3.01% and Functional Map of the World by 13.35%.
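The effect of patch ordering can be made concrete with a toy re-flattening step. `column_major_order` below is a hypothetical helper showing one of the simple alternatives the abstract mentions, not REOrder's actual implementation:

```python
def column_major_order(rows, cols):
    """Permutation that visits patches column by column
    instead of the default row-major (raster-scan) order."""
    return [r * cols + c for c in range(cols) for r in range(rows)]

def reorder(patches, perm):
    """Apply a patch permutation before feeding the sequence to the model."""
    return [patches[i] for i in perm]
```

For a 2x2 patch grid flattened row-major as `[p0, p1, p2, p3]`, the column-major permutation is `[0, 2, 1, 3]`; REOrder searches over such permutations with a learned Plackett-Luce policy rather than fixing one by hand.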


DrDiff: Dynamic Routing Diffusion with Hierarchical Attention for Breaking the Efficiency-Quality Trade-off

Zhang, Jusheng, Fan, Yijia, Cai, Kaitong, Huang, Zimeng, Sun, Xiaofei, Wang, Jian, Tang, Chengpei, Wang, Keze

arXiv.org Artificial Intelligence

This paper introduces DrDiff, a novel framework for long-text generation that overcomes the efficiency-quality trade-off through three core technologies. First, we design a dynamic expert scheduling mechanism that intelligently allocates computational resources during the diffusion process based on text complexity, enabling more efficient handling of generation tasks of varying difficulty. Second, we introduce a Hierarchical Sparse Attention (HSA) mechanism that adaptively adjusts attention patterns according to input length, reducing computational complexity from O($n^2$) to O($n$) while maintaining model performance. Finally, we propose a soft absorption guidance optimization strategy that, combined with DPM-Solver++, reduces the number of diffusion steps and significantly improves generation speed. Comprehensive experiments on various long-text generation benchmarks demonstrate the superiority of DrDiff over existing SOTA methods.
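A minimal sketch of how sparse attention attains linear cost: if each position attends only to neighbours within a fixed window, total work is O(n · window) rather than O(n²). A fixed local window is only one possible sparse pattern and the helper below is illustrative, not DrDiff's HSA implementation:

```python
import math

def local_attention(score_fn, values, window):
    """Windowed softmax attention over scalar values: position i
    attends only to positions within `window` of i, so the cost
    grows linearly in sequence length (O(n * window))."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        weights = [math.exp(score_fn(i, j)) for j in range(lo, hi)]
        z = sum(weights)
        out.append(sum(w / z * values[j] for w, j in zip(weights, range(lo, hi))))
    return out
```

With uniform scores the output at each position is just the mean of its local neighbourhood, which makes the windowing easy to verify by hand.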


47d40767c7e9df50249ebfd9c7cfff77-AuthorFeedback.pdf

Neural Information Processing Systems

We thank the reviewers for their valuable comments! "Unclear if the proposed method is better than only using LSH." Thank you for the suggestion. ALSH significantly outperforms both E2LSH and the Reformer LSH scheme on SMYRF-BERT base (see also Table 2).


Beyond Token Limits: Assessing Language Model Performance on Long Text Classification

Sebők, Miklós, Kovács, Viktor, Bánóczy, Martin, Eriksen, Daniel Møller, Neptune, Nathalie, Roussille, Philippe

arXiv.org Artificial Intelligence

The most widely used large language models in the social sciences (such as BERT and its derivatives, e.g. RoBERTa) limit the length of the input text they can process to produce predictions. This is a particularly pressing issue for classification tasks that must handle long input texts. One such area deals with laws and draft laws (bills), which can run to several hundred pages and are therefore not amenable to processing with models that handle only e.g. 512 tokens. In this paper, we show results from experiments covering 5 languages with XLM-RoBERTa, Longformer, GPT-3.5, and GPT-4 models on the multiclass classification task of the Comparative Agendas Project, whose codebook comprises 21 policy topic labels from education to health care. Results show no particular advantage for the Longformer model, despite its being pre-trained specifically for handling long inputs. The comparison between the GPT variants and the best-performing open model yielded an edge for the latter. An analysis of class-level factors points to the importance of class support and substantive overlap between specific categories when it comes to performance on long text inputs.
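A common workaround for a fixed token limit, sketched here as chunk-then-vote aggregation (the paper does not prescribe this exact scheme; the names and the majority-vote choice are illustrative):

```python
from collections import Counter

def chunk(tokens, size=512):
    """Split a long token sequence into windows a fixed-length model can handle."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def classify_long(tokens, classify_chunk, size=512):
    """Classify each chunk independently, then majority-vote over chunk labels."""
    labels = [classify_chunk(c) for c in chunk(tokens, size)]
    return Counter(labels).most_common(1)[0][0]
```

Other aggregation choices (mean of chunk logits, attention over chunk embeddings) are equally possible; the trade-off the paper probes is whether such workarounds beat models built for long inputs.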


Detection of security smells in IaC scripts through semantics-aware code and language processing

War, Aicha, Rawass, Adnan A., Kabore, Abdoul K., Samhi, Jordan, Klein, Jacques, Bissyande, Tegawende F.

arXiv.org Artificial Intelligence

Infrastructure as Code (IaC) automates the provisioning and management of IT infrastructure through scripts and tools, streamlining software deployment. Prior studies have shown that IaC scripts often contain recurring security misconfigurations, and several detection and mitigation approaches have been proposed. Most of these rely on static analysis, using statistical code representations or Machine Learning (ML) classifiers to distinguish insecure configurations from safe code. In this work, we introduce a novel approach that enhances static analysis with semantic understanding by jointly leveraging natural language and code representations. Our method builds on two complementary ML models: CodeBERT, to capture semantics across code and text, and Longformer, to represent long IaC scripts without losing contextual information. We evaluate our approach on misconfiguration datasets from two widely used IaC tools, Ansible and Puppet. To validate its effectiveness, we conduct two ablation studies (removing code text from the natural language input, and truncating scripts to reduce context) and compare against four large language models (LLMs) and prior work. Results show that semantic enrichment substantially improves detection, raising precision and recall from 0.46 and 0.79 to 0.92 and 0.88 on Ansible, and from 0.55 and 0.97 to 0.87 and 0.75 on Puppet, respectively.
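Jointly leveraging two representations can be sketched as simple feature fusion: concatenate the code embedding and the text embedding, then feed the combined vector to a classification head. This is an illustrative guess at the fusion step, not the authors' exact architecture:

```python
def fuse(code_vec, text_vec):
    """Concatenate the code-model and text-model embeddings
    into one feature vector for a downstream classifier."""
    return list(code_vec) + list(text_vec)

def score(features, weights, bias=0.0):
    """Toy linear layer standing in for the classification head."""
    return sum(f * w for f, w in zip(features, weights)) + bias
```

In the real pipeline, `code_vec` and `text_vec` would come from CodeBERT and Longformer respectively, and the head would be trained on the labeled misconfiguration data.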


Scaling Legal AI: Benchmarking Mamba and Transformers for Statutory Classification and Case Law Retrieval

Maurya, Anuraj

arXiv.org Artificial Intelligence

The rapid growth of statutory corpora and judicial decisions requires scalable legal AI systems capable of classification and retrieval over extremely long contexts. Transformer-based architectures (e.g., Longformer, DeBERTa) dominate current legal NLP benchmarks but struggle with quadratic attention costs, limiting efficiency and scalability. In this work, we present the first comprehensive benchmarking of Mamba, a state-space model (SSM) with linear-time selective mechanisms, against leading transformer models for statutory classification and case law retrieval. We evaluate models on open-source legal corpora including LexGLUE, EUR-Lex, and ILDC, covering statutory tagging, judicial outcome prediction, and case retrieval tasks. Metrics include accuracy, recall at k, mean reciprocal rank (MRR), and normalized discounted cumulative gain (nDCG), alongside throughput measured in tokens per second and maximum context length. Results show that Mamba's linear scaling enables processing of legal documents several times longer than transformers, while maintaining or surpassing retrieval and classification performance. This study introduces a new legal NLP benchmark suite for long-context modeling, along with open-source code and datasets to support reproducibility. Our findings highlight trade-offs between state-space models and transformers, providing guidance for deploying scalable legal AI in statutory analysis, judicial decision support, and policy research.
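The retrieval metrics named above are standard and easy to compute directly. A minimal stdlib sketch of recall at k and mean reciprocal rank (function names are ours, not from the benchmark suite):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents appearing in the top-k results."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(ranked_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for pos, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / pos
                break
    return total / len(ranked_lists)
```

nDCG additionally discounts lower-ranked hits logarithmically and normalizes by the ideal ranking; the two functions above cover the simpler metrics in the suite.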


Enhancing Abstractive Summarization of Scientific Papers Using Structure Information

Bao, Tong, Zhang, Heng, Zhang, Chengzhi

arXiv.org Artificial Intelligence

The code and dataset can be accessed at https://github.com/tongbao96/code-for-SFR-AS

1. Introduction

With the rapid growth of scientific research and the academic community, numerous scientific papers are published daily. This notable increase in publications has led to information overload, requiring scholars to spend considerable time reading and comprehending a large volume of articles. The goal of automatic summarization is to employ algorithms to extract key information and reorganize it into shorter, concise summaries (El-Kassas et al., 2021). Automatic summarization holds significant research value in fields such as information retrieval (Spina et al., 2017), question answering systems (Yulianti et al., 2018), and content review (Hu et al., 2017). Existing automatic summarization methods are broadly divided into two categories: extractive methods and abstractive methods. Extractive methods generate summaries by selecting sentences directly from the original document, resulting in summaries that are more accurate and semantically consistent but may lack coherence. In contrast, abstractive methods generate summaries based on an understanding of the text rather than extracting sentences directly from the original document. Summaries produced by this approach are therefore typically more coherent and better aligned with human reading preferences (El-Kassas et al., 2021; Ghadimi & Beigy, 2022). In this paper, we focus on abstractive summarization.
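The extractive approach described above can be made concrete with a minimal frequency-based baseline: score each sentence by the corpus frequency of its words and keep the top-scoring sentences in their original order. This is an illustrative toy, not the method proposed in the paper:

```python
from collections import Counter

def extractive_summary(sentences, n=1):
    """Score sentences by summed word frequency, keep the top-n,
    and emit them in original document order (a minimal extractive baseline)."""
    freq = Counter(w.lower() for s in sentences for w in s.split())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sum(freq[w.lower()] for w in sentences[i].split()),
                    reverse=True)
    keep = sorted(ranked[:n])
    return [sentences[i] for i in keep]
```

An abstractive system, by contrast, would generate new sentences conditioned on the document rather than copying existing ones, which is exactly the coherence advantage the paragraph above describes.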