Government
Instruction Tuning Chronologically Consistent Language Models
He, Songrun, Lv, Linying, Manela, Asaf, Wu, Jimmy
We introduce a family of chronologically consistent, instruction-tuned large language models to eliminate lookahead bias. Each model is trained only on data available before a clearly defined knowledge-cutoff date, ensuring strict temporal separation from any post-cutoff data. The resulting framework offers (i) a simple, conversational chat interface, (ii) fully open, fixed model weights that guarantee replicability, and (iii) a conservative lower bound on forecast accuracy, isolating the share of predictability that survives once training leakage is removed. Together, these features provide researchers with an easy-to-use generative AI tool useful for a wide range of prediction tasks that is free of lookahead bias.
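The cutoff rule described in this abstract can be illustrated with a minimal sketch. The `Document` type, the `filter_by_cutoff` helper, and the sample dates are hypothetical and not taken from the paper; the sketch only shows the idea of keeping text dated strictly before a knowledge cutoff.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Document:
    text: str
    published: date  # hypothetical field: the date the text became available


def filter_by_cutoff(docs, cutoff):
    """Keep only documents published strictly before the cutoff date."""
    return [d for d in docs if d.published < cutoff]


# Made-up corpus for illustration only.
corpus = [
    Document("earnings call, Q4 1999", date(2000, 1, 20)),
    Document("news wire, mid 2005", date(2005, 6, 1)),
    Document("blog post, 2012", date(2012, 3, 15)),
]

train_set = filter_by_cutoff(corpus, cutoff=date(2006, 1, 1))
print([d.text for d in train_set])
```

A model trained only on `train_set` could not have seen the 2012 document, which is the property the abstract calls chronological consistency.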
Read Between the Lines: A Benchmark for Uncovering Political Bias in Bangla News Articles
Lia, Nusrat Jahan, Dipta, Shubhashis Roy, Zehady, Abdullah Khan, Islam, Naymul, Chakraborty, Madhusodan, Wasif, Abdullah Al
Detecting media bias is crucial, especially in the South Asian region. Despite this, annotated datasets and computational studies for Bangla political bias research remain scarce. This scarcity matters because political stance detection in Bangla news requires an understanding of linguistic cues, cultural context, subtle biases, rhetorical strategies, code-switching, implicit sentiment, and socio-political background. To address this, we introduce the first benchmark dataset of 200 politically significant and highly debated Bangla news articles, labeled for government-leaning, government-critique, and neutral stances, alongside diagnostic analyses for evaluating large language models (LLMs). Our comprehensive evaluation of 28 proprietary and open-source LLMs shows strong performance in detecting government-critique content (F1 up to 0.83) but substantial difficulty with neutral articles (F1 as low as 0.00). Models also tend to over-predict government-leaning stances, often misinterpreting ambiguous narratives. This dataset and its associated diagnostics provide a foundation for advancing stance detection in Bangla media research and offer insights for improving LLM performance in low-resource languages.
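To make the per-class F1 figures in this abstract concrete, here is a minimal sketch of how a class can score 0.00 when a model never predicts it. The `per_class_f1` helper and the four toy labels are illustrative assumptions, not the benchmark's evaluation code or data.

```python
def per_class_f1(gold, pred, labels):
    """Return the F1 score for each label, treating it as the positive class."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


LABELS = ["government-leaning", "government-critique", "neutral"]

# Made-up example: the model labels both neutral articles as government-leaning.
gold = ["neutral", "neutral", "government-critique", "government-leaning"]
pred = ["government-leaning", "government-leaning",
        "government-critique", "government-leaning"]

scores = per_class_f1(gold, pred, LABELS)
print(scores)
```

In this toy case the neutral class has no true positives, so its F1 is 0.0, while over-predicting government-leaning lowers that class's precision.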
Benchmarking LLM Privacy Recognition for Social Robot Decision Making
Sullivan, Dakota, Zhang, Shirley, Li, Jennica, Kirkorian, Heather, Mutlu, Bilge, Fawaz, Kassem
While robots have previously utilized rule-based systems or probabilistic models for user interaction, the rapid evolution of large language models (LLMs) presents new opportunities to develop LLM-powered robots for enhanced human-robot interaction (HRI). To fully realize these capabilities, however, robots need to collect data such as audio, fine-grained images, video, and locations. As a result, LLMs often process sensitive personal information, particularly within private environments, such as homes. Given the tension between utility and privacy risks, evaluating how current LLMs manage sensitive data is critical. Specifically, we aim to explore the extent to which out-of-the-box LLMs are privacy-aware in the context of household robots. In this work, we present a set of privacy-relevant scenarios developed using the Contextual Integrity (CI) framework. We first surveyed users' privacy preferences regarding in-home robot behaviors and then examined how their privacy orientations affected their choices of these behaviors (N = 450). We then provided the same set of scenarios and questions to state-of-the-art LLMs (N = 10) and found that the agreement between humans and LLMs was generally low. To further investigate the capabilities of LLMs as potential privacy controllers, we implemented four additional prompting strategies and compared their results. We discuss the performance of the evaluated models as well as the implications and potential of AI privacy awareness in human-robot interaction.
AutoMalDesc: Large-Scale Script Analysis for Cyber Threat Research
Apostu, Alexandru-Mihai, Preda, Andrei, Damir, Alexandra Daniela, Bolocan, Diana, Ionescu, Radu Tudor, Croitoru, Ioana, Gaman, Mihaela
Generating thorough natural language explanations for threat detections remains an open problem in cybersecurity research, despite significant advances in automated malware detection systems. In this work, we present AutoMalDesc, an automated static analysis summarization framework that, following initial training on a small set of expert-curated examples, operates independently at scale. This approach leverages an iterative self-paced learning pipeline to progressively enhance output quality through synthetic data generation and validation cycles, eliminating the need for extensive manual data annotation. Evaluation across 3,600 diverse samples in five scripting languages demonstrates statistically significant improvements between iterations, showing consistent gains in both summary quality and classification accuracy. Our comprehensive validation approach combines quantitative metrics based on established malware labels with qualitative assessment from both human experts and LLM-based judges, confirming both technical precision and linguistic coherence of generated summaries. To facilitate reproducibility and advance research in this domain, we publish our complete dataset of more than 100K script samples, including annotated seed (0.9K) and test (3.6K)
Whistledown: Combining User-Level Privacy with Conversational Coherence in LLMs
McMurray, Chelsea, Tirmazi, Hayder
Users increasingly rely on large language models (LLMs) for personal, emotionally charged, and socially sensitive conversations. However, prompts sent to cloud-hosted models can contain personally identifiable information (PII) that users do not want logged, retained, or leaked. We observe this to be especially acute when users discuss friends, coworkers, or adversaries, i.e., when they spill the tea. Enterprises face the same challenge when they want to use LLMs for internal communication and decision-making. In this whitepaper, we present Whistledown, a best-effort privacy layer that modifies prompts before they are sent to the LLM. Whistledown combines pseudonymization and $ε$-local differential privacy ($ε$-LDP) with transformation caching to provide best-effort privacy protection without sacrificing conversational utility. Whistledown is designed to have low compute and memory overhead, allowing it to be deployed directly on a client's device in the case of individual users. For enterprise users, Whistledown is deployed centrally within a zero-trust gateway that runs on an enterprise's trusted infrastructure. Whistledown requires no changes to the existing APIs of popular LLM providers.
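The combination of pseudonymization and transformation caching mentioned in this abstract can be sketched roughly as follows. This is not Whistledown's implementation: the `Pseudonymizer` class, the `[PERSON_n]` placeholder format, and the assumption that names are supplied by the caller are all hypothetical, and the $ε$-LDP component is omitted entirely.

```python
class Pseudonymizer:
    """Toy privacy layer: swaps names for stable placeholders and back."""

    def __init__(self):
        self._forward = {}  # real name -> placeholder (the transformation cache)
        self._reverse = {}  # placeholder -> real name

    def _alias(self, name):
        # Reuse the cached placeholder so a name maps to the same alias
        # across turns, which keeps the conversation coherent.
        if name not in self._forward:
            alias = f"[PERSON_{len(self._forward) + 1}]"
            self._forward[name] = alias
            self._reverse[alias] = name
        return self._forward[name]

    def protect(self, prompt, names):
        """Replace each supplied name before the prompt leaves the device."""
        for name in names:
            prompt = prompt.replace(name, self._alias(name))
        return prompt

    def restore(self, response):
        """Map placeholders in the model's response back to real names."""
        for alias, name in self._reverse.items():
            response = response.replace(alias, name)
        return response


layer = Pseudonymizer()
first = layer.protect("Alice argued with Bob.", ["Alice", "Bob"])
second = layer.protect("What should Alice say next?", ["Alice"])
restored = layer.restore("[PERSON_1] could apologize to [PERSON_2].")
print(first, second, restored, sep="\n")
```

Because the cache persists across calls, "Alice" maps to the same placeholder in both turns, so a remote model could follow the thread without seeing the real name.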
Shedding Light on VLN Robustness: A Black-box Framework for Indoor Lighting-based Adversarial Attack
Li, Chenyang, Tang, Wenbing, Huang, Yihao, Zhan, Sinong Simon, Hu, Ming, Jia, Xiaojun, Liu, Yang
Vision-and-Language Navigation (VLN) agents have made remarkable progress, but their robustness remains insufficiently studied. Existing adversarial evaluations often rely on perturbations that manifest as unusual textures rarely encountered in everyday indoor environments. Errors under such contrived conditions have limited practical relevance, as real-world agents are unlikely to encounter such artificial patterns. In this work, we focus on indoor lighting, an intrinsic yet largely overlooked scene attribute that strongly influences navigation. We propose Indoor Lighting-based Adversarial Attack (ILA), a black-box framework that manipulates global illumination to disrupt VLN agents. Motivated by typical household lighting usage, we design two attack modes: Static Indoor Lighting-based Attack (SILA), where the lighting intensity remains constant throughout an episode, and Dynamic Indoor Lighting-based Attack (DILA), where lights are switched on or off at critical moments to induce abrupt illumination changes. We evaluate ILA on two state-of-the-art VLN models across three navigation tasks. Results show that ILA significantly increases failure rates while reducing trajectory efficiency, revealing previously unrecognized vulnerabilities of VLN agents to realistic indoor lighting variations.
Self-Adaptive Graph Mixture of Models
Meena, Mohit, Punjabi, Yash, A, Abhishek, Sharma, Vishal, Chandran, Mahesh
Graph Neural Networks (GNNs) have emerged as powerful tools for learning over graph-structured data, yet recent studies have shown that their performance gains are beginning to plateau. In many cases, well-established models such as GCN and GAT, when appropriately tuned, can match or even exceed the performance of more complex, state-of-the-art architectures. This trend highlights a key limitation in the current landscape: the difficulty of selecting the most suitable model for a given graph task or dataset. To address this, we propose Self-Adaptive Graph Mixture of Models (SAGMM), a modular and practical framework that learns to automatically select and combine the most appropriate GNN models from a diverse pool of architectures. Unlike prior mixture-of-experts approaches that rely on variations of a single base model, SAGMM leverages architectural diversity and a topology-aware attention gating mechanism to adaptively assign experts to each node based on the structure of the input graph. To improve efficiency, SAGMM includes a pruning mechanism that reduces the number of active experts during training and inference without compromising performance. We also explore a training-efficient variant in which expert models are pretrained and frozen, and only the gating and task-specific layers are trained. We evaluate SAGMM on 16 benchmark datasets covering node classification, graph classification, regression, and link prediction tasks, and demonstrate that it consistently outperforms or matches leading GNN baselines and prior mixture-based methods, offering a robust and adaptive solution for real-world graph learning. Code is released at https://github.com/ast-fri/SAGMM.
Bi-View Embedding Fusion: A Hybrid Learning Approach for Knowledge Graph's Nodes Classification Addressing Problems with Limited Data
Napoli, Rosario, Lonia, Giovanni, Celesti, Antonio, Villari, Massimo, Fazio, Maria
Traditional Machine Learning (ML) methods require large amounts of data to perform well, limiting their applicability in sparse or incomplete scenarios and forcing the use of additional synthetic data to improve model training. To overcome this challenge, the research community is increasingly turning to Graph Machine Learning (GML), which offers a powerful alternative by using relationships within data. However, this method also faces limitations, particularly when dealing with Knowledge Graphs (KGs), which can hide a large amount of information due to their semantic nature. This study introduces Bi-View, a novel hybrid approach that increases the informative content of node features in KGs to generate enhanced Graph Embeddings (GEs) that are used to improve GML models without relying on additional synthetic data. The proposed work combines two complementary GE techniques: Node2Vec, which captures structural patterns through unsupervised random walks, and GraphSAGE, which aggregates neighbourhood information in a supervised way. Node2Vec embeddings are first computed to represent the graph topology, and node features are then enriched with centrality-based metrics, which are used as input for the GraphSAGE model. Moreover, a fusion layer combines the original Node2Vec embeddings with the GraphSAGE-influenced representations, resulting in a dual-perspective embedding space. Such a fusion captures both topological and semantic properties of the graph, enabling the model to exploit informative features that may exist in the dataset but that are not explicitly represented. Our approach improves downstream task performance, especially in scenarios with poor initial features, providing a basis for more accurate and precise KG-enhanced GML models.
AA-Omniscience: Evaluating Cross-Domain Knowledge Reliability in Large Language Models
Jackson, Declan, Keating, William, Cameron, George, Hill-Smith, Micah
We introduce AA-Omniscience, a benchmark designed to measure both factual recall and knowledge calibration across 6,000 questions. Questions are derived from authoritative academic and industry sources, and cover 42 economically relevant topics within six different domains. The evaluation measures a model's Omniscience Index, a bounded metric (-100 to 100) measuring factual recall that jointly penalizes hallucinations and rewards abstention when uncertain, with 0 equating to a model that answers questions correctly as much as it does incorrectly. Among evaluated models, Claude 4.1 Opus attains the highest score (4.8), making it one of only three models to score above zero. These results reveal persistent factuality and calibration weaknesses across frontier models. Performance also varies by domain, with the models from three different research labs leading across the six domains. This performance variability suggests models should be chosen according to the demands of the use case rather than general performance for tasks where knowledge is important.
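A metric with the properties stated in this abstract can be sketched as below. The formula is an assumption chosen to match those properties (bounded between -100 and 100, zero when correct and incorrect answers balance, abstentions neither rewarded nor penalized directly); the abstract does not give the benchmark's actual definition, and `omniscience_index` is a hypothetical name.

```python
def omniscience_index(correct, incorrect, abstained):
    """Assumed form: 100 * (correct - incorrect) / total questions."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100.0 * (correct - incorrect) / total


# Made-up counts for illustration only.
balanced = omniscience_index(correct=50, incorrect=50, abstained=0)
cautious = omniscience_index(correct=40, incorrect=30, abstained=30)
guessing = omniscience_index(correct=40, incorrect=60, abstained=0)
print(balanced, cautious, guessing)
```

Under this assumed form, the two models with 40 correct answers differ only in how they handle the remaining questions: abstaining on 30 of them yields a positive score, while answering all of them wrongly yields a negative one.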
An Evaluation of Representation Learning Methods in Particle Physics Foundation Models
Chen, Michael, Kansal, Raghav, Gandrakota, Abhijith, Hao, Zichun, Ngadiuba, Jennifer, Spiropulu, Maria
We present a systematic evaluation of representation learning objectives for particle physics within a unified framework. Our study employs a shared transformer-based particle-cloud encoder with standardized preprocessing, matched sampling, and a consistent evaluation protocol on a jet classification dataset. We compare contrastive (supervised and self-supervised), masked particle modeling, and generative reconstruction objectives under a common training regimen. In addition, we introduce targeted supervised architectural modifications that achieve state-of-the-art performance on benchmark evaluations. This controlled comparison isolates the contributions of the learning objective, highlights their respective strengths and limitations, and provides reproducible baselines. We position this work as a reference point for the future development of foundation models in particle physics, enabling more transparent and robust progress across the community.