Government
SmoothGuard: Defending Multimodal Large Language Models with Noise Perturbation and Clustering Aggregation
Su, Guangzhi, Huang, Shuchang, Ke, Yutong, Liu, Zhuohang, Qian, Long, Huang, Kaizhu
Multimodal large language models (MLLMs) have achieved impressive performance across diverse tasks by jointly reasoning over textual and visual inputs. Despite their success, these models remain highly vulnerable to adversarial manipulations, raising concerns about their safety and reliability in deployment. In this work, we first generalize an approach for generating adversarial images within the HuggingFace ecosystem and then introduce SmoothGuard, a lightweight and model-agnostic defense framework that enhances the robustness of MLLMs through randomized noise injection and clustering-based prediction aggregation. Our method perturbs continuous modalities (e.g., images and audio) with Gaussian noise, generates multiple candidate outputs, and applies embedding-based clustering to filter out adversarially influenced predictions. The final answer is selected from the majority cluster, ensuring stable responses even under malicious perturbations. Extensive experiments on POPE, LLaVA-Bench (In-the-Wild), and MM-SafetyBench demonstrate that SmoothGuard improves resilience to adversarial attacks while maintaining competitive utility. Ablation studies further identify an optimal noise range (0.1-0.2) that balances robustness and utility.
Systematic Absence of Low-Confidence Nighttime Fire Detections in VIIRS Active Fire Product: Evidence of Undocumented Algorithmic Filtering
The Visible Infrared Imaging Radiometer Suite (VIIRS) active fire product is widely used for global fire monitoring, yet its confidence classification scheme exhibits an undocumented systematic pattern. Through analysis of 21,540,921 fire detections spanning one year (January 2023 - January 2024), I demonstrate a complete absence of low-confidence classifications during nighttime observations. Of 6,007,831 nighttime fires, zero were classified as low confidence, compared to an expected 696,908 under statistical independence (chi-squared = 1,474,795, p < 10^-15, Z = -833). This pattern persists globally across all months, latitude bands, and both NOAA-20 and Suomi-NPP satellites. Machine learning reverse-engineering (88.9% accuracy), bootstrap simulation (1,000 iterations), and spatial-temporal analysis confirm this is an algorithmic constraint rather than a geophysical phenomenon. Brightness temperature analysis reveals nighttime fires below approximately 295K are likely excluded entirely rather than flagged as low-confidence, while daytime fires show normal confidence distributions. This undocumented behavior affects 27.9% of all VIIRS fire detections and has significant implications for fire risk assessment, day-night detection comparisons, confidence-weighted analyses, and any research treating confidence levels as uncertainty metrics. I recommend explicit documentation of this algorithmic constraint in VIIRS user guides and reprocessing strategies for affected analyses.
SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Gu, Ken, Bhat, Advait, Merrill, Mike A, West, Robert, Liu, Xin, McDuff, Daniel, Althoff, Tim
Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
Ji, Haichao, Wang, Zibo, Pan, Cheng, Han, Meng, Zhu, Yifei, Wang, Dan, Han, Zhu
Abstract--Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (F A) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAF A, the first system that integrates LLM-agent-based data analytics with F A. LAF A introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable F A workflows. T o improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAF A consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive F A operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the F A setting. The rapid development of Large Language Models (LLMs) has offered unprecedented capabilities in natural language understanding, reasoning, and planning [1], significantly transforming the landscape of data analytics. LLMs can interpret complex analytical intents, generate structured code, and orchestrate multi-step tasks by interacting with external environments such as databases and computation sandboxes. These capabilities have led to the emergence of LLM-based agents that decompose high-level queries, plan analytical workflows, and execute or verify results through tool interactions.
Artificially intelligent agents in the social and behavioral sciences: A history and outlook
Holme, Petter, Tsvetkova, Milena
We review the historical development and current trends of artificially intelligent agents (agentic AI) in the social and behavioral sciences: from the first programmable computers, and social simulations soon thereafter, to today's experiments with large language models. This overview emphasizes the role of AI in the scientific process and the changes brought about, both through technological advancements and the broader evolution of science from around 1950 to the present. Some of the specific points we cover include: the challenges of presenting the first social simulation studies to a world unaware of computers, the rise of social systems science, intelligent game theoretic agents, the age of big data and the epistemic upheaval in its wake, and the current enthusiasm around applications of generative AI, and many other topics. A pervasive theme is how deeply entwined we are with the technologies we use to understand ourselves.
Discrete Diffusion Models: Novel Analysis and New Sampler Guarantees
Liang, Yuchen, Liang, Yingbin, Lai, Lifeng, Shroff, Ness
Discrete diffusion models have recently gained significant prominence in applications involving natural language and graph data. A key factor influencing their effectiveness is the efficiency of discretized samplers. Among these, $ฯ$-leaping samplers have become particularly popular due to their theoretical and empirical success. However, existing theoretical analyses of $ฯ$-leaping often rely on somewhat restrictive and difficult-to-verify regularity assumptions, and their convergence bounds contain quadratic dependence on the vocabulary size. In this work, we introduce a new analytical approach for discrete diffusion models that removes the need for such assumptions. For the standard $ฯ$-leaping method, we establish convergence guarantees in KL divergence that scale linearly with vocabulary size, improving upon prior results with quadratic dependence. Our approach is also more broadly applicable: it provides the first convergence guarantees for other widely used samplers, including the Euler method and Tweedie $ฯ$-leaping. Central to our approach is a novel technique based on differential inequalities, offering a more flexible alternative to the traditional Girsanov change-of-measure methods. This technique may also be of independent interest for the analysis of other stochastic processes.
Algorithmic Collusion of Pricing and Advertising on E-commerce Platforms
When online sellers use AI learning algorithms to automatically compete on e-commerce platforms, there is concern that they will learn to coordinate on higher than competitive prices. However, this concern was primarily raised in single-dimension price competition. We investigate whether this prediction holds when sellers make pricing and advertising decisions together, i.e., two-dimensional decisions. We analyze competition in multi-agent reinforcement learning, and use a large-scale dataset from Amazon.com to provide empirical evidence. We show that when consumers have high search costs, learning algorithms can coordinate on prices lower than competitive prices, facilitating a win-win-win for consumers, sellers, and platforms. This occurs because algorithms learn to coordinate on lower advertising bids, which lower advertising costs, leading to lower prices and enlarging demand on the platform. We also show that our results generalize to any learning algorithm that uses exploration of price and advertising bids. Consistent with our predictions, an empirical analysis shows that price levels exhibit a negative interaction between estimated consumer search costs and algorithm usage index. We analyze the platform's strategic response and find that reserve price adjustments will not increase platform profits, but commission adjustments will, while maintaining the beneficial outcomes for both sellers and consumers.
SafePLUG: Empowering Multimodal LLMs with Pixel-Level Insight and Temporal Grounding for Traffic Accident Understanding
Sheng, Zihao, Huang, Zilin, Qu, Yansong, Chen, Jiancong, Luo, Yuhao, Chen, Yen-Jung, Leng, Yue, Chen, Sikai
Multimodal large language models (MLLMs) have achieved remarkable progress across a range of vision-language tasks and demonstrate strong potential for traffic accident understanding. However, existing MLLMs in this domain primarily focus on coarse-grained image-level or video-level comprehension and often struggle to handle fine-grained visual details or localized scene components, limiting their applicability in complex accident scenarios. To address these limitations, we propose SafePLUG, a novel framework that empowers MLLMs with both Pixel-Level Understanding and temporal Grounding for comprehensive traffic accident analysis. SafePLUG supports both arbitrary-shaped visual prompts for region-aware question answering and pixel-level segmentation based on language instructions, while also enabling the recognition of temporally anchored events in traffic accident scenarios. To advance the development of MLLMs for traffic accident understanding, we curate a new dataset containing multimodal question-answer pairs centered on diverse accident scenarios, with detailed pixel-level annotations and temporal event boundaries. Experimental results show that SafePLUG achieves strong performance on multiple tasks, including region-based question answering, pixel-level segmentation, temporal event localization, and accident event understanding. These capabilities lay a foundation for fine-grained understanding of complex traffic scenes, with the potential to improve driving safety and enhance situational awareness in smart transportation systems. The code, dataset, and model checkpoints will be made publicly available at: https://zihaosheng.github.io/SafePLUG
Multilingual Political Views of Large Language Models: Identification and Steering
Gurgurov, Daniil, Trinley, Katharina, Vykopal, Ivan, van Genabith, Josef, Ostermann, Simon, Zamparelli, Roberto
Large language models (LLMs) are increasingly used in everyday tools and applications, raising concerns about their potential influence on political views. While prior research has shown that LLMs often exhibit measurable political biases--frequently skewing toward liberal or progressive positions--key gaps remain. Most existing studies evaluate only a narrow set of models and languages, leaving open questions about the generalizability of political biases across architectures, scales, and multilingual settings. Moreover, few works examine whether these biases can be actively controlled. In this work, we address these gaps through a large-scale study of political orientation in modern open-source instruction-tuned LLMs. We evaluate seven models, including LLaMA-3.1, Qwen-3, and Aya-Expanse, across 14 languages using the Political Compass Test with 11 semantically equivalent paraphrases per statement to ensure robust measurement. Our results reveal that larger models consistently shift toward libertarian-left positions, with significant variations across languages and model families. To test the manipulability of political stances, we utilize a simple center-of-mass activation intervention technique and show that it reliably steers model responses toward alternative ideological positions across multiple languages. Our code is publicly available at https://github.com/d-gurgurov/Political-Ideologies-LLMs.
TempoPFN: Synthetic Pre-training of Linear RNNs for Zero-shot Time Series Forecasting
Moroshan, Vladyslav, Siems, Julien, Zela, Arber, Carstensen, Timur, Hutter, Frank
This paper presents TempoPFN, a univariate time series foundation model based on linear Recurrent Neural Networks (RNNs) pre-trained exclusively on synthetic data. The model uses a GatedDeltaProduct architecture with state-weaving for fully parallelizable training across sequence lengths, eliminating the need for windowing or summarization techniques while maintaining robust temporal state-tracking. Our comprehensive synthetic data pipeline unifies diverse generators--including stochastic differential equations, Gaussian processes, and audio synthesis--with novel augmentations. In zero-shot evaluations on the Gift-Eval benchmark, TempoPFN achieves top-tier competitive performance, outperforming all existing synthetic-only approaches and surpassing the majority of models trained on real-world data, while being more efficient than existing baselines by leveraging fully paralleliz-able training and inference. Recent advances in large language models have inspired foundation models for time series forecasting that enable zero-shot predictions across diverse datasets without fine-tuning (Ansari et al., 2024; Das et al., 2024; Woo et al., 2024; Auer et al., 2025). By treating historical observations as input context, these models democratize forecasting for non-experts and excel in data-scarce domains. However, current approaches face critical limitations. While non-linear RNNs like those in TiReX (Auer et al., 2025) maintain temporal state, they require sequential processing that limits scalability. Although some recent models attempt synthetic-only pre-training including ForecastPFN (Dooley et al., 2023), CauKer (Xie et al., 2024), and Mamba4Cast (Bhethanabhotla & Swelam, 2024) none reported state-of-the-art performance on the Gift-Eval benchmark. TabPFN-TS (Hoo et al., 2024), which adapts a tabular foundation model to time series, achieves strong Gift-Eval performance but does not release its synthetic pre-training data, limiting reproducibility and extensibility. Figure 1: (Left) Synthetic Data Generation pipeline containing a mix of novel and existing time-series generators are augmented with a diverse set of augmentations to produce the time-series used for training. We introduce T empoPFN (see Table 1 and Figure 1), a time series forecasting foundation model using linear RNNs with GatedDeltaProduct recurrence (Siems et al., 2025) for parallelizable training and inference across the sequence length. Unlike TiRex (Auer et al., 2025) which argued that non-linear RNNs like sLSTM are necessary for time-series forecasting due to their state-tracking capabilities we find that linear RNNs based on the GatedDeltaProduct recurrence are sufficient, in line with recent research demonstrating how linear RNNs can perform state-tracking (Grazzi et al., 2025).