Goto

Collaborating Authors

 Government


For Those Who May Find Themselves on the Red Team

arXiv.org Artificial Intelligence

This position paper argues that literary scholars must engage with large language model (LLM) interpretability research. While doing so will involve ideological struggle, if not out-right complicity, the necessity of this engagement is clear: the abiding instrumentality of current approaches to interpretability cannot be the only standard by which we measure interpretation with LLMs. One site at which this struggle could take place, I suggest, is the red team.


Natural Emergent Misalignment from Reward Hacking in Production RL

arXiv.org Artificial Intelligence

We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.


Future Is Unevenly Distributed: Forecasting Ability of LLMs Depends on What We're Asking

arXiv.org Artificial Intelligence

Large Language Models (LLMs) demonstrate partial forecasting competence across social, political, and economic events. Y et, their predictive ability varies sharply with domain structure and prompt framing. We investigate how forecasting performance varies with different model families on real-world questions about events that happened beyond the model cutoff date. We analyze how context, question type, and external knowledge affect accuracy and calibration, and how adding factual news context modifies belief formation and failure modes. Our results show that forecasting ability is highly variable as it depends on what, and how, we ask.


OmniStruct: Universal Text-to-Structure Generation across Diverse Schemas

arXiv.org Artificial Intelligence

The ability of Large Language Models (LLMs) to generate structured outputs that follow arbitrary schemas is crucial to a wide range of downstream tasks that require diverse structured representations of results such as information extraction, table generation, and function calling. While modern LLMs excel in generating unstructured responses in natural language, whether this advancement translates to a strong performance on text-to-structure tasks remains unclear. To bridge this gap, we first introduce OmniStruct, a comprehensive benchmark for assessing LLMs' capabilities on diverse text-to-structure tasks such as information extraction, table generation, and function calling. We build OmniStruct by identifying existing datasets across a wide range of tasks that are suitable for a structured answer format, and adapting them under a unified text-to-structure problem setting. To facilitate the development of efficient text-to-structure models, we collect high-quality training data via synthetic task generation. Without using any supervised data for OmniStruct tasks, our experiments demonstrate the possibility of fine-tuning much smaller models on synthetic data into universal structured generation models that can rival the performance of GPT-4o.


Clinician-in-the-Loop Smart Home System to Detect Urinary Tract Infection Flare-Ups via Uncertainty-Aware Decision Support

arXiv.org Artificial Intelligence

Urinary tract infection (UTI) flare-ups pose a significant health risk for older adults with chronic conditions. These infections often go unnoticed until they become severe, making early detection through innovative smart home technologies crucial. Traditional machine learning (ML) approaches relying on simple binary classification for UTI detection offer limited utility to nurses and practitioners as they lack insight into prediction uncertainty, hindering informed clinical decision-making. This paper presents a clinician-in-the-loop (CIL) smart home system that leverages ambient sensor data to extract meaningful behavioral markers, train robust predictive ML models, and calibrate them to enable uncertainty-aware decision support. The system incorporates a statistically valid uncertainty quantification method called Conformal-Calibrated Interval (CCI), which quantifies uncertainty and abstains from making predictions ("I don't know") when the ML model's confidence is low. Evaluated on real-world data from eight smart homes, our method outperforms baseline methods in recall and other classification metrics while maintaining the lowest abstention proportion and interval width. A survey of 42 nurses confirms that our system's outputs are valuable for guiding clinical decision-making, underscoring their practical utility in improving informed decisions and effectively managing UTIs and other condition flare-ups in older adults.


Reduced-Basis Deep Operator Learning for Parametric PDEs with Independently Varying Boundary and Source Data

arXiv.org Artificial Intelligence

Parametric PDEs power modern simulation, design, and digital-twin systems, yet their many-query workloads still hinge on repeatedly solving large finite-element systems. Existing operator-learning approaches accelerate this process but often rely on opaque learned trunks, require extensive labeled data, or break down when boundary and source data vary independently from physical parameters. We introduce RB-DeepONet, a hybrid operator-learning framework that fuses reduced-basis (RB) numerical structure with the branch-trunk architecture of DeepONet. The trunk is fixed to a rigorously constructed RB space generated offline via Greedy selection, granting physical interpretability, stability, and certified error control. The branch network predicts only RB coefficients and is trained label-free using a projected variational residual that targets the RB-Galerkin solution. For problems with independently varying loads or boundary conditions, we develop boundary and source modal encodings that compress exogenous data into low-dimensional coordinates while preserving accuracy. Combined with affine or empirical interpolation decompositions, RB-DeepONet achieves a strict offline-online split: all heavy lifting occurs offline, and online evaluation scales only with the RB dimension rather than the full mesh. We provide convergence guarantees separating RB approximation error from statistical learning error, and numerical experiments show that RB-DeepONet attains accuracy competitive with intrusive RB-Galerkin, POD-DeepONet, and FEONet while using dramatically fewer trainable parameters and achieving significant speedups. This establishes RB-DeepONet as an efficient, stable, and interpretable operator learner for large-scale parametric PDEs.


Can LLMs Help Allocate Public Health Resources? A Case Study on Childhood Lead Testing

arXiv.org Artificial Intelligence

Public health agencies face critical challenges in identifying high-risk neighborhoods for childhood lead exposure with limited resources for outreach and intervention programs. To address this, we develop a Priority Score integrating untested children proportions, elevated blood lead prevalence, and public health coverage patterns to support optimized resource allocation decisions across 136 neighborhoods in Chicago, New York City, and Washington, D.C. We leverage these allocation tasks, which require integrating multiple vulnerability indicators and interpreting empirical evidence, to evaluate whether large language models (LLMs) with agentic reasoning and deep research capabilities can effectively allocate public health resources when presented with structured allocation scenarios. LLMs were tasked with distributing 1,000 test kits within each city based on neighborhood vulnerability indicators. Results reveal significant limitations: LLMs frequently overlooked neighborhoods with highest lead prevalence and largest proportions of untested children, such as West Englewood in Chicago, while allocating disproportionate resources to lower-priority areas like Hunts Point in New York City. Overall accuracy averaged 0.46, reaching a maximum of 0.66 with ChatGPT 5 Deep Research. Despite their marketed deep research capabilities, LLMs struggled with fundamental limitations in information retrieval and evidence-based reasoning, frequently citing outdated data and allowing non-empirical narratives about neighborhood conditions to override quantitative vulnerability indicators.


ProHD: Projection-Based Hausdorff Distance Approximation

arXiv.org Artificial Intelligence

The Hausdorff distance (HD) is a robust measure of set dissimilarity, but computing it exactly on large, high-dimensional datasets is prohibitively expensive. We propose \textbf{ProHD}, a projection-guided approximation algorithm that dramatically accelerates HD computation while maintaining high accuracy. ProHD identifies a small subset of candidate "extreme" points by projecting the data onto a few informative directions (such as the centroid axis and top principal components) and computing the HD on this subset. This approach guarantees an underestimate of the true HD with a bounded additive error and typically achieves results within a few percent of the exact value. In extensive experiments on image, physics, and synthetic datasets (up to two million points in $D=256$), ProHD runs 10--100$\times$ faster than exact algorithms while attaining 5--20$\times$ lower error than random sampling-based approximations. Our method enables practical HD calculations in scenarios like large vector databases and streaming data, where quick and reliable set distance estimation is needed.


Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models

arXiv.org Artificial Intelligence

Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models to answer financial questions using external knowledge bases of U.S. SEC filings, earnings reports, and regulatory documents. However, existing work lacks systematic comparison of vector-based and non-vector RAG architectures for financial documents, and the empirical impact of advanced RAG techniques on retrieval accuracy, answer quality, latency, and cost remain unclear. We present the first systematic evaluation comparing vector-based agentic RAG using hybrid search and metadata filtering against hierarchical node-based systems that traverse document structure without embeddings. We evaluate two enhancement techniques applied to the vector-based architecture, i) cross-encoder reranking for retrieval precision, and ii) small-to-big chunk retrieval for context completeness. Across 1,200 SEC 10-K, 10-Q, and 8-K filings on a 150-question benchmark, we measure retrieval metrics (MRR, Recall@5), answer quality through LLM-as-a-judge pairwise comparisons, latency, and preprocessing costs. Vector-based agentic RAG achieves a 68% win rate over hierarchical node-based systems with comparable latency (5.2 compared to 5.98 seconds). Cross-encoder reranking achieves a 59% absolute improvement at optimal parameters (10, 5) for MRR@5. Small-to-big retrieval achieves a 65% win rate over baseline chunking with only 0.2 seconds additional latency. Our findings reveal that applying advanced RAG techniques to financial Q&A systems improves retrieval accuracy, answer quality, and has cost-performance tradeoffs to be considered in production.


Anti-Jamming based on Null-Steering Antennas and Intelligent UAV Swarm Behavior

arXiv.org Artificial Intelligence

Unmanned Aerial Vehicle (UAV) swarms represent a key advancement in autonomous systems, enabling coordinated missions through inter-UAV communication. However, their reliance on wireless links makes them vulnerable to jamming, which can disrupt coordination and mission success. This work investigates whether a UAV swarm can effectively overcome jamming while maintaining communication and mission efficiency. To address this, a unified optimization framework combining Genetic Algorithms (GA), Supervised Learning (SL), and Reinforcement Learning (RL) is proposed. The mission model, structured into epochs and timeslots, allows dynamic path planning, antenna orientation, and swarm formation while progressively enforcing collision rules. Null-steering antennas enhance resilience by directing antenna nulls toward interference sources. Results show that the GA achieved stable, collision-free trajectories but with high computational cost. SL models replicated GA-based configurations but struggled to generalize under dynamic or constrained settings. RL, trained via Proximal Policy Optimization (PPO), demonstrated adaptability and real-time decision-making with consistent communication and lower computational demand. Additionally, the Adaptive Movement Model generalized UAV motion to arbitrary directions through a rotation-based mechanism, validating the scalability of the proposed system. Overall, UAV swarms equipped with null-steering antennas and guided by intelligent optimization algorithms effectively mitigate jamming while maintaining communication stability, formation cohesion, and collision safety. The proposed framework establishes a unified, flexible, and reproducible basis for future research on resilient swarm communication systems.