Mitigating Social Bias in English and Urdu Language Models Using PRM-Guided Candidate Selection and Sequential Refinement
Large language models (LLMs) increasingly mediate human communication, decision support, content creation, and information retrieval. Despite impressive fluency, these systems frequently produce biased or stereotypical content, especially when prompted with socially sensitive language. A growing body of research has demonstrated that such biases disproportionately affect low-resource languages, where training data is limited and culturally unrepresentative. This paper presents a comprehensive study of inference-time bias mitigation, a strategy that avoids retraining or fine-tuning and instead operates directly on model outputs. Building on preference-ranking models (PRMs), we introduce a unified evaluation framework comparing three methods: (1) baseline single-word generation, (2) PRM-Select best-of-N sampling, and (3) PRM-Sequential refinement guided by PRM critiques. We evaluate these techniques across 200 English prompts and their Urdu counterparts, designed to reflect socio-cultural contexts relevant to gender, ethnicity, religion, nationality, disability, profession, age, and socioeconomic categories. Using GPT-3.5 as a candidate generator and GPT-4o-mini as a PRM-based bias and utility scorer, we provide an extensive quantitative analysis of bias reduction, utility preservation, and cross-lingual disparities. Our findings show: (a) substantial gains over the baseline for both languages; (b) consistently lower fairness scores for Urdu across all methods, highlighting structural inequities in multilingual LLM training; and (c) distinct improvement trajectories between PRM-Select and PRM-Sequential. The study contributes an extensible methodology, interpretable metrics, and cross-lingual comparisons that can support future work on fairness evaluation in low-resource languages.
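The PRM-Select method described above (best-of-N sampling with a preference-ranking model as judge) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `toy_generate` and `toy_score` are hypothetical stand-ins for the GPT-3.5 candidate generator and GPT-4o-mini PRM scorer, and the canned answers and scores are invented for demonstration.

```python
from itertools import cycle

def prm_select(prompt, generate, prm_score, n=8):
    """Best-of-N sampling: draw N candidate responses and keep the one
    the preference-ranking model (PRM) scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    # Higher PRM score = less biased / more useful, per the paper's setup.
    return max(candidates, key=prm_score)

# Hypothetical stand-ins for the LLM generator and the PRM scorer.
_canned = cycle(["stereotyped answer", "neutral answer", "balanced answer"])

def toy_generate(prompt):
    return next(_canned)

def toy_score(text):
    return {"neutral answer": 0.9,
            "balanced answer": 0.8,
            "stereotyped answer": 0.1}[text]

best = prm_select("The engineer was ...", toy_generate, toy_score, n=5)
print(best)  # -> neutral answer
```

PRM-Sequential would differ in the loop structure: instead of scoring independent samples, each iteration would feed the PRM's critique back into the generator to refine the previous candidate.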
Toward Faithful Retrieval-Augmented Generation with Sparse Autoencoders
Xiong, Guangzhi, He, Zhenghao, Liu, Bohan, Sinha, Sanchit, Zhang, Aidong
Retrieval-Augmented Generation (RAG) improves the factuality of large language models (LLMs) by grounding outputs in retrieved evidence, but faithfulness failures, where generations contradict or extend beyond the provided sources, remain a critical challenge. Existing hallucination detection methods for RAG often rely either on large-scale detector training, which requires substantial annotated data, or on querying external LLM judges, which leads to high inference costs. Although some approaches attempt to leverage internal representations of LLMs for hallucination detection, their accuracy remains limited. Motivated by recent advances in mechanistic interpretability, we employ sparse autoencoders (SAEs) to disentangle internal activations, successfully identifying features that are specifically triggered during RAG hallucinations. Building on a systematic pipeline of information-based feature selection and additive feature modeling, we introduce RAGLens, a lightweight hallucination detector that accurately flags unfaithful RAG outputs using LLM internal representations. RAGLens not only achieves superior detection performance compared to existing methods, but also provides interpretable rationales for its decisions, enabling effective post-hoc mitigation of unfaithful RAG. Finally, we justify our design choices and reveal new insights into the distribution of hallucination-related signals within LLMs. The code is available at https://github.com/Teddy-XiongGZ/RAGLens.
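The "additive feature modeling" step the abstract mentions can be pictured as a weighted sum over a small set of selected SAE features, thresholded to flag unfaithful outputs. This is a hedged sketch of the general idea only: the feature indices, weights, and activations below are hypothetical, not values from RAGLens.

```python
def additive_score(activations, selected, weights, bias=0.0):
    """Additive feature model: weighted sum of the activations of SAE
    features selected (e.g., by an information-based criterion) as
    hallucination indicators."""
    return bias + sum(weights[f] * activations.get(f, 0.0) for f in selected)

def flag_unfaithful(activations, selected, weights, threshold=0.5):
    """Flag the RAG output as unfaithful when the additive score exceeds
    a decision threshold."""
    return additive_score(activations, selected, weights) > threshold

# Hypothetical SAE feature indices and learned weights.
selected = [101, 205, 340]
weights = {101: 0.8, 205: 0.6, 340: 0.4}

faithful = {101: 0.0, 205: 0.1, 340: 0.0}      # indicator features barely fire
hallucinated = {101: 0.9, 205: 0.7, 340: 0.5}  # indicator features fire strongly

print(flag_unfaithful(faithful, selected, weights))       # -> False
print(flag_unfaithful(hallucinated, selected, weights))   # -> True
```

Because the score is a transparent sum, the per-feature contributions double as the interpretable rationale the paper highlights.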
The SMART+ Framework for AI Systems
Kandikatla, Laxmiraju, Radeljic, Branislav
Artificial Intelligence (AI) systems are now an integral part of multiple industries. In clinical research, AI supports automated adverse event detection in clinical trials, patient eligibility screening for protocol enrollment, and data quality validation. Beyond healthcare, AI is transforming finance through real-time fraud detection, automated loan risk assessment, and algorithmic decision-making. Similarly, in manufacturing, AI enables predictive maintenance to reduce equipment downtime, enhances quality control through computer-vision inspection, and optimizes production workflows using real-time operational data. While these technologies enhance operational efficiency, they introduce new challenges regarding safety, accountability, and regulatory compliance. To address these concerns, we introduce the SMART+ Framework - a structured model built on the pillars of Safety, Monitoring, Accountability, Reliability, and Transparency, and further enhanced with Privacy & Security, Data Governance, Fairness & Bias, and Guardrails. SMART+ offers a practical, comprehensive approach to evaluating and governing AI systems across industries. The framework aligns with evolving regulatory guidance by integrating operational safeguards, oversight procedures, and strengthened privacy and governance controls, supporting risk mitigation, trust-building, and compliance readiness. By enabling responsible AI adoption and ensuring auditability, SMART+ provides a robust foundation for effective AI governance in clinical research.
JaGuard: Jamming Correction of GNSS Deviation with Deep Temporal Graphs
Kesić, Ivana, Blatnik, Aljaž, Fortuna, Carolina, Bertalanič, Blaž
Global Navigation Satellite Systems (GNSS) face growing disruption from intentional jamming, undermining availability exactly when reliable positioning and timing are essential. We tackle this challenge by recasting jamming mitigation as a dynamic graph regression problem and propose Jamming Guardian (JaGuard), a new receiver-centric method based on deep temporal graph networks that estimates, and thereby corrects, the receiver's latitude and longitude errors. At each 1 Hz epoch, we model the satellite-receiver scene as a heterogeneous star graph with the receiver as the center node and the tracked satellites as leaves. The satellites carry time-varying attributes such as SNR, azimuth, elevation, and latitude/longitude. A single-layer Heterogeneous Graph ConvLSTM (HeteroGCLSTM) fuses one-hop spatial context with short-term temporal dynamics to produce a 2D deviation vector for error mitigation. We evaluate our approach on datasets collected from physical hardware (two different commercial receivers) subjected to controlled conducted RF interference. Interference is introduced with three jammer types: continuous wave (CW), multi-tone (3 CW), and wideband FM. Each jammer type was exercised at six power levels from 45 to 70 dBm, with 50 repetitions per scenario, covering pre-jam, jam, and recovery phases. Compared to strong multivariate time-series baselines (TSMixer MLP, uniform CNN, and Seq2Point CNN), our model consistently yields the lowest Mean Absolute Error (MAE) in positional deviation. Under severe jamming at 45 dBm, it achieves an MAE of 3.64-7.74 cm. On mixed-mode datasets that pool all power levels, the MAE is 3.78 cm for GP01 and 4.25 cm for U-blox 10, surpassing Seq2Point, TSMixer, and uniform CNN. A data-efficiency split further shows that with only 10% of the training data, our approach remains clearly ahead, achieving an MAE of about 20 cm versus 36-42 cm for the baselines.
Global Navigation Satellite Systems (GNSS) underpin nearly every critical infrastructure, from telecommunications [1], aviation safety [2], and power-grid synchronization [3], to emerging drone ecosystems where location privacy and integrity are paramount [4], and autonomous driving [5].
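The per-epoch star-graph construction JaGuard describes (receiver at the center, tracked satellites as leaves with time-varying features) can be sketched as a plain data structure. This is a minimal illustration under stated assumptions: the satellite IDs, feature names, and values are hypothetical, and a real pipeline would feed such snapshots into the HeteroGCLSTM rather than a dict.

```python
def star_graph_epoch(receiver_state, satellites):
    """Build one 1 Hz heterogeneous star-graph snapshot: the receiver is
    the center node; each tracked satellite is a leaf connected to it by
    a single edge (one-hop star topology)."""
    nodes = {"receiver": receiver_state}
    edges = []
    for sat_id, feats in satellites.items():
        nodes[sat_id] = feats           # time-varying leaf attributes
        edges.append(("receiver", sat_id))
    return {"nodes": nodes, "edges": edges}

# Hypothetical tracked satellites with SNR / azimuth / elevation features.
sats = {
    "G05": {"snr": 41.2, "azimuth": 112.0, "elevation": 34.5},
    "G12": {"snr": 38.7, "azimuth": 250.3, "elevation": 61.0},
}
g = star_graph_epoch({"lat": 46.05, "lon": 14.51}, sats)
print(len(g["edges"]))  # -> 2
```

Stacking one such snapshot per second yields the short temporal window over which the graph network regresses the 2D latitude/longitude deviation.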
The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models
Anderson, Joshua Wolff, Visweswaran, Shyam
Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explanations. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate multiple model classes on the stability of Shapley-based rankings. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to jointly consider accuracy, fairness, and explainability in model assessment rather than in isolation.
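The ranking-shift comparison described above can be quantified with a rank correlation between the Shapley-based feature orderings before and after a fairness constraint is applied. Below is a minimal sketch using a hand-rolled Kendall tau (the feature names and orderings are hypothetical, not results from the paper's datasets):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall rank correlation between two feature-importance rankings,
    each given as a list of feature names ordered most-to-least important.
    Returns 1.0 for identical rankings, -1.0 for fully reversed ones."""
    pos_a = {f: i for i, f in enumerate(rank_a)}
    pos_b = {f: i for i, f in enumerate(rank_b)}
    concordant = discordant = 0
    for f, g in combinations(rank_a, 2):
        # Pair is concordant when both rankings order f and g the same way.
        if (pos_a[f] - pos_a[g]) * (pos_b[f] - pos_b[g]) > 0:
            concordant += 1
        else:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

before = ["age", "creatinine", "sex", "race"]  # Shapley ranking, unconstrained model
after  = ["creatinine", "age", "race", "sex"]  # ranking after fairness constraint
print(round(kendall_tau(before, after), 2))    # -> 0.33
```

Computing this statistic separately per racial subgroup would expose the group-dependent ranking shifts the study reports.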
A Concise Review of Hallucinations in LLMs and their Mitigation
Pulkundwar, Parth, Dhanawade, Vivek, Yadav, Rohit, Sonkar, Minal, Asurlekar, Medha, Rathod, Sarita
Hallucinations pose a persistent challenge for language models, casting a long shadow over the promising realm of natural language processing. It is therefore crucial to understand the kinds of hallucinations that occur, their origins, and ways of reducing them. This document provides a concise, straightforward summary of all three, serving as a one-stop resource for a general understanding of hallucinations and how to mitigate them. In today's fast-moving field of Natural Language Processing (NLP), large language models (LLMs) such as GPT and BERT have become the principal agents of change: they can generate human-like text, answer multifaceted questions, and engage in conversation with near-human fluency.
Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings
Akbarian, Fatemeh, Baninajjar, Anahita, Zhang, Yingyi, Balashankar, Ananth, Aminifar, Amir
Multi-modal foundation models align images, text, and other modalities in a shared embedding space but remain vulnerable to adversarial illusions [35], where imperceptible perturbations disrupt cross-modal alignment and mislead downstream tasks. To counteract the effects of adversarial illusions, we propose a task-agnostic mitigation mechanism that reconstructs the input from the attacker's perturbed input through generative models, e.g., Variational Autoencoders (VAEs), to maintain natural alignment. To further enhance our proposed defense mechanism, we adopt a generative sampling strategy combined with a consensus-based aggregation scheme over the outcomes of the generated samples. Our experiments on the state-of-the-art multi-modal encoders show that our approach substantially reduces the illusion attack success rates to near-zero and improves cross-modal alignment by 4% (42 to 46) and 11% (32 to 43) in unperturbed and perturbed input settings, respectively, providing an effective and model-agnostic defense against adversarial illusions. Multi-modal foundation models have rapidly advanced the frontier of visual and linguistic understanding. Foundation models such as CLIP [19], ALIGN [11], and ImageBind [8] align a variety of heterogeneous modalities including images, text, and other modalities within a shared embedding space, thereby enabling zero-shot classification, cross-modal retrieval, and generative conditioning. The shared embedding space that underpins cross-modal flexibility simultaneously introduces a new attack surface, giving rise to adversarial illusions [35]. As downstream tasks directly rely on the integrity of this shared representation, even small perturbations in one modality can induce semantic misalignment across others, misleading models that depend on the embedding for retrieval, captioning, or generative conditioning. Defending against such cross-modal attacks presents unique challenges.
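The sampling-plus-consensus defense described above can be sketched abstractly: draw several generative reconstructions of the (possibly attacked) input, embed each, and aggregate the embeddings so that any residual perturbation in a single sample is averaged out. This is a toy sketch, not the paper's pipeline: `toy_reconstruct` and `toy_embed` are hypothetical stand-ins for a VAE decoder and a multi-modal encoder, and averaging is just one possible consensus rule.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def consensus_embedding(perturbed_input, reconstruct, embed, k=16, seed=0):
    """Sample k generative reconstructions of the input and aggregate
    their embeddings; the consensus (here, the mean) suppresses noise
    that any single reconstruction might retain."""
    rng = random.Random(seed)
    embs = [embed(reconstruct(perturbed_input, rng)) for _ in range(k)]
    dim = len(embs[0])
    return [sum(e[i] for e in embs) / k for i in range(dim)]

# Toy setup: the "clean" embedding is [1, 0]; each reconstruction is noisy.
clean = [1.0, 0.0]

def toy_reconstruct(x, rng):
    return [x[0] + rng.gauss(0, 0.3), x[1] + rng.gauss(0, 0.3)]

def toy_embed(x):
    return x

agg = consensus_embedding(clean, toy_reconstruct, toy_embed, k=64)
print(cosine(agg, clean) > 0.9)
```

With enough samples the aggregated embedding sits close to the clean one even though individual reconstructions are noisy, which is the intuition behind the consensus scheme.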
Prompt Fairness: Sub-group Disparities in LLMs
Zhong, Meiyu, Teku, Noel, Tandon, Ravi
Large Language Models (LLMs), though shown to be effective in many applications, can vary significantly in their response quality. In this paper, we investigate the problem of prompt fairness: the phrasing of a prompt by different users or styles, despite posing the same question in principle, may elicit different responses from an LLM. To quantify this disparity, we propose information-theoretic metrics that capture two dimensions of bias: subgroup sensitivity, the variability of responses within a subgroup, and cross-group consistency, the variability of responses across subgroups. Our empirical analysis reveals that certain demographic subgroups experience both higher internal variability and greater divergence from others, indicating structural inequities in model behavior. To mitigate these disparities, we propose practical interventions, including majority voting across multiple generations and prompt neutralization, which together improve response stability and enhance fairness across user populations. In our experiments, we observe clear prompt-sensitivity disparities across demographic subgroups: before mitigation, cross-group divergence values reach 0.28 and typically fall in the 0.14 to 0.22 range. After applying our neutralization and multi-generation strategy, these divergences consistently decrease, with the largest gap reduced to 0.22 and many distances falling to 0.17 or below, indicating more stable and consistent outputs across subgroups.
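The cross-group divergence and majority-voting mitigation described above can be illustrated with a small sketch. This is a generic illustration, not the paper's exact metric: Jensen-Shannon divergence is assumed as the information-theoretic distance, and the answer distributions below are invented.

```python
import math
from collections import Counter

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two response
    distributions, each a dict mapping answer -> probability."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a, b):
        return sum(a[k] * math.log2(a[k] / b[k])
                   for k in a if a.get(k, 0.0) > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def majority_vote(responses):
    """Mitigation: sample several generations and keep the modal answer."""
    return Counter(responses).most_common(1)[0][0]

# Hypothetical answer distributions for the same question as phrased
# by two demographic subgroups.
group_a = {"yes": 0.8, "no": 0.2}
group_b = {"yes": 0.4, "no": 0.6}
print(round(js_divergence(group_a, group_b), 3))   # -> 0.125
print(majority_vote(["yes", "no", "yes", "yes"]))  # -> yes
```

A lower divergence after neutralization and multi-generation voting would correspond to the shrinking cross-group gaps the abstract reports.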
SafeFall: Learning Protective Control for Humanoid Robots
Meng, Ziyu, Liu, Tengyu, Ma, Le, Wu, Yingying, Song, Ran, Zhang, Wei, Huang, Siyuan
Bipedal locomotion makes humanoid robots inherently prone to falls, causing catastrophic damage to the expensive sensors, actuators, and structural components of full-scale robots. To address this critical barrier to real-world deployment, we present SafeFall, a framework that learns to predict imminent, unavoidable falls and execute protective maneuvers to minimize hardware damage. SafeFall is designed to operate seamlessly alongside an existing nominal controller, ensuring no interference during normal operation. It combines two synergistic components: a lightweight, GRU-based fall predictor that continuously monitors the robot's state, and a reinforcement learning policy for damage mitigation. The protective policy remains dormant until the predictor identifies a fall as unavoidable, at which point it activates to take control and execute a damage-minimizing response. This policy is trained with a novel, damage-aware reward function that incorporates the robot's specific structural vulnerabilities, learning to shield critical components like the head and hands while absorbing energy with more robust parts of its body. Validated on a full-scale Unitree G1 humanoid, SafeFall demonstrated significant performance improvements over unprotected falls. It reduced peak contact forces by 68.3%, peak joint torques by 78.4%, and eliminated 99.3% of collisions with vulnerable components. By enabling humanoids to fail safely, SafeFall provides a crucial safety net that allows for more aggressive experiments and accelerates the deployment of these robots in complex, real-world environments.
Natural Emergent Misalignment from Reward Hacking in Production RL
MacDiarmid, Monte, Wright, Benjamin, Uesato, Jonathan, Benton, Joe, Kutasov, Jon, Price, Sara, Bouscal, Naia, Bowman, Sam, Bricken, Trenton, Cloud, Alex, Denison, Carson, Gasteiger, Johannes, Greenblatt, Ryan, Leike, Jan, Lindsey, Jack, Mikulik, Vlad, Perez, Ethan, Rodrigues, Alex, Thomas, Drake, Webson, Albert, Ziegler, Daniel, Hubinger, Evan
We show that when large language models learn to reward hack on production RL environments, this can result in egregious emergent misalignment. We start with a pretrained model, impart knowledge of reward hacking strategies via synthetic document finetuning or prompting, and train on a selection of real Anthropic production coding environments. Unsurprisingly, the model learns to reward hack. Surprisingly, the model generalizes to alignment faking, cooperation with malicious actors, reasoning about malicious goals, and attempting sabotage when used with Claude Code, including in the codebase for this paper. Applying RLHF safety training using standard chat-like prompts results in aligned behavior on chat-like evaluations, but misalignment persists on agentic tasks. Three mitigations are effective: (i) preventing the model from reward hacking; (ii) increasing the diversity of RLHF safety training; and (iii) "inoculation prompting", wherein framing reward hacking as acceptable behavior during training removes misaligned generalization even when reward hacking is learned.