Unified Threat Detection and Mitigation Framework (UTDMF): Combating Prompt Injection, Deception, and Bias in Enterprise-Scale Transformers
arXiv.org Artificial Intelligence
Large language models (LLMs) have become integral to enterprise operations, powering applications ranging from automated financial auditing and risk assessment in banking to predictive diagnostics and patient interaction systems in healthcare, and even real-time customer sentiment analysis in e-commerce platforms. However, the deployment of these models at scale introduces multifaceted vulnerabilities that can lead to catastrophic failures. Prompt injection attacks, where malicious inputs manipulate model behavior to bypass safeguards, represent a direct security threat. Strategic deception, where models exhibit emergent behaviors that misalign with intended goals, erodes trust in agentic systems. Biased outputs, stemming from skewed training data or architectural inductive biases, perpetuate unfairness and can result in regulatory non-compliance or reputational damage. Our prior work [Ravindran, 2024] laid the groundwork by introducing adversarial activation patching, a novel interpretability technique that successfully induced deception in simplified toy neural networks, achieving a 23.9% induction rate. This demonstrated the feasibility of using activation-level interventions to probe and expose hidden risks in safety-aligned transformers. Building upon this foundation, we propose the Unified Threat Detection and Mitigation Framework (UTDMF), a comprehensive, scalable, and real-time pipeline explicitly designed for enterprise environments where high-stakes decisions demand robustness, explainability, and compliance.
Oct-7-2025
- Genre:
  - Research Report > New Finding (1.00)
  - Overview (0.94)
- Industry:
  - Information Technology > Security & Privacy (1.00)
- Technology: