Law
Spatio-Temporal Graph Unlearning
Guo, Qiming, Sun, Wenbo, Wang, Wenlu
Spatio-temporal graphs are widely used in modeling complex dynamic processes such as traffic forecasting, molecular dynamics, and healthcare monitoring. Recently, stringent privacy regulations such as GDPR and CCPA have introduced significant new challenges for existing spatio-temporal graph models, requiring complete unlearning of unauthorized data. Since each node in a spatio-temporal graph diffuses information globally across both spatial and temporal dimensions, existing unlearning methods primarily designed for static graphs and localized data removal cannot efficiently erase a single node without incurring costs nearly equivalent to full model retraining. Therefore, an effective approach for complete spatio-temporal graph unlearning is a pressing need. To address this, we propose CallosumNet, a divide-and-conquer spatio-temporal graph unlearning framework inspired by the corpus callosum structure that facilitates communication between the brain's two hemispheres. CallosumNet incorporates two novel techniques: (1) Enhanced Subgraph Construction (ESC), which adaptively constructs multiple localized subgraphs based on several factors, including biologically-inspired virtual ganglions; and (2) Global Ganglion Bridging (GGB), which reconstructs global spatio-temporal dependencies from these localized subgraphs, effectively restoring the full graph representation. Empirical results on four diverse real-world datasets show that CallosumNet achieves complete unlearning with only 1%-2% relative MAE loss compared to the gold model, significantly outperforming state-of-the-art baselines. Ablation studies verify the effectiveness of both proposed techniques.
mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models
Mukherjee, Arka, Ghosh, Shreya
Contemporary vision-language models (VLMs) perform well on existing multimodal reasoning benchmarks (78-85\% accuracy on MMMU, MathVista). Yet, these results fail to sufficiently distinguish true scientific reasoning articulation capabilities from pattern-matching. To address this gap, we introduce \textbf{mmJEE-Eval}, a multimodal bilingual (English and Hindi) benchmark comprising 1,460 questions from India's JEE Advanced examination (2019-2025) spanning pre-college Physics, Chemistry, and Mathematics domains. Our evaluation of 17 state-of-the-art models reveals that while frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash) achieve 77-84\% accuracy on held-out 2025 questions, open-source models plateau at 37-45\% despite scaling to 400B parameters, a significant difference not observed on existing benchmarks. While closed frontiers from Google and OpenAI show high problem-solving accuracies (up to 100\% pass@3 scores), they fully collapse when the reasoning load is increased meta-cognitively (GPT-5 fixes just 5.2\% errors). Systematic ablations show mmJEE-Eval's difficulty stems from complexity and reasoning depth rather than memorization. Effectively, our benchmark segregates superior training and reasoning methodologies where alternatives fail. We publicly release our code and data: https://mmjee-eval.github.io
GAMMA_FLOW: Guided Analysis of Multi-label spectra by MAtrix Factorization for Lightweight Operational Workflows
Rรคdle, Viola, Hartwig, Tilman, Oesen, Benjamin, Krรถger, Emily Alice, Vogt, Julius, Gericke, Eike, Baron, Martin
GAMMA_FLOW is an open-source Python package for real-time analysis of spectral data. It supports classification, denoising, decomposition, and outlier detection of both single- and multi-component spectra. Instead of relying on large, computationally intensive models, it employs a supervised approach to non-negative matrix factorization (NMF) for dimensionality reduction. This ensures a fast, efficient, and adaptable analysis while reducing computational costs. gamma_flow achieves classification accuracies above 90% and enables reliable automated spectral interpretation. Originally developed for gamma-ray spectra, it is applicable to any type of one-dimensional spectral data. As an open and flexible alternative to proprietary software, it supports various applications in research and industry.
Multi-step Predictive Coding Leads To Simplicity Bias
Predictive coding is a framework for understanding the formation of low-dimensional internal representations mirroring the environment's latent structure. The conditions under which such representations emerge remain unclear. In this work, we investigate how the prediction horizon and network depth shape the solutions of predictive coding tasks. Using a minimal abstract setting inspired by prior work, we show empirically and theoretically that sufficiently deep networks trained with multi-step prediction horizons consistently recover the underlying latent structure, a phenomenon explained through the Ordinary Least Squares estimator structure and biases in learning dynamics. We then extend these insights to nonlinear networks and complex datasets, including piecewise linear functions, MNIST, multiple latent states and higher dimensional state geometries. Our results provide a principled understanding of when and why predictive coding induces structured representations, bridging the gap between empirical observations and theoretical foundations.
A Neurosymbolic Approach to Natural Language Formalization and Verification
Bayless, Sam, Buliani, Stefano, Cassel, Darion, Cook, Byron, Clough, Duncan, Delmas, Rรฉmi, Diallo, Nafi, Erata, Ferhat, Feng, Nick, Giannakopoulou, Dimitra, Goel, Aman, Gokhale, Aditya, Hendrix, Joe, Hudak, Marc, Jovanoviฤ, Dejan, Kent, Andrew M., Kiesl-Reiter, Benjamin, Kuna, Jeffrey J., Labai, Nadia, Lilien, Joseph, Raghunathan, Divya, Rakamariฤ, Zvonimir, Razavi, Niloofar, Tautschnig, Michael, Torkamani, Ali, Weir, Nathaniel, Whalen, Michael W., Yao, Jianan
Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autofor-malization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our approach produces auditable logical artifacts that substantiate the verification outcomes and can be used to improve the original text. The content generation and reasoning capabilities of Large Language Models (LLMs) continue to advance rapidly, demonstrating unprecedented improvements in coherence and analytical accuracy (Wei et al., 2022; Y ao et al., 2023; Lewis et al., 2021). Despite these advances, their probabilistic nature and tendency to generate plausible but incorrect information (hallucinations, cf. Xu et al. 2024b) remain barriers to widespread adoption in regulated sectors. Industries such as healthcare, financial services, and legal practices have legal and regulatory obligations for accuracy and auditability that current LLM technology has yet to meet (Haltaufderheide & Ranisch, 2024). Companies develop institutional policies to ensure compliance with applicable laws and regulations. Such policies are typically captured in natural language (NL) documents that define rules, procedures, or guidelines. A challenge thus emerges when organizations look to deploy LLMs to answer questions about such documents: can we develop guardrails to ensure that LLM outputs conform to institutional policies? Consider an airline implementing a chatbot to assist customer service representatives in navigating refund policies: if the chatbot incorrectly claims that a customer is eligible for a refund when they are not, this could lead to legal exposure and loss of customer trust. An effective guardrail would help representatives decide if they can rely on a chatbot response without spending additional human effort to verify it. The key concern would be to ensure that when the guardrail reports an answer is valid, it actually is.
3D Guard-Layer: An Integrated Agentic AI Safety System for Edge Artificial Intelligence
Kurshan, Eren, Xie, Yuan, Franzon, Paul
--AI systems have found a wide range of real-world applications in recent years. The adoption of edge artificial intelligence, embedding AI directly into edge devices, is rapidly growing. Despite the implementation of guardrails and safety mechanisms, security vulnerabilities and challenges have become increasingly prevalent in this domain, posing a significant barrier to the practical deployment and safety of AI systems. This paper proposes an agentic AI safety architecture that leverages 3D to integrate a dedicated safety layer. It introduces an adaptive AI safety infrastructure capable of dynamically learning and mitigating attacks against the AI system. The system leverages the inherent advantages of co-location with the edge computing hardware to continuously monitor, detect and proactively mitigate threats to the AI system. The integration of local processing and learning capabilities enhances resilience against emerging network-based attacks while simultaneously improving system reliability, modularity, and performance, all with minimal cost and 3D integration overhead.
AI-generated podcasts: Synthetic Intimacy and Cultural Translation in NotebookLM's Audio Overviews
This paper analyses AI-generated podcasts produced by Google's NotebookLM, which generates audio podcasts with two chatty AI hosts discussing whichever documents a user uploads. While AI-generated podcasts have been discussed as tools, for instance in medical education, they have not yet been analysed as media. By uploading different types of text and analysing the generated outputs I show how the podcasts' structure is built around a fixed template. I also find that NotebookLM not only translates texts from other languages into a perky standardised Mid-Western American accent, it also translates cultural contexts to a white, educated, middle-class American default. This is a distinct development in how publics are shaped by media, marking a departure from the multiple public spheres that scholars have described in human podcasting from the early 2000s until today, where hosts spoke to specific communities and responded to listener comments, to an abstraction of the podcast genre.
Pattern Recognition of Scrap Plastic Misclassification in Global Trade Data
We propose an interpretable machine learning framework to help identify trade data discrepancies that are challenging to detect with traditional methods. Our system analyzes trade data to find a novel inverse price-volume signature, a pattern where reported volumes increase as average unit prices decrease. The model achieves 0.9375 accuracy and was validated by comparing large-scale UN data with detailed firm-level data, confirming that the risk signatures are consistent. This scalable tool provides customs authorities with a transparent, data-driven method to shift from conventional to priority-based inspection protocols, translating complex data into actionable intelligence to support international environmental policies.
Self-HarmLLM: Can Large Language Model Harm Itself?
Kim, Heehwan, Park, Sungjune, Choi, Daeseon
Large Language Models (LLMs) are generally equipped with guardrails to block the generation of harmful responses. However, existing defenses always assume that an external attacker crafts the harmful query, and the possibility of a model's own output becoming a new attack vector has not been sufficiently explored. In this study, we propose the Self-HarmLLM scenario, which uses a Mitigated Harmful Query (MHQ) generated by the same model as a new input. An MHQ is an ambiguous query whose original intent is preserved while its harmful nature is not directly exposed. We verified whether a jailbreak occurs when this MHQ is re-entered into a separate session of the same model. We conducted experiments on GPT-3.5-turbo, LLaMA3-8B-instruct, and DeepSeek-R1-Distill-Qwen-7B under Base, Zero-shot, and Few-shot conditions. The results showed up to 52% transformation success rate and up to 33% jailbreak success rate in the Zero-shot condition, and up to 65% transformation success rate and up to 41% jailbreak success rate in the Few-shot condition. By performing both prefix-based automated evaluation and human evaluation, we found that the automated evaluation consistently overestimated jailbreak success, with an average difference of 52%. This indicates that automated evaluation alone is not accurate for determining harmfulness. While this study is a toy-level study based on a limited query set and evaluators, it proves that our method can still be a valid attack scenario. These results suggest the need for a fundamental reconsideration of guardrail design and the establishment of a more robust evaluation methodology.
Knowledge Graph Analysis of Legal Understanding and Violations in LLMs
Jha, Abha, Salinas, Abel, Morstatter, Fred
The rise of Large Language Models (LLMs) offers transfor-mative potential for interpreting complex legal frameworks, such as Title 18 Section 175 of the US Code, which governs biological weapons. These systems hold promise for advancing legal analysis and compliance monitoring in sensitive domains. However, this capability comes with a troubling contradiction: while LLMs can analyze and interpret laws, they also demonstrate alarming vulnerabilities in generating unsafe outputs, such as actionable steps for bioweapon creation, despite their safeguards. To address this challenge, we propose a methodology that integrates knowledge graph construction with Retrieval-Augmented Generation (RAG) to systematically evaluate LLMs' understanding of this law, their capacity to assess legal intent (mens rea), and their potential for unsafe applications. Through structured experiments, we assess their accuracy in identifying legal violations, generating prohibited instructions, and detecting unlawful intent in bioweapons-related scenarios. Our findings reveal significant limitations in LLMs' reasoning and safety mechanisms, but they also point the way forward. By combining enhanced safety protocols with more robust legal reasoning frameworks, this research lays the groundwork for developing LLMs that can ethically and securely assist in sensitive legal domains--ensuring they act as protectors of the law rather than inadvertent enablers of its violation.