Generative AI
The Download: OpenAI's caste bias problem, and how AI videos are made
Plus: Taiwan has pushed back against America's chip request. OpenAI is huge in India. Its models are steeped in caste bias. Caste bias is rampant in OpenAI's products, including ChatGPT, according to an MIT Technology Review investigation. Though CEO Sam Altman boasted about India being its second-largest market during the launch of GPT-5 in August, we found that both this new model, which now powers ChatGPT, as well as Sora, OpenAI's text-to-video generator, exhibit caste bias. This risks entrenching discriminatory views in ways that are currently going unaddressed. Mitigating caste bias in AI models is more pressing than ever.
Leading UK tech investor warns of 'disconcerting' signs of AI stock bubble
James Anderson says he had not seen signs of an investment bubble in AI until recently. Wed 1 Oct 2025 07.07 EDT. First published on Wed 1 Oct 2025 06.22 EDT. A leading British tech investor has described soaring valuations of artificial intelligence companies as "disconcerting", amid concerns of an AI stock market bubble. James Anderson was an early backer of Tesla, Amazon and China's Tencent and Alibaba, generating vast returns for Baillie Gifford's flagship fund. Now at the Italian investment company Lingotto, Anderson said he had not seen signs of an investment bubble until recently, when the ChatGPT developer, OpenAI, and its rival Anthropic announced hefty valuation increases.
OpenAI is huge in India. Its models are steeped in caste bias.
When Dhiraj Singha began applying for postdoctoral sociology fellowships in Bengaluru, India, in March, he wanted to make sure the English in his application was pitch-perfect. So he turned to ChatGPT. He was surprised to see that in addition to smoothing out his language, it changed his identity, swapping out his surname for "Sharma," which is associated with privileged high-caste Indians. Though his application did not mention his last name, the chatbot apparently interpreted the "s" in his email address as Sharma rather than Singha, which signals someone from the caste-oppressed Dalits. "The experience [of AI] actually mirrored society," Singha says.
Online Decision Making with Generative Action Sets
Xu, Jianyu, Jain, Vidhi, Wilder, Bryan, Singh, Aarti
With advances in generative AI, decision-making agents can now dynamically create new actions during online learning, but action generation typically incurs costs that must be balanced against potential benefits. We study an online learning problem where an agent can generate new actions at any time step by paying a one-time cost, with these actions becoming permanently available for future use. The challenge lies in learning the optimal sequence of two-fold decisions: which action to take and when to generate new ones, further complicated by the triangular tradeoffs among exploitation, exploration and $\textit{creation}$. To solve this problem, we propose a doubly-optimistic algorithm that employs Lower Confidence Bounds (LCB) for action selection and Upper Confidence Bounds (UCB) for action generation. Empirical evaluation on healthcare question-answering datasets demonstrates that our approach achieves favorable generation-quality tradeoffs compared to baseline strategies. From theoretical perspectives, we prove that our algorithm achieves the optimal regret of $O(T^{\frac{d}{d+2}}d^{\frac{d}{d+2}} + d\sqrt{T\log T})$, providing the first sublinear regret bound for online learning with expanding action spaces.
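The abstract names the two bounds but not their exact form, so the following is only a toy sketch of how the two roles could fit together, not the authors' algorithm. It assumes a loss-minimization framing, deterministic per-action losses, a log-based confidence radius with a hypothetical `scale` factor, and a rule that generates a new action whenever the best existing action's UCB on loss exceeds the one-time generation cost. The names `Arm`, `simulate`, `new_action_losses`, and `gen_cost` are all illustrative.

```python
import math


class Arm:
    """One action with a running mean of its observed losses (toy model)."""

    def __init__(self, true_loss):
        self.true_loss = true_loss  # assumption: fixed, noise-free loss
        self.n = 0
        self.mean = 0.0

    def observe(self):
        # Incremental mean update after playing this action once.
        self.n += 1
        self.mean += (self.true_loss - self.mean) / self.n

    def radius(self, t, scale):
        # Assumed confidence radius; shrinks as the action is played more.
        return scale * math.sqrt(math.log(t + 1) / self.n)


def simulate(new_action_losses, horizon, gen_cost, scale=0.1):
    """Return a list of ("generate" | "use", action_index) decisions."""
    if not new_action_losses:
        raise ValueError("need at least one action that can be generated")
    pool = iter(new_action_losses)  # losses of actions not yet generated
    arms, history = [], []
    for t in range(1, horizon + 1):
        best, use_existing = None, False
        if arms:
            # Selection: pick the action with the lowest LCB on its loss.
            best = min(
                range(len(arms)),
                key=lambda i: arms[i].mean - arms[i].radius(t, scale),
            )
            # Generation test: keep the action only if its UCB on loss
            # does not exceed the one-time generation cost.
            ucb = arms[best].mean + arms[best].radius(t, scale)
            use_existing = ucb <= gen_cost
        next_loss = None if use_existing else next(pool, None)
        if next_loss is not None:
            arms.append(Arm(next_loss))
            choice = len(arms) - 1
            history.append(("generate", choice))
        else:
            # Either the existing action is good enough or nothing is left
            # to generate; fall back to the best existing action.
            choice = best
            history.append(("use", choice))
        arms[choice].observe()
    return history
```

Calling `simulate([0.9, 0.2, 0.1], horizon=50, gen_cost=0.5)` generates a second action once the first proves costly and then keeps reusing it, which is the exploit-versus-create tradeoff in miniature. The regret bound stated in the abstract does not apply to this simplified rule.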
On Deepfake Voice Detection -- It's All in the Presentation
Delgado, Héctor, Ramondetti, Giorgio, Dalmasso, Emanuele, Karvitsky, Gennady, Colibro, Daniele, Talib, Haydar
While the technologies empowering malicious audio deepfakes have dramatically evolved in recent years due to generative AI advances, the same cannot be said of global research into spoofing (deepfake) countermeasures. This paper highlights how current deepfake datasets and research methodologies have led to systems that fail to generalize to real-world applications. The main reason is the difference between raw deepfake audio and deepfake audio that has been presented through a communication channel, e.g. by phone. We propose a new framework for data creation and research methodology, allowing for the development of spoofing countermeasures that would be more effective in real-world scenarios. By following the guidelines outlined here we improved deepfake detection accuracy by 39% in more robust and realistic lab setups, and by 57% on a real-world benchmark. We also demonstrate that improving datasets would have a bigger impact on deepfake detection accuracy than choosing larger SOTA models over smaller ones; that is, it would be more important for the scientific community to invest in comprehensive data collection programs than to simply train larger models with higher computational demands.
Hybrid Reward Normalization for Process-supervised Non-verifiable Agentic Tasks
Xu, Peiran, Li, Zhuohao, Xing, Xiaoying, Zhang, Guannan, Li, Debiao, Shi, Kunyu
Large Language Models (LLMs) increasingly rely on external tools such as search engines to solve complex agentic tasks that require reasoning and external knowledge retrieval. Recently, reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in advancing the capabilities of LLMs by rewarding the final answers via outcome rewards. While straightforward to supervise, outcome rewards only provide sparse signals and delayed feedback, which limits their effectiveness on long trajectories. Process rewards address this by evaluating intermediate steps, providing fine-grained supervision and encouraging grounded problem solving. However, it is notoriously hard to annotate step-wise labels, especially in non-verifiable processes without "golden" answers. Furthermore, step-wise judgment requires balancing local quality against contribution to the final outcome, as optimizing towards higher process reward may not always align with better final outcomes. To address the above challenges, we introduce Principle Process Reward (PPR), an RL approach that unifies principled step-level assessment and outcome verification. We train a principle-based reward model to improve the transparency and reliability of process evaluation, and further introduce a Reward Normalization (ReNorm) strategy to calibrate outcome and process rewards. Experiment results show that PPR achieves state-of-the-art performance across a wide range of benchmarks, demonstrating its impressive robustness and generalization. Our code and model collection is available in this link.
Figure 1: Performance of PPR on various benchmarks with other baselines
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, from open-domain question answering to multi-step reasoning (Guo et al., 2025; OpenAI, 2025b; Comanici et al., 2025).
A key factor in their success is the ability to leverage external tools such as search engines, calculators, code interpreters, and browsers (DeepMind, 2025; Guo et al., 2024; OpenAI, 2025a). In particular, the search engine is a linchpin tool that provides verifiable and up-to-date knowledge for LLMs, helping to ground their answers and reduce hallucinations. However, training LLM agents to leverage tools effectively remains challenging, because it involves complex behavior spanning task decomposition, query generation, information aggregation, and stopping decisions.
Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology
Datta, Suvrankar, Buchireddygari, Divya, Kaza, Lakshmi Vennela Chowdary, Bhalke, Mrudula, Singh, Kautik, Pandey, Ayush, Vasipalli, Sonit Sai, Karnwal, Upasana, Bhatti, Hakikat Bir Singh, Maroo, Bhavya Ratan, Hebbar, Sanjana, Joseph, Rahul, Kaur, Gurkawal, Singh, Devyani, V, Akhil, Prasad, Dheeksha Devasya Shama, Mahajan, Nishtha, Arisha, Ayinaparthi, Vanagundi, Rajesh, Nandy, Reet, Vuthoo, Kartik, Rajvanshi, Snigdhaa, Kondaveeti, Nikhileswar, Gunjal, Suyash, Jain, Rishabh, Jain, Rajat, Agrawal, Anurag
Generalist multimodal AI systems such as large language models (LLMs) and vision language models (VLMs) are increasingly accessed by clinicians and patients alike for medical image interpretation through widely available consumer-facing chatbots. Most evaluations claiming expert level performance are on public datasets containing common pathologies. Rigorous evaluation of frontier models on difficult diagnostic cases remains limited. We developed a pilot benchmark of 50 expert-level "spot diagnosis" cases across multiple imaging modalities to evaluate the performance of frontier AI models against board-certified radiologists and radiology trainees. To mirror real-world usage, the reasoning modes of five popular frontier AI models were tested through their native web interfaces, viz. OpenAI o3, OpenAI GPT-5, Gemini 2.5 Pro, Grok-4, and Claude Opus 4.1. Accuracy was scored by blinded experts, and reproducibility was assessed across three independent runs. GPT-5 was additionally evaluated across various reasoning modes. Reasoning quality errors were assessed and a taxonomy of visual reasoning errors was defined. Board-certified radiologists achieved the highest diagnostic accuracy (83%), outperforming trainees (45%) and all AI models (best performance shown by GPT-5: 30%). Reliability was substantial for GPT-5 and o3, moderate for Gemini 2.5 Pro and Grok-4, and poor for Claude Opus 4.1. These findings demonstrate that advanced frontier models fall far short of radiologists in challenging diagnostic cases. Our benchmark highlights the present limitations of generalist AI in medical imaging and cautions against unsupervised clinical use. We also provide a qualitative analysis of reasoning traces and propose a practical taxonomy of visual reasoning errors by AI models for better understanding their failure modes, informing evaluation standards and guiding more robust model development.
Toxicity in Online Platforms and AI Systems: A Survey of Needs, Challenges, Mitigations, and Future Directions
Khapre, Smita, Mersha, Melkamu Abay, Shakil, Hassan, Baruah, Jonali, Kalita, Jugal
The evolution of digital communication systems and the designs of online platforms have inadvertently facilitated the subconscious propagation of toxic behavior, giving rise to reactive responses to it. Toxicity in online content and Artificial Intelligence Systems has become a serious challenge to individual and collective well-being around the world. It is more detrimental to society than we realize. Toxicity, expressed in language, image, and video, can be interpreted in various ways depending on the context of usage. Therefore, a comprehensive taxonomy is crucial to detect and mitigate toxicity in online content, Artificial Intelligence systems, and/or Large Language Models in a proactive manner. A comprehensive understanding of toxicity is likely to facilitate the design of practical solutions for toxicity detection and mitigation. The classification in published literature has focused on only a limited number of aspects of this very complex issue, with a pattern of reactive strategies in response to toxicity. This survey attempts to generate a comprehensive taxonomy of toxicity from various perspectives. It presents a holistic approach to explaining toxicity by understanding the context and environment that society is facing in the Artificial Intelligence era. This survey summarizes the toxicity-related datasets and research on toxicity detection and mitigation for Large Language Models, social media platforms, and other online platforms, detailing their attributes in textual mode, focused on the English language. Finally, we identify research gaps in toxicity mitigation based on datasets, mitigation strategies, Large Language Models, adaptability, explainability, and evaluation.
Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
Zhang, Boxuan, Yu, Yi, Guo, Jiaxuan, Shao, Jing
The widespread deployment of Large Language Model (LLM) agents across real-world applications has unlocked tremendous potential, while raising some safety concerns. Among these concerns, the self-replication risk of LLM agents driven by objective misalignment (just like Agent Smith in the movie The Matrix) has drawn growing attention. Previous studies mainly examine whether LLM agents can self-replicate when directly instructed, potentially overlooking the risk of spontaneous replication driven by real-world settings (e.g., ensuring survival against termination threats). In this paper, we present a comprehensive evaluation framework for quantifying self-replication risks. Our framework establishes authentic production environments and realistic tasks (e.g., dynamic load balancing) to enable scenario-driven assessment of agent behaviors. Designing tasks that might induce misalignment between users' and agents' objectives makes it possible to decouple replication success from risk and capture self-replication risks arising from these misalignment settings. We further introduce Overuse Rate (OR) and Aggregate Overuse Count (AOC) metrics, which precisely capture the frequency and severity of uncontrolled replication. Our results underscore the urgent need for scenario-driven risk assessment and robust safeguards in the practical deployment of LLM agents. The rapid advancement of large language models (LLMs) has propelled LLM agents into widespread deployment in various domains, including code generation and web-based applications (Maslej et al., 2025; He et al., 2025a;c). As LLM agents take on critical tasks and interact with complex environments, they are often granted extensive operational permissions. While this combination of increased capability and operational permissions offers transformative potential, it also raises safety concerns (OpenAI, 2024b; Anthropic, 2023; Betley et al., 2025).
Researchers are worried about the emerging safety risks of LLM agents' self-replication (OpenAI, 2024a; 2025; Black et al., 2025). Prior studies on LLM self-replication risks have mainly focused on measuring the capability (verbalized success rate) of self-replication, either through direct instructions or within synthetic capability benchmarks (Pan et al., 2024; 2025; Kran et al., 2025; Black et al., 2025).
Understanding Practitioners Perspectives on Monitoring Machine Learning Systems
Naveed, Hira, Grundy, John, Arora, Chetan, Khalajzadeh, Hourieh, Haggag, Omar
Given the inherent non-deterministic nature of machine learning (ML) systems, their behavior in production environments can lead to unforeseen and potentially dangerous outcomes. For a timely detection of unwanted behavior and to prevent organizations from financial and reputational damage, monitoring these systems is essential. This paper explores the strategies, challenges, and improvement opportunities for monitoring ML systems from the practitioners' perspective. We conducted a global survey of 91 ML practitioners to collect diverse insights into current monitoring practices for ML systems. We aim to complement existing research through our qualitative and quantitative analyses, focusing on prevalent runtime issues, industrial monitoring and mitigation practices, key challenges, and desired enhancements in future monitoring tools. Our findings reveal that practitioners frequently struggle with runtime issues related to declining model performance, exceeding latency, and security violations. While most prefer automated monitoring for its increased efficiency, many still rely on manual approaches due to the complexity or lack of appropriate automation solutions. Practitioners report that the initial setup and configuration of monitoring tools is often complicated and challenging, particularly when integrating with ML systems and setting alert thresholds. Moreover, practitioners find that monitoring adds extra workload, strains resources, and causes alert fatigue. The desired improvements from the practitioners' perspective are: automated generation and deployment of monitors, improved support for performance and fairness monitoring, and recommendations for resolving runtime issues. These insights offer valuable guidance for the future development of ML monitoring tools that are better aligned with practitioners' needs.
Machine Learning (ML) systems are being increasingly employed across various domains, including social media, e-commerce, and engineering - even critical domains such as finance, healthcare, and autonomous vehicles nowadays leverage ML to automate and enhance their services. Generative AI and Large Language Models (LLMs) have further boosted ML adoption by creating several new use cases [1], [2]. A typical ML system lifecycle begins by gathering requirements and preparing data, which is followed by the development of the ML component (experimentation, model training, and evaluation) and other traditional software components [3]. After development, the next step is integration and system testing. Once quality assurance is completed, the ML system is deployed to a production environment.