Generative AI
DIRF: A Framework for Digital Identity Protection and Clone Governance in Agentic AI Systems
Atta, Hammad, Baig, Muhammad Zeeshan, Mehmood, Yasir, Shahzad, Nadeem, Huang, Ken, Haq, Muhammad Aziz Ul, Awais, Muhammad, Ahmed, Kamal, Green, Anthony
The rapid advancement and widespread adoption of generative artificial intelligence (AI) pose significant threats to the integrity of personal identity, including digital cloning, sophisticated impersonation, and the unauthorized monetization of identity-related data. Mitigating these risks necessitates the development of robust AI-generated content detection systems, enhanced legal frameworks, and ethical guidelines. This paper introduces the Digital Identity Rights Framework (DIRF), a structured security and governance model designed to protect behavioral, biometric, and personality-based digital likeness attributes to address this critical need. Structured across nine domains and 63 controls, DIRF integrates legal, technical, and hybrid enforcement mechanisms to secure digital identity consent, traceability, and monetization. We present the architectural foundations, enforcement strategies, and key use cases supporting the need for a unified framework. This work aims to inform platform builders, legal entities, and regulators about the essential controls needed to enforce identity rights in AI-driven systems.
Breaking the ICE: Exploring promises and challenges of benchmarks for Inference Carbon & Energy estimation for LLMs
Sikand, Samarth, Mehra, Rohit, Pathania, Priyavanshi, Bamby, Nikhil, Sharma, Vibhu Saujanya, Kaulgud, Vikrant, Podder, Sanjay, Burden, Adam P.
While Generative AI stands to be one of the fastest adopted technologies ever, studies have made evident that the usage of Large Language Models (LLMs) puts a significant burden on energy grids and our environment. It may prove a hindrance to the Sustainability goals of any organization. A crucial step in any Sustainability strategy is monitoring or estimating the energy consumption of various components. While there exist multiple tools for monitoring energy consumption, there is a dearth of tools/frameworks for estimating the consumption or carbon emissions. Current drawbacks of both monitoring and estimation tools include high input data points, intrusive nature, high error margin, etc. We posit that leveraging emerging LLM benchmarks and related data points can help overcome the aforementioned challenges while balancing accuracy of the emission estimations. To that end, we discuss the challenges of current approaches and present our evolving framework, R-ICE, which estimates prompt-level inference carbon emissions by leveraging existing state-of-the-art (SOTA) benchmarks. This direction provides a more practical and non-intrusive way to enable emerging use-cases like dynamic LLM routing, carbon accounting, etc. Our promising validation results suggest that benchmark-based modelling holds great potential for inference emission estimation and warrants further exploration from the scientific community.
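The abstract does not describe how R-ICE computes its estimates, so the following is only a minimal sketch of the general idea of benchmark-based estimation, not the authors' method. It assumes a benchmark publishes per-token energy figures for a model and hardware pair; the `BenchmarkEntry` type, its field names, and the numbers in the usage note are hypothetical.

```python
from dataclasses import dataclass

JOULES_PER_KWH = 3_600_000  # 1 kWh = 3.6e6 J


@dataclass
class BenchmarkEntry:
    """Hypothetical benchmark record for one model/hardware pair."""
    joules_per_input_token: float   # assumed to be reported by a benchmark
    joules_per_output_token: float  # assumed to be reported by a benchmark


def estimate_prompt_emissions(
    entry: BenchmarkEntry,
    input_tokens: int,
    output_tokens: int,
    grid_intensity_g_per_kwh: float,
) -> float:
    """Return an estimate of grams of CO2e emitted by a single prompt."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    # Energy is modelled as linear in token counts (a simplifying assumption).
    joules = (
        input_tokens * entry.joules_per_input_token
        + output_tokens * entry.joules_per_output_token
    )
    # Convert energy to kWh, then scale by the grid's carbon intensity.
    return joules / JOULES_PER_KWH * grid_intensity_g_per_kwh
```

For example, with made-up figures of 0.5 J per input token and 3.0 J per output token, a prompt with 1000 input and 500 output tokens uses 2000 J; on a grid at 400 gCO2e/kWh that is about 0.22 g. No instrumentation of the serving hardware is needed, which is the non-intrusive property the abstract highlights.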
Project Riley: Multimodal Multi-Agent LLM Collaboration with Emotional Reasoning and Voting
Ortigoso, Ana Rita, Vieira, Gabriel, Fuentes, Daniel, Frazão, Luis, Costa, Nuno, Pereira, António
This paper presents Project Riley, a novel multimodal and multi-model conversational AI architecture oriented towards the simulation of reasoning influenced by emotional states. Drawing inspiration from Pixar's Inside Out, the system comprises five distinct emotional agents - Joy, Sadness, Fear, Anger, and Disgust - that engage in structured multi-round dialogues to generate, criticise, and iteratively refine responses. A final reasoning mechanism synthesises the contributions of these agents into a coherent output that either reflects the dominant emotion or integrates multiple perspectives. The architecture incorporates both textual and visual large language models (LLMs), alongside advanced reasoning and self-refinement processes. A functional prototype was deployed locally in an offline environment, optimised for emotional expressiveness and computational efficiency. From this initial prototype, another one emerged, called Armando, which was developed for use in emergency contexts, delivering emotionally calibrated and factually accurate information through the integration of Retrieval-Augmented Generation (RAG) and cumulative context tracking. The Project Riley prototype was evaluated through user testing, in which participants interacted with the chatbot and completed a structured questionnaire assessing three dimensions: Emotional Appropriateness, Clarity and Utility, and Naturalness and Human-likeness. The results indicate strong performance in structured scenarios, particularly with respect to emotional alignment and communicative clarity.
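The abstract says the final step either reflects the dominant emotion or integrates multiple perspectives, but it does not specify the voting rule. The sketch below is one possible rule, stated as an assumption: each agent votes for the draft it prefers, a strict majority yields a dominant emotion, and anything else falls back to integrating the leading drafts. The function name `tally_votes` and the majority threshold are hypothetical.

```python
from collections import Counter

# The five agents named in the abstract.
EMOTIONS = ("Joy", "Sadness", "Fear", "Anger", "Disgust")


def tally_votes(votes: dict[str, str]) -> tuple[str, list[str]]:
    """Decide how to build the final answer from agent votes.

    `votes` maps each voting agent to the agent whose draft it prefers.
    Returns ("dominant", [winner]) when one draft has a strict majority,
    otherwise ("integrate", leaders) with the top-voted drafts, sorted by name.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    unknown = [e for e in list(votes) + list(votes.values()) if e not in EMOTIONS]
    if unknown:
        raise ValueError(f"unknown agents: {unknown}")

    counts = Counter(votes.values())
    top = max(counts.values())
    leaders = sorted(e for e, c in counts.items() if c == top)
    # Strict majority rule (an assumption; the paper may use another rule).
    if top > len(votes) / 2:
        return "dominant", leaders
    return "integrate", leaders
```

With three of five votes for Fear, the call returns `("dominant", ["Fear"])`; with a 2-2-1 split between Joy, Fear, and Anger it returns `("integrate", ["Fear", "Joy"])`, signalling that the synthesis step should merge both leading drafts.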
Human-AI Collaboration or Academic Misconduct? Measuring AI Use in Student Writing Through Stylometric Evidence
Oliveira, Eduardo Araujo, Mohoni, Madhavi, López-Pernas, Sonsoles, Saqr, Mohammed
Human-Artificial Intelligence (HAI) collaboration in writing offers opportunities to enhance efficiency and boost student confidence; however, it also carries risks, such as reduced creativity, over-reliance on AI-generated content, and academic integrity (Kim & Lee, 2023). While the ethical use of AI in education is widely acknowledged as a way to enhance student learning (Cotton et al., 2023; Foltynek et al., 2023), the rise of Unauthorised Content Generation (UCG) presents a significant challenge to academic misconduct. Measuring the extent and nature of HAI collaboration in academic contexts remains a critical challenge for educators, particularly as generative AI (genAI) tools become increasingly available and integrated into educational settings (Atchley et al., 2024; E. Oliveira et al., 2023). Distinguishing AI-generated text from human-authored content is necessary for understanding student learning behaviours, supporting skill development, and maintaining academic integrity. Analysing student writing patterns can help educators evaluate how students engage with AI tools, track their writing skill progression, and identify areas where additional support is needed (Pan et al., 2025). Existing detection tools for AI-assisted misconduct often lack reliability, explainability, and resilience to circumvention strategies such as paraphrasing (Cotton et al., 2023). These challenges highlight the need for innovative, transparent, and robust approaches to address the unacknowledged use of genAI in HAI collaboration within academic writing (Kasneci et al., 2023).
Can LLMs Simulate Personas with Reversed Performance? A Benchmark for Counterfactual Instruction Following
Kumar, Sai Adith Senthil, Yan, Hao, Perepa, Saipavan, Yue, Murong, Yao, Ziyu
Large Language Models (LLMs) are increasingly used to simulate personas in virtual environments, leveraging their instruction-following capability. However, we discovered that even state-of-the-art LLMs cannot simulate personas with reversed performance (e.g., student personas with low proficiency in educational settings), which impairs the simulation diversity and limits the practical applications of the simulated environments. In this work, using mathematical reasoning as a representative scenario, we propose the first benchmark dataset for evaluating LLMs on simulating personas with reversed performance, a capability that we dub "counterfactual instruction following". We evaluate both open-weight and closed-source LLMs on this task and find that LLMs, including the OpenAI o1 reasoning model, all struggle to follow counterfactual instructions for simulating reversedly performing personas. Intersectionally simulating both the performance level and the race population of a persona worsens the effect even further. These results highlight the challenges of counterfactual instruction following and the need for further research.
Mitigation of Gender and Ethnicity Bias in AI-Generated Stories through Model Explanations
Dimgba, Martha O., Oba, Sharon, Agrawal, Ameeta, Giabbanelli, Philippe J.
Language models have been shown to propagate social bias through their output, particularly in the representation of gender and ethnicity. This paper investigates gender and ethnicity biases in AI-generated occupational stories. Representation biases are measured before and after applying our proposed mitigation strategy, Bias Analysis and Mitigation through Explanation (BAME), revealing improvements in demographic representation ranging from 2% to 20%. BAME leverages model-generated explanations to inform targeted prompt engineering, effectively reducing biases without modifying model parameters. By analyzing stories generated across 25 occupational groups, three large language models (Claude 3.5 Sonnet, Llama 3.1 70B Instruct, and GPT-4 Turbo), and multiple demographic dimensions, we identify persistent patterns of overrepresentation and underrepresentation linked to training data stereotypes. Our findings demonstrate that guiding models with their own internal reasoning mechanisms can significantly enhance demographic parity, thereby contributing to the development of more transparent generative AI systems.
Exploring persuasive interactions with generative social robots: An experimental framework
Vonschallen, Stephan, Finsler, Larissa Julia Corina, Schmiedel, Theresa, Eyssel, Friederike
Integrating generative AI such as Large Language Models into social robots has improved their ability to engage in natural, human-like communication. This study presents a method to examine their persuasive capabilities. We designed an experimental framework focused on decision making and tested it in a pilot that varied robot appearance and self-knowledge. Using qualitative analysis, we evaluated interaction quality, persuasion effectiveness, and the robot's communicative strategies. Participants generally experienced the interaction positively, describing the robot as competent, friendly, and supportive, while noting practical limits such as delayed responses and occasional speech-recognition errors. Persuasiveness was highly context dependent and shaped by robot behavior: Participants responded well to polite, reasoned suggestions and expressive gestures, but emphasized the need for more personalized, context-aware arguments and clearer social roles. These findings suggest that generative social robots can influence user decisions, but their effectiveness depends on communicative nuance and contextual relevance. We propose refinements to the framework to further study persuasive dynamics between robots and human users.
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
Wang, Haoming, Zou, Haoyang, Song, Huatong, Feng, Jiazhan, Fang, Junjie, Lu, Junting, Liu, Longxiang, Luo, Qinyu, Liang, Shihao, Huang, Shijue, Zhong, Wanjun, Ye, Yining, Qin, Yujia, Xiong, Yuwen, Song, Yuxin, Wu, Zhiyong, Li, Aoyan, Li, Bo, Dun, Chen, Liu, Chong, Zan, Daoguang, Leng, Fuxing, Wang, Hanbin, Yu, Hao, Chen, Haobin, Guo, Hongyi, Su, Jing, Huang, Jingjia, Shen, Kai, Shi, Kaiyu, Yan, Lin, Zhao, Peiyao, Liu, Pengfei, Ye, Qinghao, Zheng, Renjie, Xin, Shulin, Zhao, Wayne Xin, Heng, Wen, Huang, Wenhao, Wang, Wenqian, Qin, Xiaobo, Lin, Yi, Wu, Youbin, Chen, Zehui, Wang, Zihao, Zhong, Baoquan, Zhang, Xinchun, Li, Xujing, Li, Yuanfan, Zhao, Zhongkai, Jiang, Chengquan, Wu, Faming, Zhou, Haotian, Pang, Jinlin, Han, Li, Liu, Qi, Ma, Qianli, Liu, Siyao, Cai, Songhua, Fu, Wenqi, Liu, Xin, Wang, Yaohui, Zhang, Zhi, Zhou, Bo, Li, Guoliang, Shi, Jiajun, Yang, Jiale, Tang, Jie, Li, Li, Han, Qihua, Lu, Taoran, Lin, Woyu, Tong, Xiaokang, Li, Xinyao, Zhang, Yichi, Miao, Yu, Jiang, Zhengxuan, Li, Zili, Zhao, Ziyuan, Li, Chenxin, Ma, Dehua, Lin, Feng, Zhang, Ge, Yang, Haihua, Guo, Hangyu, Zhu, Hongda, Liu, Jiaheng, Du, Junda, Cai, Kai, Li, Kuanye, Yuan, Lichen, Han, Meilan, Wang, Minchao, Guo, Shuyue, Cheng, Tianhao, Ma, Xiaobo, Xiao, Xiaojun, Huang, Xiaolong, Chen, Xinjie, Du, Yidi, Chen, Yilin, Wang, Yiwen, Li, Zhaojian, Yang, Zhenzhu, Zeng, Zhiyuan, Jin, Chaolin, Li, Chen, Chen, Hao, Chen, Haoli, Chen, Jian, Zhao, Qinghao, Shi, Guang
The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and exhibit strong generalization to real-world interactive scenarios.
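The abstract reports a mean normalized score relative to human-level performance but does not give the formula. The sketch below uses the common human-normalized form, in which a random baseline maps to 0 and human performance maps to 100, as an assumption; the function names and the sample scores in the usage note are hypothetical and are not taken from the report.

```python
def human_normalized(agent: float, human: float, random_baseline: float = 0.0) -> float:
    """Scale a raw game score so that random play is 0 and human play is 100."""
    if human == random_baseline:
        raise ValueError("human and random baseline scores must differ")
    return 100.0 * (agent - random_baseline) / (human - random_baseline)


def mean_normalized(scores: list[tuple[float, float, float]]) -> float:
    """Average the normalized score over games.

    Each item is (agent_score, human_score, random_baseline_score).
    """
    if not scores:
        raise ValueError("at least one game is required")
    per_game = [human_normalized(a, h, r) for a, h, r in scores]
    return sum(per_game) / len(per_game)
```

As a usage example, `mean_normalized([(50, 100, 0), (30, 40, 20)])` normalizes both games to 50 and returns 50.0. Averaging per-game normalized scores, rather than raw scores, keeps games with large point scales from dominating the mean.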
DuckDuckGo's paid plan now includes advanced AI models like GPT-5
DuckDuckGo is expanding its paid subscription with access to several of the most advanced AI models on the market. Subscribers can now access OpenAI's GPT-4o and GPT-5, Anthropic's Claude Sonnet 4, and Meta's Llama Maverick via Duck.ai. To protect privacy, Duck.ai hides the user's IP address from the AI model providers, and chat logs are saved locally and aren't used to train the AI models. In addition, there's a special "Fire Button" that lets users instantly delete previous conversations and chat histories. The price of the subscription remains unchanged at $9.99/month or $99/year.
Using generative AI, researchers design compounds that can kill drug-resistant bacteria
With help from artificial intelligence, MIT researchers have designed novel antibiotics that can combat two hard-to-treat infections: drug-resistant Neisseria gonorrhoeae and multi-drug-resistant Staphylococcus aureus (MRSA). Using generative AI algorithms, the research team designed more than 36 million possible compounds and computationally screened them for antimicrobial properties. The top candidates they discovered are structurally distinct from any existing antibiotics, and they appear to work by novel mechanisms that disrupt bacterial cell membranes. This approach allowed the researchers to generate and evaluate theoretical compounds that have never been seen before -- a strategy that they now hope to apply to identify and design compounds with activity against other species of bacteria. "We're excited about the new possibilities that this project opens up for antibiotics development. Our work shows the power of AI from a drug design standpoint, and enables us to exploit much larger chemical spaces that were previously inaccessible," says James Collins, the Termeer Professor of Medical Engineering and Science in MIT's Institute for Medical Engineering and Science (IMES) and Department of Biological Engineering, and a member of the Broad Institute.