Generative AI
Masayoshi Son and Sam Altman see no end to AI demand and scaling
SoftBank founder Masayoshi Son and OpenAI chief Sam Altman see insatiable demand for artificial intelligence (AI) that makes it imperative to keep building ever more computing capacity. Speaking via teleconference at SoftBank World, the two business partners argued that advancing AI would lead to new jobs that are not yet imagined, and the advancement of robotics will help kickstart a "self-improvement" loop. "As we drive the cost of AI down, more people want to use it," Altman said in response to Son's question about diminishing returns from further expansion. "So if we make the cost of AI 10 times cheaper, people wanna use it 30 times as much or whatever. And the demand for intelligence in the world just seems to be huge."
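Taken at face value, Altman's figures imply that total spending on AI grows as the unit cost falls. The sketch below works through that arithmetic. The 10x and 30x values are his offhand illustration ("or whatever"), not measured data, and the constant-elasticity reading is an assumption added here, not something either speaker stated.

```python
import math

def implied_elasticity(cost_drop_factor: float, usage_growth_factor: float) -> float:
    """Assumed constant-elasticity model: usage ~ cost^(-e).

    Solving usage_growth = cost_drop^e for e gives ln(usage_growth) / ln(cost_drop).
    """
    return math.log(usage_growth_factor) / math.log(cost_drop_factor)

def spend_multiplier(cost_drop_factor: float, usage_growth_factor: float) -> float:
    """Total spend = unit cost * usage, so spend scales by usage growth / cost drop."""
    return usage_growth_factor / cost_drop_factor

# Altman's illustrative numbers: cost falls 10x, usage rises 30x.
elasticity = implied_elasticity(10, 30)
spend = spend_multiplier(10, 30)

print(f"implied elasticity: {elasticity:.2f}")
print(f"total spend multiplier: {spend:.1f}x")
```

Any elasticity above 1 means cheaper AI increases total spending, which is the core of the argument against diminishing returns.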
Another High-Profile OpenAI Researcher Departs for Meta
OpenAI researcher Jason Wei is joining Meta's new superintelligence lab, according to multiple sources familiar with the matter. Wei worked on OpenAI's o3 and deep research models, according to his personal website. He joined OpenAI in 2023 after a stint at Google, where he worked on chain-of-thought research, which involves training an AI model to process complex queries step-by-step. At OpenAI, Wei became a self-described "diehard" for reinforcement learning, a method of training or refining an AI model with positive or negative feedback. It's become a promising area of AI research--one that several of the researchers Meta has hired for its superintelligence team specialize in.
Visually grounded emotion regulation via diffusion models and user-driven reappraisal
Pinzuti, Edoardo, Tüscher, Oliver, Castro, André Ferreira
Cognitive reappraisal is a key strategy in emotion regulation, involving reinterpretation of emotionally charged stimuli to alter affective responses. Despite its central role in clinical and cognitive science, real-world reappraisal interventions remain cognitively demanding, abstract, and primarily verbal. This reliance on higher-order cognitive and linguistic processes is often impaired in individuals with trauma or depression, limiting the effectiveness of standard approaches. Here, we propose a novel, visually based augmentation of cognitive reappraisal by integrating large-scale text-to-image diffusion models into the emotional regulation process. Specifically, we introduce a system in which users reinterpret emotionally negative images via spoken reappraisals, which are transformed into supportive, emotionally congruent visualizations using stable diffusion models with a fine-tuned IP-adapter. This generative transformation visually instantiates users' reappraisals while maintaining structural similarity to the original stimuli, externalizing and reinforcing regulatory intent. To test this approach, we conducted a within-subject experiment (N = 20) using a modified cognitive emotion regulation (CER) task. Participants reappraised or described aversive images from the International Affective Picture System (IAPS), with or without AI-generated visual feedback. Results show that AI-assisted reappraisal significantly reduced negative affect compared to both non-AI and control conditions. Further analyses reveal that sentiment alignment between participant reappraisals and generated images correlates with affective relief, suggesting that multimodal coherence enhances regulatory efficacy. These findings demonstrate that generative visual input can support cognitive reappraisal and open new directions at the intersection of generative AI, affective computing, and therapeutic technology.
Evaluating Multimodal Large Language Models on Educational Textbook Question Answering
Alawwad, Hessa A., Zafar, Anas, Alhothali, Areej, Naseem, Usman, Alkhathlan, Ali, Jamal, Amani
Faculty of Computing and Information Technology & Center of Research Excellence in AI and Data Science, King Abdulaziz University, Jeddah, Saudi Arabia. Abstract: Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art MLLMs, including LLaVA-1.5 and LLaMA 3.2-Vision, on the textbook question answering (TQA) task using the CK12-QA dataset. We introduce a multimodal retrieval-augmented generation (RAG) pipeline to simulate real-world learning by providing relevant lesson paragraphs and diagrams as context. Our zero-shot experiments reveal a critical trade-off: while retrieved context improves LLaVA's performance on text-based questions, it significantly degrades the accuracy of the more powerful LLaMA 3.2-Vision on diagram-based tasks, dropping its validation accuracy from 74.07% to 25.93%. Furthermore, fine-tuning highlights architectural differences: LLaMA 3.2-Vision's performance substantially improves to 71.16% on the test set, demonstrating its capacity to learn multimodal integration, whereas LLaVA's performance declines, indicating challenges with generalization. Our results underscore the challenges MLLMs face in modality prioritization and context integration, providing a benchmark and pointing to key directions for developing more robust AI-driven educational tools. This work has been accepted to the 2nd International Generative AI and Computational Language Modelling Conference (GACLM 2025) for publication in the proceedings. Answering curriculum-related questions in multimodal educational materials is a central challenge in AI for education, requiring systems to reason across complex multimodal contexts such as lengthy lessons, diagrams, and videos.
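The retrieval step of a RAG pipeline like the one described above can be illustrated with a toy, text-only sketch. The abstract does not specify the retriever, so the bag-of-words cosine ranking, the function names, and the sample lesson paragraphs below are all hypothetical stand-ins. Diagram retrieval and the model call itself are omitted.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question: str, paragraphs: list[str], k: int = 1) -> list[str]:
    """Return the k paragraphs most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(paragraphs, key=lambda p: cosine(q, Counter(tokenize(p))), reverse=True)
    return ranked[:k]

def build_prompt(question: str, options: list[str], context: list[str]) -> str:
    """Assemble retrieved context, the question, and answer options into one prompt."""
    lines = ["Context:"] + context + ["", f"Question: {question}", "Options:"]
    lines += [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

# Hypothetical lesson paragraphs and question, for illustration only.
paragraphs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "Plate tectonics describes the movement of large plates of the Earth's crust.",
    "The water cycle moves water through evaporation, condensation, and precipitation.",
]
question = "Which process converts light energy into chemical energy?"
context = retrieve(question, paragraphs, k=1)
prompt = build_prompt(question, ["Photosynthesis", "Plate tectonics"], context)
print(prompt)
```

The reported accuracy drop for LLaMA 3.2-Vision on diagram questions suggests that what goes into the context block matters as much as the model, so a real pipeline would need to evaluate retrieval quality separately from answer accuracy.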
Deepfake Technology Unveiled: The Commoditization of AI and Its Impact on Digital Trust
Popa, Claudiu, Pallath, Rex, Cunningham, Liam, Tahiri, Hewad, Kesavarajah, Abiram, Wu, Tao
With the increasing accessibility of generative AI, tools for voice cloning, face-swapping, and synthetic media creation have advanced significantly, lowering both financial and technical barriers for their use. While these technologies present innovative opportunities, their rapid growth raises concerns about trust, privacy, and security. This white paper explores the implications of deepfake technology, analyzing its role in enabling fraud, misinformation, and the erosion of authenticity in multimedia. Using cost-effective, easy-to-use tools such as Runway, Rope, and ElevenLabs, we explore how realistic deepfakes can be created with limited resources, demonstrating the risks posed to individuals and organizations alike. By analyzing the technical and ethical challenges of deepfake mitigation and detection, we emphasize the urgent need for regulatory frameworks, public awareness, and collaborative efforts to maintain trust in digital media.
Exploring User Security and Privacy Attitudes and Concerns Toward the Use of General-Purpose LLM Chatbots for Mental Health
Kwesi, Jabari, Cao, Jiaxun, Manchanda, Riya, Emami-Naeini, Pardis
Individuals are increasingly relying on large language model (LLM)-enabled conversational agents for emotional support. While prior research has examined privacy and security issues in chatbots specifically designed for mental health purposes, these chatbots are overwhelmingly "rule-based" offerings that do not leverage generative AI. Little empirical research currently measures users' privacy and security concerns, attitudes, and expectations when using general-purpose LLM-enabled chatbots to manage and improve mental health. Through 21 semi-structured interviews with U.S. participants, we identified critical misconceptions and a general lack of risk awareness. Participants conflated the human-like empathy exhibited by LLMs with human-like accountability and mistakenly believed that their interactions with these chatbots were safeguarded by the same regulations (e.g., HIPAA) as disclosures with a licensed therapist. We introduce the concept of "intangible vulnerability," where emotional or psychological disclosures are undervalued compared to more tangible forms of information (e.g., financial or location-based data). To address this, we propose recommendations to safeguard user mental health disclosures with general-purpose LLM-enabled chatbots more effectively.
Thinking Machines Lab Raises a Record $2 Billion, Announces Cofounders
Thinking Machines Lab, an artificial intelligence company founded by top researchers who fled OpenAI, has raised a record $2 billion seed round that values the fledgling firm at $12 billion. The funding round was led by Andreessen Horowitz and included Nvidia, Accel, Cisco, and AMD--among others. The mammoth investment reflects the ultracompetitive race to build advanced AI systems, as well as the premium placed on top AI talent. It is the largest seed funding round in history. Thinking Machines is led by CEO Mira Murati, who stepped down as OpenAI's chief technology officer last September.
What It's Like to Be a Student Who Hates ChatGPT
As a classically trained singer preparing for a professional career, Erin Perry can see quite clearly how artificial intelligence is upending her field--all the way down to the classroom. Perry just completed her first year as a graduate student in voice performance at the Peabody Institute, the prestigious music conservatory run by Johns Hopkins University. It's been rewarding so far: She's been learning how to navigate the modern classical music sector and confronting the relevant impacts of generative A.I., having taken on a project to study the major record labels' lawsuit against the Amazon-backed A.I. startup Anthropic, which trained its models on songwriters' lyrics sans permission or compensation. Understandably, Perry's rather skeptical of A.I.'s artistic applications, and fearful of the sweeping effects it could have on her chosen field, especially as generative-music startups like Suno and Udio are programmed to replicate specific artists and musical styles.
A Generalization Theory for Zero-Shot Prediction
In 2021, OpenAI shocked the world by improving the zero-shot classification accuracy on ImageNet from 11.5% to 76.2% via the CLIP series of models (Radford et al., 2021). This event redefined the goal of zero-shot prediction from producing models that generalized to unseen classes to those that generalized to unseen tasks entirely. Two fundamental drivers of CLIP's success were 1) the use of natural language as a medium for representing arbitrary classes (as in the previous state-of-the-art Visual N-grams (Li et al., 2017)), and 2) a massive, yet carefully designed pre-training set which significantly impacted downstream performance (Radford et al., 2021; Fang et al., 2023; Xu et al., 2024). Despite the remarkable success of these foundation model-based pipelines (Bommasani et al., 2022), there are unique components of zero-shot prediction that warrant investigation from a theoretical point of view. To clarify these gaps, we contrast zero-shot prediction (ZSP) with the related setting of few-shot learning (FSL). Let $x \in \mathcal{X}$ denote an input (often an image) that accompanies a discrete value $y \in \mathcal{Y}$ (often a class label).
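The idea of using natural language to represent arbitrary classes can be sketched as follows: each class is described by a text prompt, and the prediction is the prompt whose embedding is closest to the image embedding. This is a minimal illustration, not the paper's formalism. The 3-dimensional embeddings, the prompts, and the temperature value are hypothetical stand-ins for the outputs of pretrained image and text encoders.

```python
import math

def l2_normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_predict(image_emb, class_text_embs, temperature=0.01):
    """Pick the class prompt whose embedding is most similar to the image embedding.

    Returns the predicted prompt and a softmax distribution over all prompts.
    """
    img = l2_normalize(image_emb)
    logits = {}
    for prompt, emb in class_text_embs.items():
        txt = l2_normalize(emb)
        logits[prompt] = sum(a * b for a, b in zip(img, txt)) / temperature
    # Numerically stable softmax over the scaled similarities.
    top = max(logits.values())
    exps = {p: math.exp(v - top) for p, v in logits.items()}
    total = sum(exps.values())
    probs = {p: e / total for p, e in exps.items()}
    return max(probs, key=probs.get), probs

# Hypothetical embeddings; real ones would come from trained encoders.
class_text_embs = {
    "a photo of a dog": [1.0, 0.0, 0.0],
    "a photo of a cat": [0.0, 1.0, 0.0],
    "a photo of a car": [0.0, 0.0, 1.0],
}
image_emb = [0.9, 0.1, 0.0]

label, probs = zero_shot_predict(image_emb, class_text_embs)
print(label)
```

No labeled examples of the target classes are used at prediction time, which is the point of contrast with few-shot learning: new classes are added by writing new prompts rather than by collecting data.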
An Empirical Evaluation of AI-Powered Non-Player Characters' Perceived Realism and Performance in Virtual Reality Environments
Korkiakoski, Mikko, Sheikhi, Saeid, Nyman, Jesper, Saariniemi, Jussi, Tapio, Kalle, Kostakos, Panos
Advancements in artificial intelligence (AI) have significantly enhanced the realism and interactivity of non-player characters (NPCs) in virtual reality (VR), creating more engaging and believable user experiences. This paper evaluates AI-driven NPCs within a VR interrogation simulator, focusing on their perceived realism, usability, and system performance. The simulator features two AI-powered NPCs, a suspect and a partner, which use GPT-4 Turbo to engage participants in a scenario to determine the suspect's guilt or innocence. A user study with 18 participants assessed the system using the System Usability Scale (SUS), Game Experience Questionnaire (GEQ), and a Virtual Agent Believability Questionnaire, alongside latency measurements for speech-to-text (STT), text-to-speech (TTS), OpenAI GPT-4 Turbo, and overall (cycle) latency. Results showed an average cycle latency of 7 seconds, influenced by the increasing conversational context. Believability scored 6.67 out of 10, with high ratings in behavior, social relationships, and intelligence but moderate scores in emotion and personality. The system achieved a SUS score of 79.44, indicating good usability. These findings demonstrate the potential of large language models to improve NPC realism and interaction in VR while highlighting challenges in reducing system latency and enhancing emotional depth. This research contributes to the development of more sophisticated AI-driven NPCs, revealing the need for performance optimization to achieve increasingly immersive virtual experiences.
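For readers unfamiliar with how a SUS score such as 79.44 is derived, the sketch below applies the standard SUS scoring rule: ten items rated 1 to 5, odd-numbered items contribute (rating - 1), even-numbered items contribute (5 - rating), and the sum is multiplied by 2.5 to give a 0 to 100 score. The sample responses are hypothetical and are not the study's data. The study-level figure is assumed here to be the mean of per-participant scores, which is the usual practice.

```python
from statistics import mean

def sus_score(responses: list[int]) -> float:
    """Score one participant's ten SUS ratings (each 1-5) on the 0-100 scale."""
    if len(responses) != 10 or any(r < 1 or r > 5 for r in responses):
        raise ValueError("SUS needs exactly ten ratings between 1 and 5")
    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += rating - 1   # odd items are positively worded
        else:
            total += 5 - rating   # even items are negatively worded
    return total * 2.5

def study_sus(all_responses: list[list[int]]) -> float:
    """Average the per-participant SUS scores across a study."""
    return mean(sus_score(r) for r in all_responses)

# Hypothetical ratings from two participants, for illustration only.
participants = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
]
print(sus_score(participants[0]))
print(study_sus(participants))
```

Because of the alternating item wording, raw ratings cannot be averaged directly; the per-item transformation has to be applied first.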