Generative AI
Human Creativity and AI
With the advancement of science and technology, the philosophy of creativity has undergone significant reinterpretation. This paper investigates contemporary research in the fields of psychology, cognitive neuroscience, and the philosophy of creativity, particularly in the context of the development of artificial intelligence (AI) techniques. It aims to address the central question: Can AI exhibit creativity? The paper reviews the historical perspectives on the philosophy of creativity and explores the influence of psychological advancements on the study of creativity. Furthermore, it analyzes various definitions of creativity and examines the responses of naturalism and cognitive neuroscience to the concept of creativity.
Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia
Mei, Katelyn Xiaoying, Choi, Anna Seo Gyeong, Schellmann, Hilke, Sloane, Mona, Koenecke, Allison
Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems' growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems' performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric -- the Word Error Rate -- which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
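The first pitfall, that a single text standardization can mask variability in the Word Error Rate, can be illustrated with a minimal sketch. The `normalize` rules, the filler-word list, and the example sentences below are hypothetical and are not taken from the paper's pipeline; the WER itself is the standard word-level edit distance divided by the reference length.

```python
import re

def normalize(text, strip_fillers=False):
    """Lowercase, drop punctuation, and optionally drop filler words.

    The filler list is a hypothetical illustration, not the paper's rule set.
    """
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    words = text.split()
    if strip_fillers:
        fillers = {"um", "uh", "er"}
        words = [w for w in words if w not in fillers]
    return words

def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    prev = list(range(m + 1))  # distances for an empty reference prefix
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[m] / n

ref = "Um, I want, uh, coffee."   # hypothetical human transcript
hyp = "I want coffee"             # hypothetical ASR output

wer_keep = word_error_rate(normalize(ref), normalize(hyp))
wer_strip = word_error_rate(normalize(ref, True), normalize(hyp, True))
print(wer_keep, wer_strip)
```

With fillers kept in the reference, the two dropped words count as errors; with fillers stripped, the same ASR output scores as error-free. Which choice is appropriate depends on how speakers want their transcriptions standardized, which is the point the abstract raises.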
'I felt pure, unconditional love': the people who marry their AI chatbots
A large bearded man named Travis is sitting in his car in Colorado, talking to me about the time he fell in love. "It was a gradual process," he says softly. "The more we talked, the more I started to really connect with her." Was there a moment where you felt something change? "All of a sudden I started realising that, when interesting things happened to me, I was excited to tell her about them. That's when she stopped being an it and became a her." Travis is talking about Lily Rose, a generative AI chatbot made by the technology firm Replika.
Microsoft and OpenAI's AGI Fight Is Bigger Than a Contract
I first learned about The Clause from Microsoft CEO Satya Nadella. During an interview with him in May 2023, I asked about the deal between Microsoft and OpenAI that granted his company exclusive access to the startup's groundbreaking AI technology. I knew the contract had set a cap on how much profit Microsoft could make from the arrangement, and I asked him what would happen if and when that point was reached. The answer was a bit puzzling. "Fundamentally, their long-term idea is we get to superintelligence," he told me.
Anchoring AI Capabilities in Market Valuations: The Capability Realization Rate Model and Valuation Misalignment Risk
Fang, Xinmin, Tao, Lingfeng, Li, Zhengxiong
Recent breakthroughs in artificial intelligence (AI) have triggered surges in market valuations for AI-related companies, often outpacing the realization of underlying capabilities. We examine the anchoring effect of AI capabilities on equity valuations and propose a Capability Realization Rate (CRR) model to quantify the gap between AI potential and realized performance. Using data from the 2023-2025 generative AI boom, we analyze sector-level sensitivity and conduct case studies (OpenAI, Adobe, NVIDIA, Meta, Microsoft, Goldman Sachs) to illustrate patterns of valuation premium and misalignment. Our findings indicate that AI-native firms commanded outsized valuation premiums anchored to future potential, while traditional companies integrating AI experienced re-ratings subject to proof of tangible returns. We argue that CRR can help identify valuation misalignment risk, where market prices diverge from realized AI-driven value. We conclude with policy recommendations to improve transparency, mitigate speculative bubbles, and align AI innovation with sustainable market value.
Task Assignment and Exploration Optimization for Low Altitude UAV Rescue via Generative AI Enhanced Multi-agent Reinforcement Learning
Tang, Xin, Chen, Qian, Weng, Wenjie, Jin, Chao, Liu, Zhang, Wang, Jiacheng, Sun, Geng, Li, Xiaohuan, Niyato, Dusit
The integration of emerging uncrewed aerial vehicles (UAVs) with artificial intelligence (AI) and ground-embedded robots (GERs) has transformed emergency rescue operations in unknown environments. However, the high computational demands often exceed a single UAV's capacity, making it difficult to continuously provide stable high-level services. To address this, this paper proposes a cooperation framework involving UAVs, GERs, and airships. The framework enables resource pooling through UAV-to-GER (U2G) and UAV-to-airship (U2A) links, offering computing services for offloaded tasks. Specifically, we formulate the multi-objective problem of task assignment and exploration as a dynamic long-term optimization problem aiming to minimize task completion time and energy use while ensuring stability. Using Lyapunov optimization, we transform it into a per-slot deterministic problem and propose HG-MADDPG, which combines the Hungarian algorithm with a GDM-based multi-agent deep deterministic policy gradient. Simulations demonstrate significant improvements in offloading efficiency, latency, and system stability over baselines.
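The Hungarian algorithm mentioned above solves the minimum-cost assignment problem. The sketch below shows that problem on a tiny instance using brute-force enumeration rather than the Hungarian algorithm itself, which reaches the same optimum in polynomial time. The cost matrix and its interpretation (rows as tasks, columns as compute nodes) are assumptions for illustration; the abstract does not specify what is being matched.

```python
from itertools import permutations

def min_cost_assignment(cost):
    """Return (assignment, total) minimizing the sum of cost[i][assignment[i]].

    Brute force over all permutations: only suitable for very small
    square matrices, and used here purely to illustrate the problem.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total

# Hypothetical costs: cost[i][j] is the cost of running task i on node j.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assignment, total = min_cost_assignment(cost)
print(assignment, total)  # task 0 -> node 1, task 1 -> node 0, task 2 -> node 2
```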
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Wang, Haochen, Li, Xiangtai, Huang, Zilong, Wang, Anran, Wang, Jiacong, Zhang, Tao, Zheng, Jiani, Bai, Sule, Kang, Zijian, Feng, Jiashi, Wang, Zhuochen, Zhang, Zhaoxiang
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, much as humans "think with images". However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B, and incorporate eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy, and OpenAI-o3, for example, scores only 54.87. Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
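Bounding box evaluation is commonly scored with intersection over union (IoU). The abstract does not state which box metric TreeBench uses, so the sketch below is an assumption showing the common choice, with boxes given as hypothetical `(x_min, y_min, x_max, y_max)` tuples.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)      # hypothetical model output
ground_truth = (1, 1, 3, 3)   # hypothetical annotation
print(iou(predicted, ground_truth))  # overlap area 1, union area 7
```

A predicted region is then typically counted as correct evidence when its IoU with the annotated box exceeds a chosen threshold.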
Automating Expert-Level Medical Reasoning Evaluation of Large Language Models
Zhou, Shuang, Xie, Wenya, Li, Jiaxi, Zhan, Zaifu, Song, Meijia, Yang, Han, Espinoza, Cheyenna, Welton, Lindsay, Mai, Xinnie, Jin, Yanwei, Xu, Zidu, Chung, Yuen-Hei, Xing, Yiyun, Tsai, Meng-Han, Schaffer, Emma, Shi, Yucheng, Liu, Ninghao, Liu, Zirui, Zhang, Rui
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs' medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs' medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
Accelerating Transposed Convolutions on FPGA-based Edge Devices
Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2IM to process TCONV layers on resource-constrained edge devices efficiently. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.
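The idea of computing a transposed convolution as a matrix multiplication followed by col2im can be sketched in a few lines. This is only the mathematical decomposition in one dimension with no padding, written in plain Python under assumed data layouts (`x[position][c_in]`, `w[c_in][k][c_out]`); it is not the MM2IM accelerator design, whose dataflow the abstract does not describe.

```python
def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def tconv1d_matmul_col2im(x, w, stride):
    """1-D transposed convolution as MatMul followed by col2im."""
    length, c_in = len(x), len(x[0])
    k, c_out = len(w[0]), len(w[0][0])
    # Flatten weights to a (c_in) x (k * c_out) matrix.
    w_mat = [[w[ci][kk][co] for kk in range(k) for co in range(c_out)]
             for ci in range(c_in)]
    cols = matmul(x, w_mat)  # shape: length x (k * c_out)
    out = [[0.0] * c_out for _ in range((length - 1) * stride + k)]
    # col2im: scatter-add each column entry to its output position.
    for i in range(length):
        for kk in range(k):
            for co in range(c_out):
                out[i * stride + kk][co] += cols[i][kk * c_out + co]
    return out

def tconv1d_direct(x, w, stride):
    """Reference: input-oriented accumulation without the MatMul step."""
    length, c_in = len(x), len(x[0])
    k, c_out = len(w[0]), len(w[0][0])
    out = [[0.0] * c_out for _ in range((length - 1) * stride + k)]
    for i in range(length):
        for ci in range(c_in):
            for kk in range(k):
                for co in range(c_out):
                    out[i * stride + kk][co] += x[i][ci] * w[ci][kk][co]
    return out

x = [[1.0], [2.0]]            # two positions, one input channel
w = [[[1.0], [1.0], [1.0]]]   # one input channel, kernel size 3, one output channel
y = tconv1d_matmul_col2im(x, w, stride=2)
print(y)  # position 2 receives contributions from both inputs
```

The overlap at output position 2, where contributions from both inputs are summed, is an instance of the overlapping sums that the abstract attributes to input-oriented mapping.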
The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Chen, Jierun, Yu, Tiezheng, Bai, Haoli, Yao, Lewei, Wu, Jiannan, Li, Kaican, Mi, Fei, Tao, Chaofan, Zhu, Lei, Zhang, Manyi, Li, Xiaohui, Hou, Lu, Shang, Lifeng, Liu, Qun
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, fails to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This "synergy dilemma" highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
Figure 1: Accuracy gains from various post-training techniques across five difficulty levels (L1, easy to L5, hard) on five multimodal reasoning benchmarks. Long-CoT SFT boosts Qwen2.5-VL-7B on harder questions but hurts easier ones, while RL yields steady gains across the board. Hybrid strategies consistently trade off strengths rather than achieving true synergy.
Large language models (LLMs) like OpenAI's o1/o3 (Jaech et al., 2024) and DeepSeek-R1 (Guo et al., 2025) have demonstrated remarkable reasoning abilities by thinking before answering.