Generative AI
PALATE: Peculiar Application of the Law of Total Expectation to Enhance the Evaluation of Deep Generative Models
Dziarmaga, Tadeusz, Kądziołka, Marcin, Kasymov, Artur, Mazur, Marcin
Deep generative models (DGMs) have caused a paradigm shift in the field of machine learning, yielding noteworthy advancements in domains such as image synthesis, natural language processing, and other related areas. However, a comprehensive evaluation of these models that accounts for the trichotomy between fidelity, diversity, and novelty in generated samples remains a formidable challenge. A recently introduced solution that has emerged as a promising approach in this regard is the Feature Likelihood Divergence (FLD), a method that offers a theoretically motivated practical tool, yet also exhibits some computational challenges. In this paper, we propose PALATE, a novel enhancement to the evaluation of DGMs that addresses limitations of existing metrics. Our approach is based on a peculiar application of the law of total expectation to random variables representing accessible real data. When combined with the MMD baseline metric and DINOv2 feature extractor, PALATE offers a holistic evaluation framework that matches or surpasses state-of-the-art solutions while providing superior computational efficiency and scalability to large-scale datasets. Through a series of experiments, we demonstrate the effectiveness of the PALATE enhancement, contributing a computationally efficient, holistic evaluation approach that advances the assessment of DGMs, especially in detecting sample memorization and evaluating generalization capabilities.
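The abstract names MMD as the baseline metric but does not spell out how PALATE modifies it. As a rough illustration of the baseline only, the sketch below computes a biased squared-MMD estimate between two sets of feature vectors. The Gaussian (RBF) kernel, the bandwidth value, and the function names are assumptions made for this example; they are not taken from the paper, and the PALATE enhancement itself is not implemented here.

```python
import math
import random


def rbf(u, v, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs (u, v) with u in xs and v in ys."""
    total = sum(rbf(u, v, bandwidth) for u in xs for v in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(xs, ys, bandwidth=4.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )


if __name__ == "__main__":
    # Toy stand-ins for extracted features; real use would pass DINOv2 embeddings.
    rng = random.Random(0)
    real = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(40)]
    generated = [[rng.gauss(2.0, 1.0) for _ in range(4)] for _ in range(40)]
    print("MMD^2 (real vs. generated):", mmd2_biased(real, generated))
```

A larger value indicates that the two feature distributions are further apart under the chosen kernel. The pairwise loops are quadratic in the sample size, so a vectorized implementation would be preferable for anything beyond toy inputs.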
Generative AI in Knowledge Work: Design Implications for Data Navigation and Decision-Making
Yun, Bhada, Feng, Dana, Chen, Ace S., Nikzad, Afshin, Salehi, Niloufar
Our study of 20 knowledge workers revealed a common challenge: the difficulty of synthesizing unstructured information scattered across multiple platforms to make informed decisions. Drawing on their vision of an ideal knowledge synthesis tool, we developed Yodeai, an AI-enabled system, to explore both the opportunities and limitations of AI in knowledge work. Through a user study with 16 product managers, we identified three key requirements for Generative AI in knowledge work: adaptable user control, transparent collaboration mechanisms, and the ability to integrate background knowledge with external information. However, we also found significant limitations, including overreliance on AI, user isolation, and contextual factors outside the AI's reach. As AI tools become increasingly prevalent in professional settings, we propose design principles that emphasize adaptability to diverse workflows, accountability in personal and collaborative contexts, and context-aware interoperability to guide the development of human-centered AI systems for product managers and knowledge workers.
OpenAI's Sora Is Plagued by Sexist, Racist, and Ableist Biases
Despite recent leaps forward in image quality, the biases found in videos generated by AI tools, like OpenAI's Sora, are as conspicuous as ever. A WIRED investigation, which included a review of hundreds of AI-generated videos, has found that Sora's model perpetuates sexist, racist, and ableist stereotypes in its results. In Sora's world, everyone is good-looking. Pilots, CEOs, and college professors are men, while flight attendants, receptionists, and childcare workers are women. Disabled people are wheelchair users, interracial relationships are tricky to generate, and fat people don't run.
HH4AI: A methodological Framework for AI Human Rights impact assessment under the EU AI Act
Ceravolo, Paolo, Damiani, Ernesto, D'Amico, Maria Elisa, Erb, Bianca de Teffe, Favaro, Simone, Fiano, Nannerel, Gambatesa, Paolo, La Porta, Simone, Maghool, Samira, Mauri, Lara, Panigada, Niccolo, Vaquer, Lorenzo Maria Ratto, Tamborini, Marta A.
This paper introduces the HH4AI Methodology, a structured approach to assessing the impact of AI systems on human rights, focusing on compliance with the EU AI Act and addressing technical, ethical, and regulatory challenges. The paper highlights AI's transformative nature, driven by autonomy, data, and goal-oriented design, and how the EU AI Act promotes transparency, accountability, and safety. A key challenge is defining and assessing "high-risk" AI systems across industries, complicated by the lack of universally accepted standards and AI's rapid evolution. To address these challenges, the paper explores the relevance of ISO/IEC and IEEE standards, focusing on risk management, data quality, bias mitigation, and governance. It proposes a Fundamental Rights Impact Assessment (FRIA) methodology, a gate-based framework designed to isolate and assess risks through phases including an AI system overview, a human rights checklist, an impact assessment, and a final output phase. A filtering mechanism tailors the assessment to the system's characteristics, targeting areas like accountability, AI literacy, data governance, and transparency. The paper illustrates the FRIA methodology through a fictional case study of an automated healthcare triage service. The structured approach enables systematic filtering, comprehensive risk assessment, and mitigation planning, effectively prioritizing critical risks and providing clear remediation strategies. This promotes better alignment with human rights principles and enhances regulatory compliance.
A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models
Xie, Zuan, Xu, Yang, Xu, Hongli, Liao, Yunming, Yao, Zhiwei
Recent advancements in large language models (LLMs) have catalyzed a substantial surge in demand for LLM services. While traditional cloud-based LLM services satisfy high-accuracy requirements, they fall short in meeting critical demands for low delay and enhanced privacy. To address these limitations, we propose HAT, a novel device-cloud collaborative inference framework that leverages the complementary strengths of U-shaped inference and speculative decoding. HAT partitions the LLM into three submodels, and the input and output submodels, stacked with a lightweight adapter network, are deployed as a small language model (SLM) on each end device. Meanwhile, the middle submodel, encompassing the majority of the LLM's decoder layers, is hosted in the cloud to perform speculative decoding with on-device SLMs. During inference, HAT exchanges hidden states (rather than raw tokens) of input or draft tokens between devices and the cloud, thereby incurring substantial communication delays. Besides, processing hidden states of long prompts will exacerbate computation delays in the cloud, further compromising inference efficiency. To improve efficiency, we introduce a prompt chunking mechanism that segments long prompts into shorter chunks, enabling parallel transmission and processing. Furthermore, HAT is implemented to dynamically determine optimal chunk sizes for devices handling long prompts, thereby improving overall inference speed. Extensive experiments are conducted on a physical testbed comprising 30 NVIDIA Jetson devices and a server with 8 NVIDIA A6000 GPUs. Experimental results demonstrate that HAT achieves promising performance improvements, reducing TTFT by 41% to 54% and TBT by 41% to 77% compared to the baselines.
Recent advancements in large language models (LLMs) have revolutionized the field of natural language processing, demonstrating unprecedented capabilities across various tasks and triggering exponential growth of LLM services [1], [2]. For instance, OpenAI's ChatGPT provides various services, e.g., chat-based interaction and automated writing, to approximately 180 million users, and processes over 1.6 billion requests monthly [3]. The underlying architecture of LLM services mainly operates through an autoregressive process, which involves a prefill phase followed by a decode phase. In the prefill phase, the LLM processes all input prompt tokens simultaneously, leveraging parallel computation to generate the initial output token.
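The prompt chunking idea in the HAT abstract (split a long prompt so that transmission of one chunk overlaps with cloud processing of the previous one) can be illustrated with a toy two-stage pipeline model. The sketch below is not HAT's algorithm: the linear cost model, the parameter names (`tx_overhead`, `tx_per_token`, `compute_overhead`, `compute_per_token`), and the brute-force search over candidate chunk sizes are all assumptions introduced for illustration.

```python
def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def pipelined_latency(chunk_lengths, tx_overhead, tx_per_token,
                      compute_overhead, compute_per_token):
    """Finish time of a two-stage pipeline: send each chunk, then process it.

    Hypothetical cost model: each stage handles one chunk at a time, and each
    chunk costs a fixed overhead plus a per-token term.
    """
    tx_done = 0.0
    compute_done = 0.0
    for n in chunk_lengths:
        # The link sends chunks back to back.
        tx_done += tx_overhead + tx_per_token * n
        # Processing starts once the chunk has arrived and the previous one is done.
        compute_done = max(compute_done, tx_done) + compute_overhead + compute_per_token * n
    return compute_done


def best_chunk_size(num_tokens, candidates, **cost):
    """Pick the candidate chunk size with the lowest modeled latency."""
    tokens = list(range(num_tokens))

    def latency(size):
        lengths = [len(c) for c in split_into_chunks(tokens, size)]
        return pipelined_latency(lengths, **cost)

    return min(candidates, key=latency)


if __name__ == "__main__":
    cost = dict(tx_overhead=1.0, tx_per_token=1.0,
                compute_overhead=1.0, compute_per_token=1.0)
    print("chosen chunk size:", best_chunk_size(10, [5, 10], **cost))
```

Under this model, smaller chunks increase overlap between the two stages but pay the fixed per-chunk overheads more often, which is why a best chunk size exists at all.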
GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks
Krechetova, Varvara, Kochedykov, Denis
In this paper, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.
Collaborating with AI Agents: Field Experiments on Teamwork, Productivity, and Performance
To uncover how AI agents change productivity, performance, and work processes, we introduce MindMeld: an experimentation platform enabling humans and AI agents to collaborate in integrative workspaces. In a large-scale marketing experiment on the platform, 2310 participants were randomly assigned to human-human and human-AI teams, with randomized AI personality traits. The teams exchanged 183,691 messages, and created 63,656 image edits, 1,960,095 ad copy edits, and 10,375 AI-generated images while producing 11,138 ads for a large think tank. Analysis of fine-grained communication, collaboration, and workflow logs revealed that collaborating with AI agents increased communication by 137% and allowed humans to focus 23% more on text and image content generation messaging and 20% less on direct text editing. Humans on Human-AI teams sent 23% fewer social messages, creating 60% greater productivity per worker and higher-quality ad copy. In contrast, human-human teams produced higher-quality images, suggesting that AI agents require fine-tuning for multimodal workflows. AI personality prompt randomization revealed that AI traits can complement human personalities to enhance collaboration. For example, conscientious humans paired with open AI agents improved image quality, while extroverted humans paired with conscientious AI agents reduced the quality of text, images, and clicks. In field tests of ad campaigns with ~5M impressions, ads with higher image quality produced by human collaborations and higher text quality produced by AI collaborations performed significantly better on click-through rate and cost per click metrics. Overall, ads created by human-AI teams performed similarly to those created by human-human teams. Together, these results suggest AI agents can improve teamwork and productivity, especially when tuned to complement human traits.
Adoption of Watermarking for Generative AI Systems in Practice and Implications under the new EU AI Act
Rijsbosch, Bram, van Dijck, Gijs, Kollnig, Konrad
AI-generated images have become so good in recent years that individuals can no longer distinguish them from "real" images. This development creates a series of societal risks, and challenges our perception of what is true and what is not, particularly with the emergence of "deep fakes" that impersonate real individuals. Watermarking, a technique that involves embedding identifying information within images to indicate their AI-generated nature, has emerged as a primary mechanism to address the risks posed by AI-generated images. The implementation of watermarking techniques is now becoming a legal requirement in many jurisdictions, including under the new 2024 EU AI Act. Despite the widespread use of AI image generation systems, the current status of watermarking implementation remains largely unexamined. Moreover, the practical implications of the AI Act's watermarking requirements have not previously been studied. The present paper therefore both provides an empirical analysis of 50 of the most widely used AI systems for image generation, and embeds this empirical analysis into a legal analysis of the AI Act. We identify four categories of generative AI image systems relevant under the AI Act, outline the legal obligations for each category, and find that only a minority of providers currently implement adequate watermarking practices.
Scalable physics-informed deep generative model for solving forward and inverse stochastic differential equations
Zhou, Shaoqian, You, Wen, Guo, Ling, Meng, Xuhui
Physics-informed deep learning approaches have been developed to solve forward and inverse stochastic differential equation (SDE) problems with high-dimensional stochastic space. However, the existing deep learning models have difficulties solving SDEs with high-dimensional spatial space. In the present study, we propose a scalable physics-informed deep generative model (sPI-GeM), which is capable of solving SDE problems with both high-dimensional stochastic and spatial space. The sPI-GeM consists of two deep learning models, i.e., (1) physics-informed basis networks (PI-BasisNet), which are used to learn the basis functions as well as the coefficients given data on a certain stochastic process or random field, and (2) physics-informed deep generative model (PI-GeM), which learns the distribution over the coefficients obtained from the PI-BasisNet. The new samples for the learned stochastic process can then be obtained using the inner product between the output of the generator and the basis functions from the trained PI-BasisNet. The sPI-GeM addresses the scalability in the spatial space in a similar way as in the widely used dimensionality reduction technique, i.e., principal component analysis (PCA). A series of numerical experiments, including approximation of Gaussian and non-Gaussian stochastic processes, forward and inverse SDE problems, are performed to demonstrate the accuracy of the proposed model. Furthermore, we also show the scalability of the sPI-GeM in both the stochastic and spatial space using an example of a forward SDE problem with 38- and 20-dimension stochastic and spatial space, respectively.
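The sampling step described in the sPI-GeM abstract, an inner product between generated coefficients and learned basis functions, has the form u(x) = sum_k c_k * phi_k(x). The sketch below shows only that final combination step. The sine basis stands in for the trained PI-BasisNet and the independent Gaussian sampler stands in for the trained generator; both stand-ins, and all function names, are assumptions made for this example and are not the paper's models.

```python
import math
import random


def basis_values(x, num_basis):
    """Hypothetical fixed basis phi_k(x) = sin((k + 1) * pi * x), k = 0..num_basis-1."""
    return [math.sin((k + 1) * math.pi * x) for k in range(num_basis)]


def sample_coefficients(rng, num_basis):
    """Stand-in generator: independent Gaussian coefficients with decaying scale."""
    return [rng.gauss(0.0, 1.0 / (k + 1)) for k in range(num_basis)]


def evaluate_sample(coeffs, grid):
    """Evaluate u(x) = sum_k c_k * phi_k(x) at every grid point."""
    num_basis = len(coeffs)
    return [
        sum(c * phi for c, phi in zip(coeffs, basis_values(x, num_basis)))
        for x in grid
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    grid = [i / 10 for i in range(11)]
    coeffs = sample_coefficients(rng, num_basis=5)
    print(evaluate_sample(coeffs, grid))
```

The point of the decomposition is that the generator only has to model a short coefficient vector, however fine the spatial grid is, which is the same idea that makes PCA useful for dimensionality reduction.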
Strategic Prompt Pricing for AIGC Services: A User-Centric Approach
Li, Xiang, Luo, Bing, Huang, Jianwei, Luo, Yuan
The rapid growth of AI-generated content (AIGC) services has created an urgent need for effective prompt pricing strategies, yet current approaches overlook users' strategic two-step decision-making process in selecting and utilizing generative AI models. This oversight creates two key technical challenges: quantifying the relationship between user prompt capabilities and generation outcomes, and optimizing platform payoff while accounting for heterogeneous user behaviors. We address these challenges by introducing prompt ambiguity, a theoretical framework that captures users' varying abilities in prompt engineering, and developing an Optimal Prompt Pricing (OPP) algorithm. Our analysis reveals a counterintuitive insight: users with higher prompt ambiguity (i.e., lower capability) exhibit non-monotonic prompt usage patterns, first increasing then decreasing with ambiguity levels, reflecting complex changes in marginal utility. Experimental evaluation using a character-level GPT-like model demonstrates that our OPP algorithm achieves up to 31.72% improvement in platform payoff compared to existing pricing mechanisms, validating the importance of user-centric prompt pricing in AIGC services.