Generative AI
ConsumerBench: Benchmarking Generative AI Applications on End-User Devices
Gu, Yile, Kadekodi, Rohan, Nguyen, Hoang, Kamahori, Keisuke, Liu, Yiyu, Kasikci, Baris
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.
Software Reuse in the Generative AI Era: From Cargo Cult Towards AI Native Software Engineering
Mikkonen, Tommi, Taivalsaari, Antero
Software development is currently undergoing a paradigm shift in which Artificial Intelligence (AI) - in particular Generative AI [6] - has taken an increasingly central role in assisting developers in their software creation activities. This is in essence a new form of software reuse in which collections of previously created software artifacts form the basis for generating new ones. Unlike in the past, when developers manually searched for pre-existing software components in libraries and code repositories such as GitHub, Node Package Manager (NPM) or the Python Package Index (PyPI), in the new model developers ask AI-driven assistants to generate suitable pieces of code for them. These generated artifacts can range from small code snippets and module fragments to comprehensive application skeletons or in some cases fully functional applications or even complete end-to-end systems. This new generative approach to software reuse has resulted in a considerable mental model change for developers.
Two Sonification Methods for the MindCube
Liu, Fangzheng, Blanchard, Lancelot, Haddad, Don D., Paradiso, Joseph A.
In this work, we explore the musical interface potential of the MindCube, an interactive device designed to study emotions. Embedding diverse sensors and input devices, this interface resembles a fidget cube toy commonly used to help users relieve their stress and anxiety. As such, it is a particularly well-suited controller for musical systems that aim to help with emotion regulation. In this regard, we present two different mappings for the MindCube, with and without AI. With our generative AI mapping, we propose a way to infuse meaning within a latent space and techniques to navigate through it with an external controller. We discuss our results and propose directions for future work.
Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble
Kumthekar, Amit, Tilley, Zion, Duong, Henry, Patel, Bhargav, Magnoli, Michael, Omar, Ahmed, Nasser, Ahmed, Gharpure, Chaitanya, Reztzov, Yevgen
Despite the growing clinical adoption of large language models (LLMs), current approaches heavily rely on single model architectures. To overcome risks of obsolescence and rigid dependence on single model systems, we present a novel framework, termed the Consensus Mechanism. Mimicking clinical triage and multidisciplinary clinical decision-making, the Consensus Mechanism implements an ensemble of specialized medical expert agents enabling improved clinical decision making while maintaining robust adaptability. This architecture enables the Consensus Mechanism to be optimized for cost, latency, or performance, purely based on its interior model configuration. To rigorously evaluate the Consensus Mechanism, we employed three medical evaluation benchmarks: MedMCQA, MedQA, and MedXpertQA Text, and the differential diagnosis dataset, DDX+. On MedXpertQA, the Consensus Mechanism achieved an accuracy of 61.0% compared to 53.5% and 45.9% for OpenAI's O3 and Google's Gemini 2.5 Pro. Improvement was consistent across benchmarks with an increase in accuracy on MedQA ($\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 3.4\%$) and MedMCQA ($\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 9.1\%$). These accuracy gains extended to differential diagnosis generation, where our system demonstrated improved recall and precision (F1$_\mathrm{consensus}$ = 0.326 vs. F1$_{\mathrm{O3\text{-}high}}$ = 0.2886) and a higher top-1 accuracy for DDX (Top1$_\mathrm{consensus}$ = 52.0% vs. Top1$_{\mathrm{O3\text{-}high}}$ = 45.2%).
ShareGPT-4o-Image: Aligning Multimodal Models with GPT-4o-Level Image Generation
Chen, Junying, Cai, Zhenyang, Chen, Pengcheng, Chen, Shunian, Ji, Ke, Wang, Xidong, Yang, Yunjin, Wang, Benyou
Recent advances in multimodal generative models have unlocked photorealistic, instruction-aligned image generation, yet leading systems like GPT-4o-Image remain proprietary and inaccessible. To democratize these capabilities, we present ShareGPT-4o-Image, the first dataset comprising 45K text-to-image and 46K text-and-image-to-image data, all synthesized using GPT-4o's image generation capabilities for distilling its advanced image generation abilities. Leveraging this dataset, we develop Janus-4o, a multimodal large language model capable of both text-to-image and text-and-image-to-image generation. Janus-4o not only significantly improves text-to-image generation over its predecessor, Janus-Pro, but also newly supports text-and-image-to-image generation. Notably, it achieves impressive performance in text-and-image-to-image generation from scratch, using only 91K synthetic samples and 6 hours of training on an 8 A800-GPU machine. We hope the release of ShareGPT-4o-Image and Janus-4o will foster open research in photorealistic, instruction-aligned image generation.
Elon Musk's Lawyers Claim He 'Does Not Use a Computer'
Elon Musk's lawyers claimed that he "does not use a computer" in a Sunday court filing related to his lawsuit against Sam Altman and OpenAI. However, Musk has posted pictures or referred to his laptop on X several times in recent months, and public evidence suggests that he owns and appears to use at least one computer. Musk and his artificial intelligence startup xAI sued OpenAI in February 2024, alleging the company committed breach of contract by abandoning its founding agreement to develop AI "for the benefit of humanity," choosing instead "to maximize profits for Microsoft." The Sunday court filing was submitted in opposition to a Friday filing from OpenAI, which accused Musk and xAI of failing to fully comply with the discovery process. OpenAI alleges that Musk's counsel does not plan to collect any documents from him.
Millions Use It Every Day. It's One of the Internet's Most Important Websites. Bots Are Destroying It, Piece by Piece.
In the years since ChatGPT's debut transformed Silicon Valley into an artificial intelligence hype factory, the internet's most vibrant communities have puzzled over how to adapt to the ensuing deluge of A.I. slop, especially as autogenerated outputs become more sophisticated. Perhaps no platform exemplifies this conundrum better than Reddit, the anonymized message-board network that's been connecting millions of humans across the world for 20 years now--as many users there increasingly wonder whether they are, indeed, still connecting with other humans. Such concerns aren't new, but they've been heightened thanks to a shocking exercise of A.I.-powered manipulation. In late April, the moderation team for the popular subreddit r/ChangeMyView disclosed that researchers from the University of Zurich had conducted an "unauthorized experiment" on community members that "deployed AI-generated comments to study how AI could be used to change views."
OpenAI takes down mentions of Jony Ive's io amid trademark row
OpenAI has taken down online content related to its recent deal with Sir Jony Ive's hardware startup, io, after a trademark complaint. The artificial intelligence company has removed promotional materials including a video where Ive – the former Apple designer behind the iPhone – and OpenAI's chief executive, Sam Altman, discuss the $6.4bn (£4.8bn) transaction. However, the nine-minute film can still be viewed on YouTube. OpenAI, the developer of ChatGPT, was forced to act after receiving a legal complaint from iyO, a startup that makes artificial intelligence-backed earbuds. OpenAI said it had taken down a page on its website announcing the company's acquisition of io, which will involve Ive's company taking on creative and design leadership across the combined businesses.
OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents
Abhyankar, Reyna, Qi, Qi, Zhang, Yiying
Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning and reflection account for the majority of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld-Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld-Human and find that even the highest-scoring agents on OSWorld take 1.4-2.7x more steps than necessary.
UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
Lin, Bin, Li, Zongjian, Cheng, Xinhua, Niu, Yuwei, Ye, Yang, He, Xianyi, Yuan, Shenghai, Yu, Wangbo, Wang, Shaodong, Ge, Yunyang, Pang, Yatian, Yuan, Li
Although existing unified models achieve strong performance in vision-language understanding and text-to-image generation, they remain limited in addressing image perception and manipulation -- capabilities increasingly demanded in practical applications. Recently, OpenAI introduced the powerful GPT-4o-Image model, which showcases advanced capabilities in comprehensive image perception and manipulation, sparking widespread interest. Through carefully designed experiments, we observe that GPT-4o-Image likely relies on semantic encoders rather than VAEs for feature extraction, despite VAEs being commonly regarded as crucial for image manipulation tasks. Inspired by this insight, we propose UniWorld-V1, a unified generative framework built upon semantic features extracted from powerful multimodal large language models and contrastive semantic encoders. Using only 2.7M training data, UniWorld-V1 achieves impressive performance across diverse tasks, including image understanding, generation, manipulation, and perception. We fully open-source the UniWorld-V1 framework, including model weights, training and evaluation scripts, and datasets to promote reproducibility and further research.