 Generative AI


What's Producible May Not Be Reachable: Measuring the Steerability of Generative Models

arXiv.org Artificial Intelligence

How should we evaluate the quality of generative models? Many existing metrics focus on a model's producibility, i.e. the quality and breadth of outputs it can generate. However, the actual value from using a generative model stems not just from what it can produce but from whether a user with a specific goal can produce an output that satisfies that goal. We refer to this property as steerability. In this paper, we first introduce a mathematical framework for evaluating steerability independently from producibility. Steerability is more challenging to evaluate than producibility because it requires knowing a user's goals. We address this issue by creating a benchmark task that relies on one key idea: sample an output from a generative model and ask users to reproduce it. We implement this benchmark in a large-scale user study of text-to-image models and large language models. Despite the ability of these models to produce high-quality outputs, they all perform poorly on steerability. This suggests that we need to focus on improving the steerability of generative models. We show such improvements are indeed possible: through reinforcement learning techniques, we create an alternative steering mechanism for image models that achieves more than 2x improvement on this benchmark.


In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

arXiv.org Artificial Intelligence

The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers' GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.


Measuring the Robustness of Audio Deepfake Detectors

arXiv.org Artificial Intelligence

Deepfakes have become a universal and rapidly intensifying concern of generative AI across various media types such as images, audio, and videos. Among these, audio deepfakes have been of particular concern due to the ease of high-quality voice synthesis and distribution via platforms such as social media and robocalls. Consequently, detecting audio deepfakes plays a critical role in combating the growing misuse of AI-synthesized speech. However, real-world scenarios often introduce various audio corruptions, such as noise, modification, and compression, that may significantly impact detection performance. This work systematically evaluates the robustness of 10 audio deepfake detection models against 16 common corruptions, categorized into noise perturbation, audio modification, and compression. Using both traditional deep learning models and state-of-the-art foundation models, we make four unique observations. First, our findings show that while most models demonstrate strong robustness to noise, they are notably more vulnerable to modifications and compression, especially when neural codecs are applied. Second, speech foundation models generally outperform traditional models across most scenarios, likely due to their self-supervised learning paradigm and large-scale pre-training. Third, our results show that increasing model size improves robustness, albeit with diminishing returns. Fourth, we demonstrate how targeted data augmentation during training can enhance model resilience to unseen perturbations. A case study on political speech deepfakes highlights the effectiveness of foundation models in achieving high accuracy under real-world conditions. These findings emphasize the importance of developing more robust detection frameworks to ensure reliability in practical deployment settings.
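As a rough illustration of the noise-perturbation category mentioned above, the following sketch adds white Gaussian noise to a waveform at a chosen signal-to-noise ratio. It is not the paper's evaluation code: the function names, the use of SNR in decibels as the severity knob, and the synthetic 440 Hz test tone are all assumptions made for the example.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Return a copy of `signal` with white Gaussian noise at roughly `snr_db` dB.

    Hypothetical helper for illustration; `signal` is a list of float samples.
    """
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    # SNR(dB) = 10 * log10(P_signal / P_noise), solved here for P_noise.
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


def measured_snr_db(clean, noisy):
    """Estimate the SNR in dB from a clean waveform and its corrupted copy."""
    signal_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone sampled at 16 kHz stands in for real speech.
sample_rate = 16000
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noisy = add_noise_at_snr(clean, snr_db=10.0)
print(round(measured_snr_db(clean, noisy), 1))
```

In a robustness study of this kind, a detector would typically be scored on both `clean` and `noisy` inputs across several SNR levels, and the drop in accuracy compared across corruption types.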


Man files complaint after ChatGPT falsely said he killed his children

BBC News

Hallucinations are one of the main problems computer scientists are trying to solve when it comes to generative AI. These are when chatbots present false information as facts. Earlier this year, Apple suspended its Apple Intelligence news summary tool in the UK after it hallucinated false headlines and presented them as real news. Google's AI Gemini has also fallen foul of hallucination - last year it suggested sticking cheese to pizza using glue, and said geologists recommend humans eat one rock per day. ChatGPT has changed its model since Mr Holmen's search in August 2024, and now searches current news articles when it looks for relevant information.


ChatGPT reportedly accused innocent man of murdering his children

Engadget

It has been over two years since ChatGPT exploded onto the world stage and, while OpenAI has advanced it in many ways, there are still quite a few hurdles. Now, Austrian advocacy group Noyb has filed its second complaint against OpenAI for such hallucinations, naming a specific instance in which ChatGPT reportedly -- and wrongly -- stated that a Norwegian man was a murderer. To make matters, somehow, even worse, when this man asked ChatGPT what it knew about him, it reportedly stated that he was sentenced to 21 years in prison for killing two of his children and attempting to murder his third. The hallucination was also sprinkled with real information, including the number of children he had, their genders and the name of his home town. Noyb claims that this response put OpenAI in violation of GDPR.


The Unbelievable Scale of AI's Pirated-Books Problem

The Atlantic - Technology

Editor's note: This analysis is part of The Atlantic's investigation into the Library Genesis data set. You can access the search tool directly here. Find The Atlantic's search tool for movie and television writing used to train AI here. When employees at Meta started developing their flagship AI model, Llama 3, they faced a simple ethical question. The program would need to be trained on a huge amount of high-quality writing to be competitive with products such as ChatGPT, and acquiring all of that text legally could take time.


World Knowledge from AI Image Generation for Robot Control

arXiv.org Artificial Intelligence

Real images encode a lot of information about the world, such as what an object can look like, how certain things can be meaningfully arranged, or which items belong together. The image of an average office desk can give us information about how the different parts are usually arranged in relation to each other, e.g. a monitor on the desk with mouse and keyboard in front of it and a chair in front of the desk, or the image of someone preparing a meal can give us information about which ingredients and kitchen tools are to be used. This might seem rather trivial from a human perspective, as we can easily handle such tasks without having to rely on pre-made example images to follow, but for a robot that has to navigate and solve tasks in e.g. a household environment such information can be critical for successfully handling everyday activities and interacting with the world. We could encode all relevant information explicitly into an extensive knowledge base [1] for the robot, but considering the number of tasks and circumstances that a robot could encounter, correctly handling all situations could become very challenging [2] or even overwhelming when the robot needs to act in widely different environments. Additional knowledge sources, such as simulations of the environment, when available, can help by providing ways to investigate consequences of actions without having to act in the world [3]. We could also try to train the robot on a variety of different tasks, e.g. using reinforcement learning or other methods [4], hoping that the robot is able to generalize and handle situations and circumstances that were never seen during training. However, images of the real world already show examples of what a dining table looks like with plates and cutlery, or how pictures are hung on the wall in bedrooms, dining rooms, or other places. Figure 1 shows an example of two different versions of how sandwich ingredients could be stacked together.


Big Help or Big Brother? Auditing Tracking, Profiling, and Personalization in Generative AI Assistants

arXiv.org Artificial Intelligence

Generative AI (GenAI) browser assistants integrate powerful capabilities of GenAI in web browsers to provide rich experiences such as question answering, content summarization, and agentic navigation. These assistants, available today as browser extensions, can not only track detailed browsing activity such as search and click data, but can also autonomously perform tasks such as filling forms, raising significant privacy concerns. It is crucial to understand the design and operation of GenAI browser extensions, including how they collect, store, process, and share user data. To this end, we study their ability to profile users and personalize their responses based on explicit or inferred demographic attributes and interests of users. We perform network traffic analysis and use a novel prompting framework to audit tracking, profiling, and personalization by the ten most popular GenAI browser assistant extensions. We find that instead of relying on local in-browser models, these assistants largely depend on server-side APIs, which can be auto-invoked without explicit user interaction. When invoked, they collect and share webpage content, often the full HTML DOM and sometimes even the user's form inputs, with their first-party servers. Some assistants also share identifiers and user prompts with third-party trackers such as Google Analytics. The collection and sharing continue even if a webpage contains sensitive information, such as health data, or personal information, such as a name or SSN entered in a web form. We find that several GenAI browser assistants infer demographic attributes such as age, gender, income, and interests and use this profile--which carries across browsing contexts--to personalize responses. In summary, our work shows that GenAI browser assistants can and do collect personal and sensitive information for profiling and personalization with little to no safeguards.


Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond

arXiv.org Artificial Intelligence

Recently, Test-Time Scaling Large Language Models (LLMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated exceptional capabilities across various domains and tasks, particularly in reasoning. While these models have shown impressive performance on general language tasks, their effectiveness in specialized fields such as law remains unclear. To address this, we present a preliminary evaluation of LLMs in various legal scenarios, covering both Chinese and English legal tasks. Our analysis includes 9 LLMs and 17 legal tasks, with a focus on newly published and more complex challenges such as multi-defendant legal judgments and legal argument reasoning. Our findings indicate that, despite DeepSeek-R1 and OpenAI o1 being among the most powerful models, their legal reasoning capabilities are still lacking. Specifically, these models score below 80% on seven Chinese legal reasoning tasks and below 80% on two English legal reasoning tasks. This suggests that, even among the most advanced reasoning models, legal reasoning abilities remain underdeveloped.


OpenAI's Deep Research Agent Is Coming for White-Collar Work

WIRED

Isla Fulford, a researcher at OpenAI, had a hunch that Deep Research would be a hit even before it was released. Fulford had helped build the artificial intelligence agent, which autonomously explores the web, deciding for itself what links to click, what to read, and what to collate into an in-depth report. OpenAI first made Deep Research available internally; whenever it went down, Fulford says, she was inundated with queries from colleagues eager to have it back. "The number of people who were DMing me made us pretty excited," says Fulford. Since going live to the public on February 2, Deep Research has proven to be a hit with many users outside the company too.