ARC Prize 2024: Technical Report
Chollet, Francois, Knoop, Mike, Kamradt, Gregory, Landers, Bryan
As of December 2024, the ARC-AGI benchmark is five years old and remains unbeaten. We believe it is currently the most important unsolved AI benchmark in the world because it seeks to measure generalization on novel tasks -- the essence of intelligence -- as opposed to skill at tasks that can be prepared for in advance. This year, we launched ARC Prize, a global competition to inspire new ideas and drive open progress towards AGI by reaching a target benchmark score of 85%. As a result, the state-of-the-art score on the ARC-AGI private evaluation set increased from 33% to 55.5%, propelled by several frontier AGI reasoning techniques including deep learning-guided program synthesis and test-time training. In this paper, we survey top approaches, review new open-source implementations, discuss the limitations of the ARC-AGI-1 dataset, and share key insights gained from the competition.
Engadget Podcast: We've survived two days of CES 2025
Devindra: We are here. What is this, the beginning of night one of CES, officially? I guess we have already suffered through basically day minus one. One thing I want our listeners to understand is that we have already seen a lot of things; we kind of know where CES is headed. And I think this is a cursed show, Cherlynn. How do you feel about that? Cherlynn: Yeah, I mean, Devindra, I'll let you speak to your situation, but we've had team members who have fallen deathly ill. We have also had people who have completely had to miss their flights, international flights. It's been quite Engadget team, but we have a really, really good team of people. Everyone's got great attitudes and, like, our spirits are high. You want to just get the stuff going.
Rescriber: Smaller-LLM-Powered User-Led Data Minimization for Navigating Privacy Trade-offs in LLM-Based Conversational Agent
Zhou, Jijie, Xu, Eryue, Wu, Yaoyao, Li, Tianshi
The proliferation of LLM-based conversational agents has resulted in excessive disclosure of identifiable or sensitive information. However, existing technologies fail to offer perceptible control or account for users' personal preferences about privacy-utility tradeoffs due to the lack of user involvement. To bridge this gap, we designed, built, and evaluated Rescriber, a browser extension that supports user-led data minimization in LLM-based conversational agents by helping users detect and sanitize personal information in their prompts. Our studies (N=12) showed that Rescriber helped users reduce unnecessary disclosure and addressed their privacy concerns. Users' subjective perceptions of the system powered by Llama3-8B were on par with those of the system powered by GPT-4o. The comprehensiveness and consistency of the detection and sanitization emerge as essential factors that affect users' trust and perceived protection. Our findings confirm the viability of smaller-LLM-powered, user-facing, on-device privacy controls, presenting a promising approach to address the privacy and trust challenges of AI.
ParetoLens: A Visual Analytics Framework for Exploring Solution Sets of Multi-objective Evolutionary Algorithms
Ma, Yuxin, Zhang, Zherui, Cheng, Ran, Jin, Yaochu, Tan, Kay Chen
In the domain of multi-objective optimization, evolutionary algorithms are distinguished by their capability to generate a diverse population of solutions that navigate the trade-offs inherent among competing objectives. This has catalyzed the ascension of evolutionary multi-objective optimization (EMO) as a prevalent approach. Despite the effectiveness of the EMO paradigm, the analysis of resultant solution sets presents considerable challenges. This is primarily attributed to the high-dimensional nature of the data and the constraints imposed by static visualization methods, which frequently culminate in visual clutter and impede interactive exploratory analysis. To address these challenges, this paper introduces ParetoLens, a visual analytics framework specifically tailored to enhance the inspection and exploration of solution sets derived from the multi-objective evolutionary algorithms. Utilizing a modularized, algorithm-agnostic design, ParetoLens enables a detailed inspection of solution distributions in both decision and objective spaces through a suite of interactive visual representations. This approach not only mitigates the issues associated with static visualizations but also supports a more nuanced and flexible analysis process. The usability of the framework is evaluated through case studies and expert interviews, demonstrating its potential to uncover complex patterns and facilitate a deeper understanding of multi-objective optimization solution sets. A demo website of ParetoLens is available at https://dva-lab.org/paretolens/.
Pax Americana persists: American freedoms and creativity have led to unrivaled prosperity throughout the world
In the ancient world, the Pax Romana was a legendary historical period during which the western world, under the influence of the Roman Empire, enjoyed 200 years of relative peace, stability and prosperity. Commencing with its founding under Caesar Augustus and ending with the death of Emperor Marcus Aurelius, the Pax Romana was marked by lower levels of violence, increasing trade and territorial expansion that saw peak Rome preside over around one-third of the global population. Since that time, there have been a number of similarly named eras, but none as dynamic as the current one: Pax Americana. Typically dated from the conclusion of World War II in 1945, the Pax Americana is the era of peace, prosperity and progress American power has offered the world since partnering with our allies to slay fascism and confront communism.
2024 digest of digests
So much has happened in the AI space over the course of the past 12 months. We've reported on some of the larger, and lesser-covered, stories in our regular monthly digests. We look back through the archives and pick out one story from each of our digests. Interview with Bo Li: A comprehensive assessment of trustworthiness in GPT models. Bo Li and colleagues won an outstanding datasets and benchmark track award at NeurIPS 2023 for their work DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. In this interview, Bo told us about the research, the team's methodology, and key findings. AAAI 2024 takes place. This month saw the running of the 38th Annual AAAI Conference.
RESTOR: Knowledge Recovery through Machine Unlearning
Rezaei, Keivan, Chandu, Khyathi, Feizi, Soheil, Choi, Yejin, Brahman, Faeze, Ravichander, Abhilasha
Large language models trained on web-scale corpora can memorize undesirable datapoints such as incorrect facts, copyrighted content or sensitive data. Recently, many machine unlearning algorithms have been proposed that aim to 'erase' these datapoints from trained models -- that is, revert model behavior to be similar to a model that had never been trained on these datapoints. However, evaluating the success of unlearning algorithms remains an open challenge. In this work, we propose the RESTOR framework for machine unlearning, which evaluates the ability of unlearning algorithms to perform targeted data erasure from models, by evaluating the ability of models to forget the knowledge introduced in these datapoints, while simultaneously recovering the model's knowledge state had it not encountered these datapoints. RESTOR helps uncover several novel insights about popular unlearning algorithms, and the mechanisms through which they operate -- for instance, identifying that some algorithms merely emphasize forgetting, and that localizing unlearning targets can enhance unlearning performance.
Digital Guardians: Can GPT-4, Perspective API, and Moderation API reliably detect hate speech in reader comments of German online newspapers?
Weber, Manuel, Huber, Moritz, Auch, Maximilian, Döschl, Alexander, Keller, Max-Emanuel, Mandl, Peter
In recent years, toxic content and hate speech have become widespread phenomena on the internet. Moderators of online newspapers and forums are now required, partly due to legal regulations, to carefully review and, if necessary, delete reader comments. This is a labor-intensive process. Some providers of large language models already offer solutions for automated hate speech detection or the identification of toxic content. These include GPT-4o from OpenAI, Jigsaw's (Google) Perspective API, and OpenAI's Moderation API. Based on the selected German test dataset HOCON34k, which was specifically created for developing tools to detect hate speech in reader comments of online newspapers, these solutions are compared with each other and against the HOCON34k baseline. The test dataset contains 1,592 annotated text samples. For GPT-4o, three different prompting strategies are used: zero-shot, one-shot, and few-shot. The results of the experiments demonstrate that GPT-4o outperforms both the Perspective API and the Moderation API, and exceeds the HOCON34k baseline by approximately 5 percentage points, as measured by a combined metric of MCC and F2-score.
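For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how the Matthews correlation coefficient (MCC) and the F2-score are computed from a binary confusion matrix. The abstract does not say how the two are combined into a single metric, so no combination is shown here. The function names and the example counts are hypothetical and are not taken from the paper or from HOCON34k.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score; beta=2 weights recall more heavily than precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts, for illustration only (not results from the paper).
tp, fp, fn, tn = 40, 10, 10, 40
print(round(mcc(tp, fp, fn, tn), 3))   # 0.6
print(round(f_beta(tp, fp, fn), 3))    # 0.8
```

The F2-score penalizes missed hate speech (false negatives) more than false alarms, while MCC stays informative when the classes are imbalanced; that may be why the two are reported together, though the abstract does not state the motivation.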
PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents
Lee, Jingoo, Lim, Kyungho, Jung, Young-Chul, Kim, Byung-Hoon
Recent advances in large language models (LLMs) have accelerated the development of conversational agents capable of generating human-like responses. Since psychiatric assessments typically involve complex conversational interactions between psychiatrists and patients, there is growing interest in developing LLM-based psychiatric assessment conversational agents (PACAs) that aim to simulate the role of psychiatrists in clinical evaluations. However, standardized methods for benchmarking the clinical appropriateness of PACAs' interaction with patients still remain underexplored. Here, we propose PSYCHE, a novel framework designed to enable the 1) clinically relevant, 2) ethically safe, 3) cost-efficient, and 4) quantitative evaluation of PACAs. This is achieved by simulating psychiatric patients based on a multi-faceted psychiatric construct that defines the simulated patients' profiles, histories, and behaviors, which PACAs are expected to assess. We validate the effectiveness of PSYCHE through a study with 10 board-certified psychiatrists, supported by an in-depth analysis of the simulated patient utterances.
Proactive Conversational Agents with Inner Thoughts
Liu, Xingyu Bruce, Fang, Shitao, Shi, Weiyan, Wu, Chien-Sheng, Igarashi, Takeo, Chen, Xiang 'Anthony'
One of the long-standing aspirations in conversational AI is to allow agents to autonomously take the initiative in conversations, i.e., to be proactive. This is especially challenging for multi-party conversations. Prior NLP research focused mainly on predicting the next speaker from contexts like preceding conversations. In this paper, we demonstrate the limitations of such methods and rethink what it means for AI to be proactive in multi-party, human-AI conversations. We propose that just like humans, rather than merely reacting to turn-taking cues, a proactive AI formulates its own inner thoughts during a conversation, and seeks the right moment to contribute. Through a formative study with 24 participants and inspiration from linguistics and cognitive psychology, we introduce the Inner Thoughts framework. Our framework equips AI with a continuous, covert train of thoughts in parallel to the overt communication process, which enables it to proactively engage by modeling its intrinsic motivation to express these thoughts. We instantiated this framework into two real-time systems: an AI playground web app and a chatbot. Through a technical evaluation and user studies with human participants, our framework significantly surpasses existing baselines on aspects like anthropomorphism, coherence, intelligence, and turn-taking appropriateness.