Boxing
Robots square off in world's first humanoid boxing match
Breakthroughs, discoveries, and DIY tips sent every weekday. After decades of being tortured, shoved, kicked, burned, and bludgeoned, robots are finally getting their chance to fight back. This weekend, Chinese robotics maker Unitree says it will livestream the world's first boxing match between two of its humanoid robots. The event, titled Unitree Iron Fist King: Awakening, will feature a face-off between two of Unitree's 4.3-foot-tall G1 robots. The robots will reportedly be remotely controlled by human engineers, though they are also expected to demonstrate some autonomous, pre-programmed actions as well.
Gandalf the Red: Adaptive Security for LLMs
Pfister, Niklas, Volhejn, Vรกclav, Knott, Manuel, Arias, Santiago, Baziลska, Julia, Bichurin, Mykhailo, Commike, Alan, Darling, Janet, Dienes, Peter, Fiedler, Matthew, Haber, David, Kraft, Matthias, Lancini, Marco, Mathys, Max, Pascual-Ortiz, Damiรกn, Podolak, Jakub, Romero-Lรณpez, Adriร , Shiarlis, Kyriacos, Signer, Andreas, Terek, Zsolt, Theocharis, Athanasios, Timbrell, Daniel, Trautwein, Samuel, Watts, Samuel, Wu, Natalie, Rojas-Carulla, Mateo
Current evaluations of defenses against prompt attacks in large language model (LLM) applications often overlook two critical factors: the dynamic nature of adversarial behavior and the usability penalties imposed on legitimate users by restrictive defenses. We propose D-SEC (Dynamic Security Utility Threat Model), which explicitly separates attackers from legitimate users, models multi-step interactions, and rigorously expresses the security-utility in an optimizable form. We further address the shortcomings in existing evaluations by introducing Gandalf, a crowd-sourced, gamified red-teaming platform designed to generate realistic, adaptive attack datasets. Using Gandalf, we collect and release a dataset of 279k prompt attacks. Complemented by benign user data, our analysis reveals the interplay between security and utility, showing that defenses integrated in the LLM (e.g., system prompts) can degrade usability even without blocking requests. We demonstrate that restricted application domains, defense-in-depth, and adaptive defenses are effective strategies for building secure and useful LLM applications. Code is available at \href{https://github.com/lakeraai/dsec-gandalf}{\texttt{https://github.com/lakeraai/dsec-gandalf}}.
Return of EM: Entity-driven Answer Set Expansion for QA Evaluation
Lee, Dongryeol, Lee, Minwoo, Min, Kyungmin, Park, Joonsuk, Jung, Kyomin
Recently, directly using large language models (LLMs) has been shown to be the most reliable method to evaluate QA models. However, it suffers from limited interpretability, high cost, and environmental harm. To address these, we propose to use soft exact match (EM) with entitydriven answer set expansion. Our approach expands the gold answer set to include diverse surface forms, based on the observation that the surface forms often follow particular patterns depending on the entity type. The experimental results show that our method outperforms traditional evaluation methods by a large margin. Moreover, the reliability of our evaluation method is comparable to that of LLM-based ones, while offering the benefits of high interpretability and reduced environmental harm.
Motion Generation from Fine-grained Textual Descriptions
The task of text2motion is to generate human motion sequences from given textual descriptions, where the model explores diverse mappings from natural language instructions to human body movements. While most existing works are confined to coarse-grained motion descriptions, e.g., "A man squats.", fine-grained descriptions specifying movements of relevant body parts are barely explored. Models trained with coarse-grained texts may not be able to learn mappings from fine-grained motion-related words to motion primitives, resulting in the failure to generate motions from unseen descriptions. In this paper, we build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D, by feeding GPT-3.5-turbo with step-by-step instructions with pseudo-code compulsory checks. Accordingly, we design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information. Our quantitative evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines. According to the qualitative evaluation and case study, our model outperforms MotionDiffuse in generating spatially or chronologically composite motions, by learning the implicit mappings from fine-grained descriptions to the corresponding basic motions. We release our data at https://github.com/KunhangL/finemotiondiffuse.
Modeling Unified Semantic Discourse Structure for High-quality Headline Generation
Xu, Minghui, Fei, Hao, Li, Fei, Wu, Shengqiong, Sun, Rui, Teng, Chong, Ji, Donghong
Headline generation aims to summarize a long document with a short, catchy title that reflects the main idea. This requires accurately capturing the core document semantics, which is challenging due to the lengthy and background information-rich na ture of the texts. In this work, We propose using a unified semantic discourse structure (S3) to represent document semantics, achieved by combining document-level rhetorical structure theory (RST) trees with sentence-level abstract meaning representation (AMR) graphs to construct S3 graphs. The hierarchical composition of sentence, clause, and word intrinsically characterizes the semantic meaning of the overall document. We then develop a headline generation framework, in which the S3 graphs are encoded as contextual features. To consolidate the efficacy of S3 graphs, we further devise a hierarchical structure pruning mechanism to dynamically screen the redundant and nonessential nodes within the graph. Experimental results on two headline generation datasets demonstrate that our method outperforms existing state-of-art methods consistently. Our work can be instructive for a broad range of document modeling tasks, more than headline or summarization generation.
GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Rein, David, Hou, Betty Li, Stickland, Asa Cooper, Petty, Jackson, Pang, Richard Yuanzhe, Dirani, Julien, Michael, Julian, Bowman, Samuel R.
We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.
PreWoMe: Exploiting Presuppositions as Working Memory for Long Form Question Answering
Han, Wookje, Park, Jinsol, Lee, Kyungjae
Information-seeking questions in long-form question answering (LFQA) often prove misleading due to ambiguity or false presupposition in the question. While many existing approaches handle misleading questions, they are tailored to limited questions, which are insufficient in a real-world setting with unpredictable input characteristics. In this work, we propose PreWoMe, a unified approach capable of handling any type of information-seeking question. The key idea of PreWoMe involves extracting presuppositions in the question and exploiting them as working memory to generate feedback and action about the question. Our experiment shows that PreWoMe is effective not only in tackling misleading questions but also in handling normal ones, thereby demonstrating the effectiveness of leveraging presuppositions, feedback, and action for real-world QA settings.
Aligning Language Models with Preferences through f-divergence Minimization
Go, Dongyoung, Korbak, Tomasz, Kruszewski, Germรกn, Rozen, Jos, Ryu, Nahyeon, Dymetman, Marc
Aligning language models with preferences can be posed as approximating a target distribution representing some desired behavior. Existing approaches differ both in the functional form of the target distribution and the algorithm used to approximate it. For instance, Reinforcement Learning from Human Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target distribution arising from a KL penalty in the objective. On the other hand, Generative Distributional Control (GDC) has an explicit target distribution and minimizes a forward KL from it using the Distributional Policy Gradient (DPG) algorithm. In this paper, we propose a new approach, f-DPG, which allows the use of any f-divergence to approximate any target distribution that can be evaluated. f-DPG unifies both frameworks (RLHF, GDC) and the approximation methods (DPG, RL with KL penalties). We show the practical benefits of various choices of divergence objectives and demonstrate that there is no universally optimal objective but that different divergences present different alignment and diversity trade-offs. We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin, leading to significant improvements over prior work. These distinguishing characteristics between divergences persist as the model size increases, highlighting the importance of selecting appropriate divergence objectives.
FedDebug: Systematic Debugging for Federated Learning Applications
Gill, Waris, Anwar, Ali, Gulzar, Muhammad Ali
In Federated Learning (FL), clients train a model locally and share it with a central aggregator to build a global model. Impermissibility to access client's data and collaborative training makes FL appealing for applications with data-privacy concerns such as medical imaging. However, these FL characteristics pose unprecedented challenges for debugging. When a global model's performance deteriorates, finding the round and the clients responsible is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the accuracy or let future FL rounds retune the model, which are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances the FL debugging on two novel fronts. First, FedDebug enables interactive debugging of realtime collaborative training in FL by leveraging record and replay techniques to construct a simulation that mirrors live FL. FedDebug's {\em breakpoint} can help inspect an FL state (round, client, and global model) and seamlessly move between rounds and clients' models, enabling a fine-grained step-by-step inspection. Second, FedDebug automatically identifies the client responsible for lowering global model's performance without any testing data and labels--both are essential for existing debugging techniques. FedDebug's strengths come from adapting differential testing in conjunction with neurons activations to determine the precise client deviating from normal behavior. FedDebug achieves 100\% to find a single client and 90.3\% accuracy to find multiple faulty clients. FedDebug's interactive debugging incurs 1.2\% overhead during training, while it localizes a faulty client in only 2.1\% of a round's training time. With FedDebug, we bring effective debugging practices to federated learning, improving the quality and productivity of FL application developers.