Gear News of the Week: The iPhone Air Is Surprisingly Repairable, and Gemini Comes to Google TV
Plus: Withings collabs with Clue to offer advanced women's cycle tracking, there's a new Balmuda toaster, and Shokz shows off Dolby Audio-powered open earbuds. Thinner, smaller gadgets are usually harder to repair due to their constrained space, but surprise, surprise, Apple's 5.6 mm-thin iPhone Air has earned a respectable 7/10 repair score from iFixit. A key factor was Apple relocating the logic board to create more space for the battery, making it easier to access.
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
Kim, Geon-Hyeong, Jang, Youngsoo, Kim, Yu Jin, Kim, Byoungjip, Lee, Honglak, Bae, Kyunghoon, Lee, Moontae
As Large Language Models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into Reinforcement Learning from Human Feedback (RLHF). However, these approaches tend to be complex, as they encompass complicated procedures in RLHF along with additional steps required by the safety constraints. Inspired by Direct Preference Optimization (DPO), we introduce a new algorithm called SafeDPO, which is designed to directly optimize the safety alignment objective in a single stage of policy learning, without requiring relaxation. SafeDPO introduces only one additional hyperparameter to further enhance safety and requires only minor modifications to standard DPO. As a result, it eliminates the need to fit separate reward and cost models or to sample from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to state-of-the-art safety alignment algorithms, both in terms of aligning with human preferences and improving safety.
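For context, the standard DPO objective that SafeDPO is described as modifying can be sketched for a single preference pair as below. This is a minimal illustration of plain DPO only: the abstract does not give SafeDPO's safety term or its extra hyperparameter, so neither is shown. The function and argument names are hypothetical, and the inputs are assumed to be summed log-probabilities of each response under the policy and under a frozen reference model.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    This is plain DPO, not SafeDPO: the safety-specific changes are
    not specified in the abstract and are deliberately left out.
    """
    # Implicit rewards: how much more likely each response is under
    # the policy than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return _softplus(-logits)
```

The loss needs only log-probabilities from the policy and the reference model, which is why DPO-style training avoids fitting a separate reward model or sampling from the language model during fine-tuning.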
WildIFEval: Instruction Following in the Wild
Lior, Gili, Yehudai, Asaf, Gera, Ariel, Ein-Dor, Liat
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models experience performance degradation with an increasing number of constraints, showing that all models have substantial room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
Non-literal Understanding of Number Words by Language Models
Tsvilodub, Polina, Gandhi, Kanishk, Zhao, Haoran, Fränken, Jan-Philipp, Franke, Michael, Goodman, Noah D.
Humans naturally interpret numbers non-literally, effortlessly combining context, world knowledge, and speaker intent. We investigate whether large language models (LLMs) interpret numbers similarly, focusing on hyperbole and pragmatic halo effects. Through systematic comparison with human data and computational models of pragmatic reasoning, we find that LLMs diverge from human interpretation in striking ways. By decomposing pragmatic reasoning into testable components, grounded in the Rational Speech Act framework, we pinpoint where LLM processing diverges from human cognition -- not in prior knowledge, but in reasoning with it. This insight leads us to develop a targeted solution -- chain-of-thought prompting inspired by an RSA model makes LLMs' interpretations more human-like. Our work demonstrates how computational cognitive models can both diagnose AI-human differences and guide development of more human-like language understanding capabilities.
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Ferraz, Thomas Palmeira, Mehta, Kartik, Lin, Yu-Hsiang, Chang, Haw-Shiuan, Oraby, Shereen, Liu, Sijia, Subramanian, Vivek, Chung, Tagyoung, Bansal, Mohit, Peng, Nanyun
Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.
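The decompose, critique, and refine loop described in this abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the `decrim` function, its callable parameters, and the rule-based `toy_*` stand-ins are all hypothetical. In a real system each callable would presumably wrap a model call.

```python
from typing import Callable, List


def decrim(
    instruction: str,
    generate: Callable[[str], str],
    decompose: Callable[[str], List[str]],
    critique: Callable[[str, str], bool],
    refine: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Hypothetical decompose-critique-refine loop."""
    constraints = decompose(instruction)   # split into individual constraints
    response = generate(instruction)       # initial draft
    for _ in range(max_rounds):
        # The critic decides which constraints the response violates.
        failed = [c for c in constraints if not critique(response, c)]
        if not failed:
            break                          # every constraint is satisfied
        response = refine(instruction, response, failed)
    return response


# Toy rule-based stand-ins so the sketch runs without any model.
def toy_decompose(instruction: str) -> List[str]:
    return ["no hashtag"]


def toy_generate(instruction: str) -> str:
    return "Big launch today! #excited"


def toy_critique(response: str, constraint: str) -> bool:
    if constraint == "no hashtag":
        return "#" not in response
    return True


def toy_refine(instruction: str, response: str, failed: List[str]) -> str:
    if "no hashtag" in failed:
        words = [w for w in response.split() if not w.startswith("#")]
        response = " ".join(words)
    return response


result = decrim(
    "Write a social media post with no hashtag.",
    toy_generate, toy_decompose, toy_critique, toy_refine,
)
print(result)
```

Passing the critic as a separate callable mirrors the abstract's distinction between weak and strong feedback: the loop stays the same while the quality of `critique` varies.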
A Dialogue Game for Eliciting Balanced Collaboration
Jeknić, Isidora, Schlangen, David, Koller, Alexander
Collaboration is an integral part of human dialogue. Typical task-oriented dialogue games assign asymmetric roles to the participants, which limits their ability to elicit naturalistic role-taking in collaboration and its negotiation. We present a novel and simple online setup that favors balanced collaboration: a two-player 2D object placement game in which the players must negotiate the goal state themselves. We show empirically that human players exhibit a variety of role distributions, and that balanced collaboration improves task performance. We also present an LLM-based baseline agent which demonstrates that automatic playing of our game is an interesting challenge for artificial systems.
Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models
Sarch, Gabriel, Wu, Yue, Tarr, Michael J., Fragkiadaki, Katerina
Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. However, fixed prompts fall short when it comes to parsing open-domain natural language and adapting to a user's idiosyncratic procedures, which are not known at prompt-engineering time. In this paper, we introduce HELPER, an embodied agent equipped with an external memory of language-program pairs that parses free-form human-robot dialogue into action programs through retrieval-augmented LLM prompting: relevant memories are retrieved based on the current dialogue, instruction, correction, or VLM description, and used as in-context prompt examples for LLM querying. The memory is expanded during deployment to include pairs of the user's language and action plans, to assist future inferences and personalize them to the user's language and routines. HELPER sets a new state-of-the-art in the TEACh benchmark in both Execution from Dialog History (EDH) and Trajectory from Dialogue (TfD), with a 1.7x improvement over the previous state-of-the-art for TfD. Our models, code, and video results can be found on our project website: https://helper-agent-llm.github.io.
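The retrieval-augmented prompting idea in this abstract (a memory of language-program pairs, queried for in-context examples and expanded during deployment) can be illustrated with a small sketch. Everything here is an assumption for illustration: the `ProgramMemory` class, the `build_prompt` helper, the program strings, and the bag-of-words cosine similarity, which stands in for whatever retrieval method the actual system uses.

```python
import math
from collections import Counter
from typing import List, Tuple


def _bow(text: str) -> Counter:
    """Bag-of-words vector (a stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ProgramMemory:
    """Hypothetical external memory of (language, program) pairs."""

    def __init__(self) -> None:
        self.pairs: List[Tuple[str, str]] = []

    def add(self, language: str, program: str) -> None:
        # Called during deployment to store a user's phrasing and its plan.
        self.pairs.append((language, program))

    def retrieve(self, query: str, k: int = 2) -> List[Tuple[str, str]]:
        q = _bow(query)
        ranked = sorted(
            self.pairs, key=lambda p: _cosine(q, _bow(p[0])), reverse=True
        )
        return ranked[:k]


def build_prompt(memory: ProgramMemory, query: str, k: int = 2) -> str:
    """Assemble retrieved pairs as in-context examples before the query."""
    blocks = [
        f"Instruction: {lang}\nProgram: {prog}"
        for lang, prog in memory.retrieve(query, k)
    ]
    blocks.append(f"Instruction: {query}\nProgram:")
    return "\n\n".join(blocks)


memory = ProgramMemory()
memory.add("make me a coffee", "make_coffee()")
memory.add("put the mug in the sink", "place('mug', 'sink')")
memory.add("slice the tomato", "slice('tomato')")
print(build_prompt(memory, "make a coffee please", k=1))
```

Because `add` can be called at any time, the set of candidate examples grows with use, which is the personalization mechanism the abstract describes.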
Amazon's new AI tool conjures fake backgrounds for real products
Amazon is rolling out a new beta feature that lets advertisers create AI-generated image backgrounds for products. The company describes it as "a generative AI solution designed to remove creative barriers" while boosting ad performance. "It's a perfect use for generative AI -- less effort and better outcomes," Colleen Aubrey, senior vice president of Amazon Ads Products and Technology, wrote Wednesday in an announcement blog post. The company views the feature as an ideal alternative to product shots in front of generic white backgrounds (or bad Photoshop jobs). Amazon says the process is easy and requires no technical expertise.
Integrating Symbolic Reasoning into Neural Generative Models for Design Generation
Jacobson, Maxwell Joseph, Xue, Yexiang
Design generation requires tight integration of neural and symbolic reasoning, as good design must meet explicit user needs and honor implicit rules for aesthetics, utility, and convenience. Current automated design tools driven by neural networks produce appealing designs, but cannot satisfy user specifications and utility requirements. Symbolic reasoning tools, such as constraint programming, cannot perceive low-level visual information in images or capture subtle aspects such as aesthetics. We introduce the Spatial Reasoning Integrated Generator (SPRING) for design generation. SPRING embeds a neural and symbolic integrated spatial reasoning module inside the deep generative network. The spatial reasoning module decides the locations of objects to be generated in the form of bounding boxes, which are predicted by a recurrent neural network and filtered by symbolic constraint satisfaction. Embedding symbolic reasoning into neural generation guarantees that the output of SPRING satisfies user requirements. Furthermore, SPRING offers interpretability, allowing users to visualize and diagnose the generation process through the bounding boxes. SPRING is also adept at managing novel user specifications not encountered during its training, thanks to its proficiency in zero-shot constraint transfer. Quantitative evaluations and a human study reveal that SPRING outperforms baseline generative models, excelling in delivering high design quality and better meeting user specifications.
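The idea of proposing bounding boxes and filtering them with symbolic constraints can be illustrated with a toy rejection-sampling sketch. This is a simplification under stated assumptions, not SPRING itself: a uniform random proposer stands in for the recurrent neural network, and the names `propose_box`, `left_of`, and `sample_valid_box` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) on a unit canvas
Constraint = Callable[[Box], bool]


def left_of(anchor: Box) -> Constraint:
    """Constraint: the box must lie entirely to the left of `anchor`."""
    return lambda box: box[0] + box[2] <= anchor[0]


def propose_box(rng: random.Random) -> Box:
    """Random proposal inside the canvas (stand-in for a neural proposer)."""
    w = rng.uniform(0.1, 0.4)
    h = rng.uniform(0.1, 0.4)
    x = rng.uniform(0.0, 1.0 - w)
    y = rng.uniform(0.0, 1.0 - h)
    return (x, y, w, h)


def sample_valid_box(
    constraints: List[Constraint],
    rng: random.Random,
    max_tries: int = 1000,
) -> Optional[Box]:
    """Return the first proposal that satisfies every constraint, else None."""
    for _ in range(max_tries):
        box = propose_box(rng)
        if all(check(box) for check in constraints):
            return box
    return None


sofa: Box = (0.6, 0.3, 0.3, 0.3)
lamp = sample_valid_box([left_of(sofa)], random.Random(0))
print(lamp)
```

Any box returned by `sample_valid_box` satisfies the constraints by construction, which loosely mirrors the abstract's claim that symbolic filtering guarantees the output meets user requirements. New constraints can be passed in without retraining the proposer.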
Raising the steaks! World's first AI-powered grill promises to cook the perfect steak in just 90 seconds - but it has an eye-watering $3,500 price tag
Whether it's too tough, burnt to a crisp or just dripping in fat, cooking steak on the outside grill rarely does the cut of meat justice. Thankfully, a British firm has created an artificial intelligence (AI)-powered grill that it claims makes a perfect steak in just 90 seconds under controlled conditions. Perfecta, from Birmingham-based firm Seergrills, cooks the meat as it's held in place vertically, like a piece of bread in a toaster, with ultra-hot grills on either side. It has AI-powered software called NeuralFire, which relies on data gathered from sensors inside the machine and cooking preferences inputted by the user. However, if you want to get hold of one you'd better start saving - the device has an eye-watering $3,500 price tag.