Gear News of the Week: The iPhone Air Is Surprisingly Repairable, and Gemini Comes to Google TV

WIRED

Plus: Withings collabs with Clue to offer advanced women's cycle tracking, there's a new Balmuda toaster, and Shokz shows off Dolby Audio-powered open earbuds. All products featured on WIRED are independently selected by our editors. However, we may receive compensation from retailers and/or from purchases of products through these links. Thinner, smaller gadgets are usually harder to repair due to their constrained space, but surprise, surprise, Apple's 5.6 mm-thin iPhone Air has earned a respectable 7/10 repair score from iFixit. A key factor in this was Apple relocating the logic board to create more space for the battery, making it easier to access.


WildIFEval: Instruction Following in the Wild

Lior, Gili, Yehudai, Asaf, Gera, Ariel, Ein-Dor, Liat

arXiv.org Artificial Intelligence

Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval - a large-scale dataset of 12K real user instructions with diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints in natural user prompts. We categorize these constraints into eight high-level classes to capture their distribution and dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. Our findings reveal that all evaluated models experience performance degradation with an increasing number of constraints; all models thus have considerable room for improvement on such tasks. Moreover, we observe that the specific type of constraint plays a critical role in model performance. We release our dataset to promote further research on instruction-following under complex, realistic conditions.


Non-literal Understanding of Number Words by Language Models

Tsvilodub, Polina, Gandhi, Kanishk, Zhao, Haoran, Fränken, Jan-Philipp, Franke, Michael, Goodman, Noah D.

arXiv.org Artificial Intelligence

Humans naturally interpret numbers non-literally, effortlessly combining context, world knowledge, and speaker intent. We investigate whether large language models (LLMs) interpret numbers similarly, focusing on hyperbole and pragmatic halo effects. Through systematic comparison with human data and computational models of pragmatic reasoning, we find that LLMs diverge from human interpretation in striking ways. By decomposing pragmatic reasoning into testable components, grounded in the Rational Speech Act framework, we pinpoint where LLM processing diverges from human cognition -- not in prior knowledge, but in reasoning with it. This insight leads us to develop a targeted solution -- chain-of-thought prompting inspired by an RSA model makes LLMs' interpretations more human-like. Our work demonstrates how computational cognitive models can both diagnose AI-human differences and guide development of more human-like language understanding capabilities.
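The Rational Speech Act (RSA) framework the abstract refers to models interpretation as recursive probabilistic reasoning between a speaker and a listener. A toy version can be sketched as follows; the states, priors, and utterances here are illustrative placeholders, not the paper's actual experimental setup.

```python
# Toy RSA computation: a literal listener L0, a pragmatic speaker S1,
# and a pragmatic listener L1 that reasons about the speaker.
# All numbers below are made-up illustration values.
import math

states = [30, 32, 1000000]        # hypothetical prices a speaker might mean
utterances = [30, 32, 1000000]    # number words the speaker might say
prior = {30: 0.45, 32: 0.45, 1000000: 0.10}

def literal(u, s):
    """Literal semantics: a number word denotes its exact value
    (tiny epsilon avoids log(0) downstream)."""
    return 1.0 if u == s else 1e-9

def L0(u):
    """Literal listener: condition the prior on the literal meaning."""
    scores = {s: literal(u, s) * prior[s] for s in states}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}

def S1(s, alpha=1.0, cost=lambda u: 0.0):
    """Pragmatic speaker: soft-max utility of informing the literal listener."""
    scores = {u: math.exp(alpha * (math.log(L0(u)[s]) - cost(u)))
              for u in utterances}
    z = sum(scores.values())
    return {u: v / z for u, v in scores.items()}

def L1(u):
    """Pragmatic listener: Bayesian inversion of the speaker model."""
    scores = {s: prior[s] * S1(s)[u] for s in states}
    z = sum(scores.values())
    return {s: v / z for s, v in scores.items()}
```

Decomposing interpretation into these stages is what lets the authors test priors and reasoning separately; a fuller model of hyperbole would also add an affect dimension to the state space.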


LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints

Ferraz, Thomas Palmeira, Mehta, Kartik, Lin, Yu-Hsiang, Chang, Haw-Shiuan, Oraby, Shereen, Liu, Sijia, Subramanian, Vivek, Chung, Tagyoung, Bansal, Mohit, Peng, Nanyun

arXiv.org Artificial Intelligence

Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions containing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances LLMs' ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with DeCRIM can outperform GPT-4 on both benchmarks.


A Dialogue Game for Eliciting Balanced Collaboration

Jeknić, Isidora, Schlangen, David, Koller, Alexander

arXiv.org Artificial Intelligence

Collaboration is an integral part of human dialogue. Typical task-oriented dialogue games assign asymmetric roles to the participants, which limits their ability to elicit naturalistic role-taking in collaboration and its negotiation. We present a novel and simple online setup that favors balanced collaboration: a two-player 2D object placement game in which the players must negotiate the goal state themselves. We show empirically that human players exhibit a variety of role distributions, and that balanced collaboration improves task performance. We also present an LLM-based baseline agent which demonstrates that automatic playing of our game is an interesting challenge for artificial systems.


Open-Ended Instructable Embodied Agents with Memory-Augmented Large Language Models

Sarch, Gabriel, Wu, Yue, Tarr, Michael J., Fragkiadaki, Katerina

arXiv.org Artificial Intelligence

Pre-trained and frozen large language models (LLMs) can effectively map simple scene rearrangement instructions to programs over a robot's visuomotor functions through appropriate few-shot example prompting. Fixed prompts, however, fall short when parsing open-domain natural language or adapting to a user's idiosyncratic procedures not known at prompt-engineering time. In this paper, we introduce HELPER, an embodied agent equipped with an external memory of language-program pairs that parses free-form human-robot dialogue into action programs through retrieval-augmented LLM prompting: relevant memories are retrieved based on the current dialogue, instruction, correction, or VLM description, and used as in-context prompt examples for LLM querying. The memory is expanded during deployment to include pairs of user's language and action plans, to assist future inferences and personalize them to the user's language and routines. HELPER sets a new state-of-the-art in the TEACh benchmark in both Execution from Dialog History (EDH) and Trajectory from Dialogue (TfD), with a 1.7x improvement over the previous state-of-the-art for TfD. Our models, code, and video results can be found in our project's website: https://helper-agent-llm.github.io.
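The retrieval-augmented prompting loop can be sketched as follows. The similarity function here is a toy word-overlap score standing in for a learned embedding, and all names are illustrative rather than HELPER's actual implementation.

```python
# Sketch of memory-augmented prompting: store (utterance, program) pairs,
# retrieve the nearest ones for a new request, use them as in-context
# examples. A real system would use embedding similarity, not Jaccard.

def similarity(a: str, b: str) -> float:
    """Toy similarity: Jaccard overlap of lowercased word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

class ExampleMemory:
    def __init__(self):
        self.pairs: list[tuple[str, str]] = []  # (utterance, program)

    def add(self, utterance: str, program: str) -> None:
        """Deployment-time expansion: remember a new language-program pair."""
        self.pairs.append((utterance, program))

    def retrieve(self, query: str, k: int = 2) -> list[tuple[str, str]]:
        """Return the k stored pairs most similar to the query."""
        ranked = sorted(self.pairs,
                        key=lambda p: similarity(query, p[0]),
                        reverse=True)
        return ranked[:k]

def build_prompt(query: str, memory: ExampleMemory) -> str:
    """Assemble retrieved pairs as few-shot examples for the LLM."""
    shots = "\n".join(f"User: {u}\nProgram: {p}"
                      for u, p in memory.retrieve(query))
    return f"{shots}\nUser: {query}\nProgram:"
```

Because the memory grows at deployment time, the same mechanism that retrieves examples also personalizes the agent to a user's phrasing and routines.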


Amazon's new AI tool conjures fake backgrounds for real products

Engadget

Amazon is rolling out a new beta feature that lets advertisers create AI-generated image backgrounds for products. The company describes it as "a generative AI solution designed to remove creative barriers" while boosting ad performance. "It's a perfect use for generative AI -- less effort and better outcomes," Colleen Aubrey, senior vice president of Amazon Ads Products and Technology, wrote Wednesday in an announcement blog post. The company views the feature as an ideal alternative to product shots in front of generic white backgrounds (or bad Photoshop jobs). Amazon says the process is easy and requires no technical expertise.


Integrating Symbolic Reasoning into Neural Generative Models for Design Generation

Jacobson, Maxwell Joseph, Xue, Yexiang

arXiv.org Artificial Intelligence

Design generation requires tight integration of neural and symbolic reasoning, as good design must meet explicit user needs and honor implicit rules for aesthetics, utility, and convenience. Current automated design tools driven by neural networks produce appealing designs, but cannot satisfy user specifications and utility requirements. Symbolic reasoning tools, such as constraint programming, cannot perceive low-level visual information in images or capture subtle aspects such as aesthetics. We introduce the Spatial Reasoning Integrated Generator (SPRING) for design generation. SPRING embeds a neural and symbolic integrated spatial reasoning module inside the deep generative network. The spatial reasoning module decides the locations of objects to be generated in the form of bounding boxes, which are predicted by a recurrent neural network and filtered by symbolic constraint satisfaction. Embedding symbolic reasoning into neural generation guarantees that the output of SPRING satisfies user requirements. Furthermore, SPRING offers interpretability, allowing users to visualize and diagnose the generation process through the bounding boxes. SPRING is also adept at managing novel user specifications not encountered during its training, thanks to its proficiency in zero-shot constraint transfer. Quantitative evaluations and a human study reveal that SPRING outperforms baseline generative models, excelling in delivering high design quality and better meeting user specifications.
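The neural-propose / symbolically-filter pattern at the core of this abstract can be sketched as follows. The candidate boxes here stand in for the recurrent network's predictions, and the two constraints (stay on canvas, no overlap) are illustrative examples of the symbolic rules, not SPRING's actual constraint set.

```python
# Sketch of proposal-plus-symbolic-filter placement: accept a candidate
# bounding box only if it satisfies hard constraints given boxes already
# placed. In SPRING the candidates come from an RNN; here they are given.

Box = tuple[int, int, int, int]  # (x, y, width, height)

def overlaps(a: Box, b: Box) -> bool:
    """Axis-aligned rectangle intersection test."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def satisfies(box: Box, placed: list[Box],
              canvas: tuple[int, int] = (100, 100)) -> bool:
    """Symbolic check: box lies on the canvas and overlaps nothing placed."""
    x, y, w, h = box
    inside = x >= 0 and y >= 0 and x + w <= canvas[0] and y + h <= canvas[1]
    return inside and not any(overlaps(box, p) for p in placed)

def place(candidates: list[Box]) -> list[Box]:
    """Greedily accept proposed boxes that pass the symbolic filter."""
    placed: list[Box] = []
    for box in candidates:
        if satisfies(box, placed):
            placed.append(box)
    return placed
```

Filtering proposals through hard constraints is what yields the guarantee the abstract claims: whatever the neural network proposes, the final layout cannot violate the stated rules.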


Raising the steaks! World's first AI-powered grill promises to cook the perfect steak in just 90 seconds - but it has an eye-watering $3,500 price tag

Daily Mail - Science & tech

Whether it's too tough, burnt to a crisp or just dripping in fat, cooking steak on the outdoor grill rarely does the cut of meat justice. Thankfully, a British firm has created an artificial intelligence (AI)-powered grill that it claims makes a perfect steak in just 90 seconds under controlled conditions. Perfecta, from Birmingham-based firm Seergrills, cooks the meat as it's held in place vertically, like a piece of bread in a toaster, with ultra-hot grills on either side. It has AI-powered software called NeuralFire, which relies on data gathered from sensors inside the machine and cooking preferences input by the user. However, if you want to get hold of one you'd better start saving - the device has an eye-watering $3,500 price tag.


AI put me in a 'South Park' episode

Engadget

It was just another day in South Park. The kids were making fun of each other on the playground, while the parents were all doing their best to maintain their sanity in the small Colorado town. And then there was me, a tech journalist going door-to-door warning about the impending AI apocalypse. No, I wasn't actually guest starring on the long-running TV series -- I was thrust into an episode entirely produced by the Showrunner AI model from The Simulation, the next iteration of the VR studio Fable. All it took was some audio of my voice (recorded during a call with The Simulation's CEO Edward Saatchi), a picture and a two-sentence prompt to produce the episode.