boulder
AutoLibra: Agent Metric Induction from Open-Ended Human Feedback
Zhu, Hao, Cuvin, Phil, Yu, Xinkai, Yan, Charlotte Ka Yee, Zhang, Jason, Yang, Diyi
Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation that transforms open-ended human feedback, *e.g.* "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback to an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used for prompting LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate the alignment of a set of (induced) metrics with open feedback: "coverage" and "redundancy". Through optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than the ones proposed in previous agent evaluation benchmarks and to discover new metrics for analyzing agents. We also present two applications of AutoLibra in agent improvement. First, we show that AutoLibra helps human prompt engineers diagnose agent failures and improve prompts iteratively. Second, we find that AutoLibra can induce metrics for automatic optimization of agents, which lets agents improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
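The abstract names the two meta-metrics but does not define them, so the following is only a sketch under an assumed reading: coverage is taken as the share of human feedback aspects that match at least one detected behavior trait, and redundancy as the share of detected traits that match no feedback aspect. The function name `meta_metrics` and the aspect/trait identifiers are hypothetical.

```python
def meta_metrics(aspects, traits, matches):
    """Return (coverage, redundancy) for one set of induced metrics.

    aspects: ids of aspects extracted from human feedback
    traits:  ids of behavior traits flagged by the metric-based judge
    matches: (aspect_id, trait_id) pairs judged to describe the same behavior

    ASSUMPTION: these definitions are illustrative, not taken from the abstract.
    """
    if not aspects or not traits:
        raise ValueError("need at least one aspect and one trait")
    matched_aspects = {a for a, _ in matches}
    matched_traits = {t for _, t in matches}
    # Share of feedback aspects explained by some detected trait.
    coverage = len(matched_aspects & set(aspects)) / len(aspects)
    # Share of detected traits that no feedback aspect mentions.
    redundancy = len(set(traits) - matched_traits) / len(traits)
    return coverage, redundancy


coverage, redundancy = meta_metrics(
    aspects=["a1", "a2", "a3", "a4"],
    traits=["t1", "t2", "t3"],
    matches={("a1", "t1"), ("a2", "t1"), ("a3", "t2")},
)
```

Under this reading, a good metric set pushes coverage up while keeping redundancy low, which is one way to interpret "optimizing these meta-metrics".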
- Asia > Middle East > Jordan (0.04)
- North America > United States > Virginia (0.04)
- North America > United States > Pennsylvania (0.04)
- (8 more...)
- Education (0.92)
- Leisure & Entertainment > Games (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.34)
What lies beneath: Scientists discover a giant granite slab half the size of WALES hidden under the West Antarctic Ice Sheet
- Europe > United Kingdom > Wales (0.41)
- Antarctica (0.41)
- North America > United States > Virginia (0.24)
- (24 more...)
- Media > Television (1.00)
- Media > Music (1.00)
- Media > Film (1.00)
- (7 more...)
- Information Technology > Communications > Social Media (1.00)
- Information Technology > Artificial Intelligence (1.00)
- Information Technology > Communications > Mobile (0.68)
Agentic Design of Compositional Machines
Zhang, Wenqian, Liu, Weiyang, Liu, Zhen
The design of complex machines stands as both a marker of human intelligence and a foundation of engineering practice. Given recent advances in large language models (LLMs), we ask whether they, too, can learn to create. We approach this question through the lens of compositional machine design: a task in which machines are assembled from standardized components to meet functional demands like locomotion or manipulation in a simulated physical environment. With this simplification, machine design is expressed as writing XML-like code that explicitly specifies pairwise part connections. To support this investigation, we introduce BesiegeField, a testbed built on the machine-building game Besiege, which enables part-based construction, physical simulation and reward-driven evaluation. Using BesiegeField, we benchmark state-of-the-art LLMs with agentic workflows and identify key capabilities required for success, including spatial reasoning, strategic assembly, and instruction-following. As current open-source models fall short, we explore reinforcement learning (RL) as a path to improvement: we curate a cold-start dataset, conduct RL finetuning experiments, and highlight open challenges at the intersection of language, machine design, and physical reasoning.
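To make "XML-like code that explicitly specifies pairwise part connections" concrete, here is a minimal sketch. The tag and attribute names (`machine`, `part`, `connect`, `parent`, `child`) are a hypothetical schema invented for illustration, not the BesiegeField format. The sketch parses the connections and checks that every part belongs to one connected assembly.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema, for illustration only (not the BesiegeField format).
MACHINE_XML = """
<machine>
  <part id="core" type="block"/>
  <part id="wheel_l" type="wheel"/>
  <part id="wheel_r" type="wheel"/>
  <connect parent="core" child="wheel_l"/>
  <connect parent="core" child="wheel_r"/>
</machine>
"""


def parse_connections(xml_text):
    """Return (set of part ids, list of (parent, child) connections)."""
    root = ET.fromstring(xml_text.strip())
    parts = {p.get("id") for p in root.findall("part")}
    edges = [(c.get("parent"), c.get("child")) for c in root.findall("connect")]
    for parent, child in edges:
        if parent not in parts or child not in parts:
            raise ValueError(f"connection references unknown part: {parent}->{child}")
    return parts, edges


def is_single_assembly(parts, edges):
    """True if all parts are reachable from one another via connections."""
    if not parts:
        return False
    neighbors = {p: set() for p in parts}
    for parent, child in edges:
        neighbors[parent].add(child)
        neighbors[child].add(parent)
    start = next(iter(parts))
    seen, stack = {start}, [start]
    while stack:  # depth-first traversal of the undirected connection graph
        node = stack.pop()
        for nxt in neighbors[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen == parts


parts, edges = parse_connections(MACHINE_XML)
connected = is_single_assembly(parts, edges)
```

A structural check like this is cheap to run before any physics simulation, so malformed designs can be rejected early.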
- North America > Mexico > Gulf of Mexico (0.14)
- Asia > China > Hong Kong (0.04)
- Asia > Vietnam > South China Sea (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
Towards Learning Boulder Excavation with Hydraulic Excavators
Gruetter, Jonas, Terenzi, Lorenzo, Egli, Pascal, Hutter, Marco
Construction sites frequently require removing large rocks before excavation or grading can proceed. Human operators typically extract these boulders using only standard digging buckets, avoiding time-consuming tool changes to specialized grippers. This task demands manipulating irregular objects with unknown geometries in harsh outdoor environments where dust, variable lighting, and occlusions hinder perception. The excavator must adapt to varying soil resistance--dragging along hard-packed surfaces or penetrating soft ground--while coordinating multiple hydraulic joints to secure rocks using a shovel. Current autonomous excavation focuses on continuous media (soil, gravel) or uses specialized grippers with detailed geometric planning for discrete objects. These approaches either cannot handle large irregular rocks or require impractical tool changes that interrupt workflow. We train a reinforcement learning policy in simulation using rigid-body dynamics and analytical soil models. The policy processes sparse LiDAR points (just 20 per rock) from vision-based segmentation and proprioceptive feedback to control standard excavator buckets. The learned agent discovers different strategies based on soil resistance: dragging along the surface in hard soil and penetrating directly in soft conditions. Field tests on a 12-ton excavator achieved 70% success across varied rocks (0.4-0.7m) and soil types, compared to 83% for human operators. This demonstrates that standard construction equipment can learn complex manipulation despite sparse perception and challenging outdoor conditions.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > Canada > British Columbia > Vancouver Island > Capital Regional District > Victoria (0.04)
- Asia > Middle East > Jordan (0.04)
Here's how to generate a truly random number with quantum physics
Very little in this life is truly random. A coin flip is influenced by the flipper's force, its surrounding airflow, and gravity. Similar variables dictate rolling a pair of dice or shuffling a deck of cards, while even classical computing's cryptographic algorithms are theoretically susceptible to outside influence or bias. "True randomness is something that nothing in the universe can predict in advance," explained Krister Shalm, a physicist at the National Institute of Standards and Technology (NIST).
Data-Driven Optimization of EV Charging Station Placement Using Causal Discovery
Junker, Julius Stephan, Hu, Rong, Li, Ziyue, Ketter, Wolfgang
This paper addresses the critical challenge of optimizing electric vehicle charging station placement through a novel data-driven methodology employing causal discovery techniques. While traditional approaches prioritize economic factors or power grid constraints, they often neglect empirical charging patterns that ultimately determine station utilization. We analyze extensive charging data from Palo Alto and Boulder (337,344 events across 100 stations) to uncover latent relationships between station characteristics and utilization. Applying structural learning algorithms (NOTEARS and DAGMA) to this data reveals that charging demand is primarily determined by three factors: proximity to amenities, EV registration density, and adjacency to high-traffic routes. These findings, consistent across multiple algorithms and urban contexts, challenge conventional infrastructure distribution strategies. We develop an optimization framework that translates these insights into actionable placement recommendations, identifying locations likely to experience high utilization based on the discovered dependency structures. The resulting site selection model prioritizes strategic clustering in high-amenity areas with substantial EV populations rather than uniform spatial distribution. Our approach contributes a framework that integrates empirical charging behavior into infrastructure planning, potentially enhancing both station utilization and user convenience. By focusing on data-driven insights instead of theoretical distribution models, we provide a more effective strategy for expanding charging networks that can adjust to various stages of EV market development.
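For readers unfamiliar with NOTEARS, its central idea is to turn the combinatorial "graph must be acyclic" requirement into a smooth constraint h(W) = tr(exp(W ∘ W)) - d = 0 on the weighted adjacency matrix W of d variables. The sketch below evaluates that function in pure Python with a truncated Taylor series; the function names and the toy matrices are illustrative, and nothing here reflects how the authors configured their experiments.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def acyclicity(w, terms=20):
    """NOTEARS-style h(W) = tr(exp(W ∘ W)) - d via a truncated Taylor series.

    h(W) is 0 when the weighted graph W has no directed cycle and
    positive otherwise.
    """
    d = len(w)
    sq = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]  # W ∘ W
    # term holds (W ∘ W)^k / k!, starting from the identity (k = 0).
    term = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    trace = float(d)
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, sq)]
        trace += sum(term[i][i] for i in range(d))
    return trace - d


# Toy example: a chain 0 -> 1 -> 2 (acyclic) versus a 2-cycle 0 <-> 1.
h_dag = acyclicity([[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.0, 0.0, 0.0]])
h_cycle = acyclicity([[0.0, 1.0], [1.0, 0.0]])
```

Because h is differentiable, structure learning can be posed as continuous optimization with h(W) = 0 as a constraint rather than a search over discrete graphs.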
- North America > United States > California > Santa Clara County > Palo Alto (0.28)
- North America > United States > Colorado > Boulder County > Boulder (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (2 more...)
- Transportation > Infrastructure & Services (1.00)
- Transportation > Ground > Road (1.00)
- Transportation > Electric Vehicle (1.00)
Learning to Control an Android Robot Head for Facial Animation
Heisler, Marcel, Becker-Asano, Christian
The ability to display rich facial expressions is crucial for human-like robotic heads. While manually defining such expressions is intricate, there already exist approaches to automatically learn them. In this work one such approach is applied to evaluate and control a robot head different from the one in the original study. To improve the mapping of facial expressions from human actors onto a robot head, it is proposed to use 3D landmarks and their pairwise distances as input to the learning algorithm instead of the previously used facial action units. Participants of an online survey preferred mappings from our proposed approach in most cases, though there are still further improvements required.
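As an illustration of the proposed input representation, the sketch below turns a list of 3D landmarks into a flat vector of pairwise Euclidean distances. The function name and the sample coordinates are hypothetical, and the abstract does not specify the authors' landmark detector, ordering, or normalization.

```python
from itertools import combinations
from math import dist


def pairwise_distance_features(landmarks):
    """Return distances between every unordered pair of 3D landmarks.

    n landmarks yield n * (n - 1) / 2 features, in the order produced by
    itertools.combinations.
    """
    return [dist(p, q) for p, q in combinations(landmarks, 2)]


# Three made-up landmarks (x, y, z); real face trackers return many more.
features = pairwise_distance_features([(0, 0, 0), (3, 4, 0), (0, 0, 12)])
```

Pairwise distances do not change when the whole face is rotated or translated, which is one plausible reason such features suit transferring expressions between different heads.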
- North America > United States > Colorado > Boulder County > Boulder (0.06)
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
- Asia > China > Shaanxi Province > Xi'an (0.05)
- (5 more...)
Deep learning waterways for rural infrastructure development
Pierson, Matthew, Mehrabi, Zia
Surprisingly, a number of Earth's waterways remain unmapped, with a significant number in low- and middle-income countries. Here we build a computer vision model (WaterNet) to learn the location of waterways in the United States, based on high-resolution satellite imagery and digital elevation models, and then deploy it in novel environments on the African continent. Our outputs provide detail of waterway structures hitherto unmapped. When assessed against community needs requests for rural bridge building related to access to schools, health care facilities and agricultural markets, we find these newly generated waterways capture on average 93% (country range: 88-96%) of these requests, whereas OpenStreetMap and the state-of-the-art data from TDX-Hydro capture only 36% (5-72%) and 62% (37-85%), respectively. Because these new machine-learning-enabled maps are built on public and operational data acquisition, this approach offers promise for capturing humanitarian needs and planning for social development in places where cartographic efforts have so far failed to deliver. The improved performance in identifying community needs missed by existing data suggests significant value for rural infrastructure development and better targeting of development interventions.
- Africa > Ethiopia (0.05)
- Africa > Rwanda (0.05)
- Africa > Côte d'Ivoire (0.05)
- (16 more...)
- Social Sector (0.66)
- Education (0.54)
- Energy > Renewable > Geothermal > Geothermal Energy Exploration and Development > Geophysical Analysis & Survey (0.35)
The Legend of Zelda: Echoes of Wisdom plays like a traditional Zelda game, remixed
The Legend of Zelda: Echoes of Wisdom feels like a kindred spirit to the 2019 remake of Link's Awakening, both in challenge and in vibes. It's a far cry from the incredibly intricate and complex worlds in Tears of the Kingdom, and while I only played for about 90 minutes (spread over two different parts of the game), I came away from the demo charmed by the gorgeous, tilt-shift art style. Not to mention being quite pleased to finally be playing as Zelda for the first time in the series that bears her damn name. And while plenty of adults will surely enjoy The Legend of Zelda: Echoes of Wisdom, it also feels tailor-made as an entry point for younger players. We already knew about the art style and playing as Zelda -- what was most important about this preview was that I got a chance to see just how Zelda's "echoes" worked in the game itself.
Value Internalization: Learning and Generalizing from Social Reward
Rong, Frieda, Kleiman-Weiner, Max
Social rewards shape human behavior. During development, a caregiver guides a learner's behavior towards culturally aligned goals and values. How do these behaviors persist and generalize when the caregiver is no longer present, and the learner must continue autonomously? Here, we propose a model of value internalization where social feedback trains an internal social reward (ISR) model that generates internal rewards when social rewards are unavailable. Through empirical simulations, we show that an ISR model prevents agents from unlearning socialized behaviors and enables generalization in out-of-distribution tasks. We characterize the implications of incomplete internalization, akin to "reward hacking" on the ISR. Additionally, we show that our model internalizes prosocial behavior in a multi-agent environment. Our work provides a foundation for understanding how humans acquire and generalize values and offers insights for aligning AI with human values.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > France (0.04)
- Asia > Middle East > Jordan (0.04)