

Petlibro Discount Codes and Deals: Save Up to 50%

WIRED

Save on Petlibro essentials, including automatic feeders, water fountains, and accessories to keep cats and dogs fed, hydrated, and comfortable every day. As the pet tech writer here on the WIRED Reviews team, I've tested over 100 pet-related products, including automatic pet feeders, pet water fountains, and pet cameras. The one brand I keep buying for myself--and recommending to friends and family with pets--is Petlibro. Petlibro dominates the game when it comes to high-tech, seamlessly designed automatic feeders and pet fountains. Most of their products have a connected app to make pet parenting easier, whether you're near or far.


AutoLibra: Agent Metric Induction from Open-Ended Human Feedback

Zhu, Hao, Cuvin, Phil, Yu, Xinkai, Yan, Charlotte Ka Yee, Zhang, Jason, Yang, Diyi

arXiv.org Artificial Intelligence

Agents are predominantly evaluated and optimized via task success metrics, which are coarse, rely on manual design from experts, and fail to reward intermediate emergent behaviors. We propose **AutoLibra**, a framework for agent evaluation that transforms open-ended human feedback, *e.g.* "If you find that the button is disabled, don't click it again" or "This agent has too much autonomy to decide what to do on its own", into metrics for evaluating fine-grained behaviors in agent trajectories. AutoLibra accomplishes this by grounding feedback in an agent's behavior, clustering similar positive and negative behaviors, and creating concrete metrics with clear definitions and concrete examples, which can be used to prompt LLM-as-a-Judge evaluators. We further propose two meta-metrics to evaluate the alignment of a set of induced metrics with open feedback: "coverage" and "redundancy". By optimizing these meta-metrics, we experimentally demonstrate AutoLibra's ability to induce more concrete agent evaluation metrics than those proposed in previous agent evaluation benchmarks, and to discover new metrics for analyzing agents. We also present two applications of AutoLibra in agent improvement: first, we show that AutoLibra helps human prompt engineers diagnose agent failures and improve prompts iteratively; second, we find that AutoLibra can induce metrics for automatic agent optimization, enabling agents to improve through self-regulation. Our results suggest that AutoLibra is a powerful task-agnostic tool for evaluating and improving language agents.
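The "coverage" and "redundancy" meta-metrics mentioned in the abstract can be sketched in a few lines. This is a hypothetical reading, not the paper's exact definitions: here each induced metric is represented simply as the set of feedback items it accounts for, coverage is the fraction of feedback explained by at least one metric, and redundancy is the average pairwise Jaccard overlap between metrics.

```python
# Hypothetical sketch of coverage/redundancy meta-metrics for induced agent
# metrics. Each metric is modeled as the set of feedback items it explains;
# the paper's actual formulation may differ.
from itertools import combinations

def coverage(feedback_items, metric_to_items):
    """Fraction of feedback items explained by at least one induced metric."""
    if not metric_to_items:
        return 0.0
    covered = set().union(*metric_to_items.values())
    return len(covered & set(feedback_items)) / len(feedback_items)

def redundancy(metric_to_items):
    """Average pairwise Jaccard overlap between induced metrics."""
    pairs = list(combinations(metric_to_items.values(), 2))
    if not pairs:
        return 0.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

For example, with four feedback items and two metrics covering {f1, f2} and {f2, f3}, coverage is 0.75 and redundancy is 1/3; optimizing would push coverage up and redundancy down.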


Wild cockatoos are learning how to use water fountains

Popular Science

Breakthroughs, discoveries, and DIY tips sent every weekday. Animals constantly adapt to their environments, but keeping up with humanity's dramatic influence on the natural world poses unique challenges. While this unfortunately ends in disaster for many species, some populations are figuring out new ways to navigate urban spaces. Back in 2022, wildlife biologists confirmed that a community of wild sulfur-crested cockatoos in Sydney, Australia had learned how to open the lids of curbside trash bins on garbage day in order to snack on locals' leftovers. But that's not all these birds can do.


The 21 Best Early Amazon Pet Day Deals (2025)

WIRED

Why not spoil your furry friend--and save some bones while you're at it too--with some of our favorite Amazon Pet Day deals. In the great tradition of Black Friday, Cyber Monday, and Amazon Prime Day, Amazon has expanded these savings extravaganzas to the pet tech sphere. As the pet tech writer here at WIRED, I have strong opinions about which (often pricey) pet gear is worth your hard-earned dough. From automatic litter boxes to toys, feeders to fountains, and even DNA testing kits and pet cameras, I've rounded up the best deals on WIRED-tested gear that I've seen so far below.


Most accurate space clock to launch – and count down to destruction

New Scientist

The most accurate clock in space launches within days and will begin building a highly synchronised network out of the best clocks on Earth. But the project, decades in preparation, will only operate for a few years before it burns up as the International Space Station deorbits at the end of the decade. The Atomic Clock Ensemble in Space (ACES) is a European Space Agency (ESA) mission that will generate a time signal with unprecedented accuracy and then transmit it via laser to nine ground stations as it passes overhead at 27,000 kilometres per hour. This network of clocks will be in extremely close synchronisation and provide highly accurate timekeeping around the world. As a result, ACES will be able to test Einstein's theory of general relativity, which says that the passing of time is affected by the strength of gravity, with great accuracy.


Two Cases of Deduction with Non-referring Descriptions

Raclavský, Jiří

arXiv.org Artificial Intelligence

Formal reasoning with non-denoting terms, especially non-referring descriptions such as "the King of France", is still an under-investigated area; a recent exception is a series of papers by, e.g., Indrzejczak, Zawidzki and Kürbis. The present paper offers an alternative to their approach: instead of free logic and sequent calculus, it is framed in partial type theory with natural deduction in sequent style. Using a Montague- and Tichý-style formalization of natural language, the paper successfully handles deduction with intensional transitives whose complements are non-referring descriptions, and derives Strawsonian rules for the existential presuppositions of sentences with such descriptions.


Assessing Language Models' Worldview for Fiction Generation

Khatun, Aisha, Brown, Daniel G.

arXiv.org Artificial Intelligence

The use of Large Language Models (LLMs) has become ubiquitous, with abundant applications in computational creativity. One such application is fictional story generation. Fiction is a narrative set in a story world slightly different from ours. With LLMs becoming writing partners, we question how suitable they are for generating fiction. This study investigates the ability of LLMs to maintain the consistent state of the world that fiction requires. Through a series of questions to nine LLMs, we find that only two models exhibit a consistent worldview, while the rest are self-conflicting. Subsequent analysis of stories generated by four models revealed a strikingly uniform narrative pattern. This uniformity across models further suggests a lack of the `state' necessary for fiction. We highlight the limitations of current LLMs in fiction writing and advocate for future research to test and create story worlds for LLMs to reside in. All code, the dataset, and the generated responses can be found at https://github.com/tanny411/llm-reliability-and-consistency-evaluation.


10 smart devices that make pet parenting easier

FOX News

Owning a pet can be a rewarding experience, but it can also come with challenges. In celebration of National Pet Day on 4/11, here are 10 home pet products that can help make dog (or cat) parenting smarter, not harder. A growing market of innovative products can help you level up your pet care. Pet parents can select gadgets and devices that make caring for their furry friends easier. From products that help you take care of indoor messes with the push of a button to devices that toss your pet a treat to keep things interesting or feed your pet while it's home alone – these smart devices make pet parenting more manageable and more enjoyable.


Playing NetHack with LLMs: Potential & Limitations as Zero-Shot Agents

Jeurissen, Dominik, Perez-Liebana, Diego, Gow, Jeremy, Cakmak, Duygu, Kwan, James

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown great success as high-level planners for zero-shot game-playing agents. However, these agents are primarily evaluated on Minecraft, where long-term planning is relatively straightforward. In contrast, agents tested in dynamic robot environments are limited by the simplicity of those environments, which contain only a few objects and interactions. To fill this gap in the literature, we present NetPlay, the first LLM-powered zero-shot agent for the challenging roguelike NetHack. NetHack is a particularly challenging environment due to its diverse set of items and monsters, complex interactions, and many ways to die. NetPlay uses an architecture designed for dynamic robot environments, modified for NetHack. Like previous approaches, it prompts the LLM to choose from predefined skills and tracks past interactions to enhance decision-making. Given NetHack's unpredictable nature, NetPlay detects important game events to interrupt running skills, enabling it to react to unforeseen circumstances. While NetPlay demonstrates considerable flexibility and proficiency in interacting with NetHack's mechanics, it struggles with ambiguous task descriptions and a lack of explicit feedback. Our findings demonstrate that NetPlay performs best with detailed context information, indicating the necessity of dynamic methods for supplying context in complex games such as NetHack.
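The event-interruption pattern the abstract describes can be illustrated with a toy control loop. This is a sketch, not NetPlay's actual architecture: `choose_skill` stands in for the LLM planner picking from predefined skills, `is_important` for the game-event detector, and the loop simply re-plans whenever an important event fires while a skill is running.

```python
# Toy sketch of a skill-interruption loop for a zero-shot game agent. All
# names (run_agent, choose_skill, is_important) are illustrative stand-ins,
# not NetPlay's real API; a real agent would query an LLM and step NetHack.

def run_agent(observations, choose_skill, is_important):
    """Replay a stream of observations; re-plan on important events."""
    history = []   # past interactions, fed back into skill selection
    skill = None   # currently running skill: a callable obs -> action
    for obs in observations:
        if skill is None or is_important(obs):
            # Interrupt the running skill and let the planner react.
            skill = choose_skill(obs, history)
        history.append((obs, skill(obs)))
    return history
```

With a stub planner, an observation stream like `["corridor", "corridor", "monster!", "corridor"]` triggers exactly two planning calls: one at the start and one at the unforeseen event, while ordinary steps keep the current skill running.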


Orca-Math: Unlocking the potential of SLMs in Grade School Math

Mitra, Arindam, Khanpour, Hamed, Rosset, Corby, Awadallah, Ahmed

arXiv.org Artificial Intelligence

Mathematical word problem-solving has long been recognized as a complex task for small language models (SLMs). A recent study hypothesized that the smallest model size needed to achieve over 80% accuracy on the GSM8K benchmark is 34 billion parameters. To reach this level of performance with smaller models, researchers often train SLMs to generate Python code or use tools to help avoid calculation errors. Additionally, they employ ensembling, where the outputs of up to 100 model runs are combined to arrive at a more accurate result. Result selection is done using consensus, majority vote, or a separate verifier model used in conjunction with the SLM. Ensembling provides a substantial boost in accuracy, but at a significant cost increase from the multiple calls to the model (e.g., Phi-GSM uses top-48 to boost the performance from 68.2 to 81.5). In this work, we present Orca-Math, a 7-billion-parameter SLM based on Mistral-7B, which achieves 86.81% on GSM8K without the need for multiple model calls or the use of verifiers, code execution, or any other external tools. Our approach has the following key elements: (1) a high-quality synthetic dataset of 200K math problems created using a multi-agent setup where agents collaborate to create the data; (2) an iterative learning technique that enables the SLM to practice solving problems, receive feedback on its solutions, and learn from preference pairs incorporating the SLM's solutions and the feedback. When trained with supervised fine-tuning alone, Orca-Math achieves 81.50% on the GSM8K pass@1 metric. With iterative preference learning, Orca-Math achieves 86.81% pass@1. Orca-Math surpasses the performance of significantly larger models such as LLAMA-2-70B, WizardMath-70B, Gemini-Pro, and ChatGPT-3.5. It also significantly outperforms other smaller models while using much smaller data (hundreds of thousands vs. millions of problems).
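The majority-vote ensembling the abstract contrasts against is simple to sketch: sample the model many times on one problem and keep the most common final answer. The abstract's point is that Orca-Math skips this machinery entirely; the function below is a generic illustration, not code from the paper.

```python
# Minimal sketch of majority-vote ensembling over repeated model runs.
# In practice "answers" would be the final numeric answers extracted from
# many sampled generations for the same math problem.
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties go to the earliest seen."""
    return Counter(answers).most_common(1)[0][0]
```

For instance, `majority_vote(["42", "41", "42"])` returns `"42"`. The accuracy gain comes at the cost of N model calls per problem, which is exactly the overhead single-call approaches like Orca-Math avoid.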