How Doodles Became the Dog du Jour

The New Yorker

Poodle crossbreeds have grown overwhelmingly popular, sparking controversy in dog parks and kennel clubs alike. The features of doodles such as Peaches (above), a goldendoodle, have become the canine equivalent of Instagram face.

Meet the Breeds, the American Kennel Club's annual showcase of purebred dogs, took place over two eye-wateringly cold days in early February at the Javits Center, in Manhattan. About a hundred and fifty of the two hundred and five varieties recognized as official breeds by the A.K.C., the long-standing authority in the U.S. dog world, were in attendance for the public to ogle, fondle, and coo "So cute!" to, including the basset fauve de Bretagne, a hunting hound from France that's one of three newly recognized breeds recently allowed into the purebred pantheon. Some of the dogs had competed in the Westminster Kennel Club Dog Show earlier in the week, and past champions had their ribbons on display. In spite of the frigid weather, pavilions hosting the more popular breeds--the pug, the Doberman pinscher, the Great Dane, the St. Bernard--were packed. Lesser-known varieties, such as the saluki, the Löwchen, and the Lapponian herder, drew sparser crowds.

There were exhibition spaces for each breed, and on the back walls were three adjectives supposedly describing that particular type of dog's temperament. There is, in fact, no evidence that temperament is consistent within a breed, but the idea is deeply rooted in dogdom. I stopped to caress the velvety ear leather of a pharaoh hound ("Friendly, Smart, Noble"), a sprinting breed once used to hunt rabbits in Malta; accept kisses from a Portuguese water dog, bred to assist with retrieving tackle ("Affectionate, Adventurous, Athletic"); and have my photograph taken with a Leonberger, a German breed from the town of Leonberg, in southwest Germany ("Friendly, Gentle, Playful"). No one was supposed to be openly selling dogs, but, if you asked, the breeders would share their information.
Excluding what are known as companion dogs, like the Leonberger, most of the animals at the show were designed for a purpose that is no longer required of them. In Great Britain, foxhounds are legally barred from chasing foxes. Consider the fate of the otterhound, an ancient variety with a noble heritage, once used in the U.K. to hunt river otters, which were prized for their thick fur and disliked by wealthy landowners because they ate the fish in their stocked ponds.


Our Son Just Discovered a Rude Hand Gesture. My Husband Is Thoroughly Amused. I Am Not.

Slate

My 5-year-old son, "Jasper," has recently discovered flipping the bird. He loves to do it at every opportunity, which has made for some rather embarrassing situations, to put it mildly. And after he spends time with his dad, we're back at square one.


d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning

Zhao, Siyan, Gupta, Devaansh, Zheng, Qinqing, Grover, Aditya

arXiv.org Artificial Intelligence

Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved language modeling performance competitive with their AR counterparts, it remains unclear whether dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework that adapts pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce diffu-GRPO, a novel critic-free, policy-gradient-based RL algorithm and the first integration of policy-gradient methods into masked dLLMs. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and planning benchmarks. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM. Our code is released at https://dllm-reasoning.github.io/.
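To make the "critic-free, policy-gradient" idea concrete, here is a minimal sketch of a GRPO-style update rule: advantages are computed by normalizing rewards within a group of sampled completions rather than from a learned value function. This is an illustrative simplification, not the paper's exact diffu-GRPO objective, and the function names are our own.

```python
import math

def group_relative_advantages(rewards):
    """Critic-free advantages: normalize rewards within a sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # avoid division by zero when all rewards tie
    return [(r - mean) / std for r in rewards]

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style loss -A_i * log pi(completion_i); no learned critic."""
    advantages = group_relative_advantages(rewards)
    return -sum(a * lp for a, lp in zip(advantages, log_probs)) / len(rewards)
```

Because advantages are centered within each group, completions that score above their siblings are reinforced and the rest are suppressed, with no separate value network to train.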


MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Jiang, Dongzhi, Zhang, Renrui, Guo, Ziyu, Li, Yanwei, Qi, Yu, Chen, Xinyan, Wang, Liuhui, Jin, Jianhan, Guo, Claire, Yan, Shen, Zhang, Bo, Fu, Chaoyou, Gao, Peng, Li, Hongsheng

arXiv.org Artificial Intelligence

Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs and uncover several key insights: 1) models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest-quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting potentially harmful overthinking behavior; and 3) although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
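As a rough intuition for what "fine-grained" CoT grading means, here is a toy step-recall check: score a chain of thought by the fraction of reference reasoning steps it recovers. MME-CoT's actual metrics are considerably more sophisticated (and use model-based matching, not exact strings); this sketch only conveys the step-level grading idea.

```python
def step_recall(cot_steps, reference_steps):
    """Fraction of reference reasoning steps found in the model's CoT.
    Uses exact string match purely for illustration."""
    if not reference_steps:
        return 1.0
    matched = sum(1 for step in reference_steps if step in cot_steps)
    return matched / len(reference_steps)
```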


Robotic dog helps those facing mental health and cognitive challenges

FOX News

U.S. robotics company Tombot has introduced Jennie, an AI-powered robotic dog designed to provide comfort and companionship to people facing cognitive health challenges. The creation is aimed at the millions living with dementia, mild cognitive impairment and various mental health issues. Jennie's inception stems from a personal tragedy experienced by Tombot CEO Tom Stevens: when his mother, Nancy, was diagnosed with Alzheimer's, the family had to make the heart-wrenching decision to rehome her beloved dog, Golden Bear.


Trust but Verify: Programmatic VLM Evaluation in the Wild

Prabhu, Viraj, Purushwalkam, Senthil, Yan, An, Xiong, Caiming, Xu, Ran

arXiv.org Artificial Intelligence

Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging, as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a benchmarking paradigm for this setting. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene-graph object to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene-graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two. Vision-language models have emerged as an effective solution for generating responses to queries about visual content, which has led to a flurry of research on reliably benchmarking VLM performance (Liu et al., 2024a) by measuring not just the helpfulness but also the truthfulness of their responses. Existing discriminative benchmarks (Hu et al., 2023; Lovenia et al., 2023; Li et al., 2023) evaluate a model's responses to close-ended, existence-based queries ("Is there a man in this image?"); while such benchmarks ease evaluation, they do not realistically simulate in-the-wild usage.
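The core mechanism — executing a small program over a scene-graph object to verify a QA pair — can be sketched as follows. The dictionary schema and the `verify_color` program are assumptions for illustration, not the paper's actual representation or generated code.

```python
# A toy scene graph: objects with attributes, plus relations between them.
scene_graph = {
    "objects": {
        "dog1": {"name": "dog", "attributes": ["brown"]},
        "ball1": {"name": "ball", "attributes": ["red"]},
    },
    "relations": [("dog1", "chasing", "ball1")],
}

def verify_color(graph, obj_name, color):
    """Verification program for a QA pair like
    ('What color is the ball?', 'red'): check the claim against the graph."""
    return any(
        obj["name"] == obj_name and color in obj["attributes"]
        for obj in graph["objects"].values()
    )
```

Because each answer is checked by running code against the graph rather than by asking another model, the verification step is deterministic and auditable.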


The 5 Best Prime Day Vacuum Deals We've Found (2024)

WIRED

I have a perhaps inappropriate, anthropomorphic relationship with whatever robot vacuum is running in my house. No matter how much trouble they cause me--if they get trapped on the ledge by the fireplace or lost under the couch--I never forget that they're there to help me battle the chaotic mess that my two kids and two dogs perpetrate upon me daily. Have I convinced you that you need one, too? You're in luck, because the Amazon Prime Day vacuum deals lineup includes five of my top picks. Whether you need an all-in-one cleaning station, a simple picker-upper after dinner, or one with an air freshener, we have you covered.


Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation

Dahary, Omer, Patashnik, Or, Aberman, Kfir, Cohen-Or, Daniel

arXiv.org Artificial Intelligence

Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
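The leakage-prevention idea above amounts to masking attention so that tokens assigned to one subject's region cannot attend to another subject's tokens. Here is a minimal NumPy sketch of that masking; the region labels, shapes, and function are illustrative, not the paper's implementation inside a diffusion model.

```python
import numpy as np

def bounded_attention(q, k, v, regions):
    """Attention where token i may only attend to tokens with the same
    region label, blocking feature leakage across subjects.
    q, k, v: (n, d) arrays; regions: (n,) integer subject labels."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    same_region = regions[:, None] == regions[None, :]
    scores = np.where(same_region, scores, -np.inf)  # mask cross-subject pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With uniform queries, each output token becomes an average over its own region's values only, which is exactly the "bounding" of information flow the method relies on.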


Luwu Dynamics XGO-Mini2 Review: Programmable Robotic Rover

WIRED

The XGO is a lap-size robot dog, marketed as "a metal pet on your desk," but it's primarily sold as a learning tool for programmers with an interest in machine vision and robotic automation. Robot pet fans should know, however, that this metallic mutt, with its remarkably well-made and solidly engineered metal body, has more in common with Boston Dynamics' ominously styled Spot than with Sony's consciously cute Aibo. Luwu Dynamics is clear that the XGO-Mini2 is more of a tool than a companion. Also, at $849, it is much more affordable and considerably more open to tinkering than Sony's $2,900 robot pet, and more than $73,000 cheaper than Boston Dynamics' robotic quadruped. The XGO range, as it's sold, is fundamentally a robot body peripheral for a Raspberry Pi compute module.


Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Chen, Jun, Zhu, Deyao, Haydarov, Kilichbek, Li, Xiang, Elhoseiny, Mohamed

arXiv.org Artificial Intelligence

Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Despite recent advances, generating detailed and enriched video descriptions remains a substantial challenge. In this work, we introduce Video ChatCaptioner, an approach for creating more comprehensive spatiotemporal video descriptions. Our method employs a ChatGPT model as a controller, specifically designed to select frames for posing video-content-driven questions. BLIP-2 is then used to answer these visual queries. This question-answer framework effectively uncovers intricate video details and shows promise as a method for enriching video descriptions. After multiple conversational rounds, ChatGPT summarizes the enriched video content based on the previous conversation. In human evaluation experiments, nearly 62.5% of participants agreed that Video ChatCaptioner covers more visual information than ground-truth captions.
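The controller-plus-answerer loop can be sketched schematically. In this toy version the ChatGPT controller is a stub that simply targets the least-discussed frame, and the BLIP-2 answerer is an injected callable; all function names are our own, not the paper's API.

```python
def controller_ask(history, num_frames):
    """Stub controller: pick the frame asked about least so far and pose
    a question about it (a real controller would be ChatGPT with the
    dialogue history in its prompt)."""
    counts = [sum(1 for turn in history if turn["frame"] == f)
              for f in range(num_frames)]
    frame = counts.index(min(counts))
    return frame, f"What is happening in frame {frame}?"

def caption_video(num_frames, answer_fn, rounds=3):
    """Run the question-answer loop; answer_fn stands in for BLIP-2."""
    history = []
    for _ in range(rounds):
        frame, question = controller_ask(history, num_frames)
        history.append({"frame": frame, "q": question,
                        "a": answer_fn(frame, question)})
    return history  # a real controller would summarize this dialogue
```

Even this stub shows the key property of the design: the controller spreads questions across frames so the final summary is grounded in the whole video, not a single keyframe.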