d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Zhao, Siyan, Gupta, Devaansh, Zheng, Qinqing, Grover, Aditya
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO, the first integration of policy gradient methods into masked dLLMs. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and planning benchmarks. We find that d1 yields the best performance and significantly improves the performance of a state-of-the-art dLLM. Our code is released at https://dllm-reasoning.github.io/.
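The abstract does not spell out diffu-GRPO's mechanics, but the "critic-free" label suggests a GRPO-style baseline: rewards for a group of completions sampled from the same prompt are normalized against the group's own statistics instead of a learned value function. The sketch below illustrates only that generic group-relative advantage computation; the function name and reward values are hypothetical, not taken from the paper.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Critic-free advantages: standardize each completion's reward
    against the mean and standard deviation of its own sampling group
    (the GRPO-style baseline, in place of a learned critic)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions from one prompt; binary correctness rewards.
# Above-average completions get positive advantages, others negative.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline comes from the group itself, the advantages always sum to (approximately) zero, so the policy gradient pushes probability mass from below-average completions toward above-average ones without training a separate critic.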
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Jiang, Dongzhi, Zhang, Renrui, Guo, Ziyu, Li, Yanwei, Qi, Yu, Chen, Xinyan, Wang, Liuhui, Jin, Jianhan, Guo, Claire, Yan, Shen, Zhang, Bo, Fu, Chaoyou, Gao, Peng, Li, Hongsheng
Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although their CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/
- Workflow (0.68)
- Research Report (0.50)
- Health & Medicine > Consumer Health (1.00)
- Education > Health & Safety > School Nutrition (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.50)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
Robotic dog helps those facing mental health and cognitive challenges
Jennie the artificial intelligence-powered robotic dog is designed to provide comfort and companionship to those with mental health challenges. U.S. robotics company Tombot has introduced Jennie, an innovative AI-powered robotic pet designed to provide comfort and companionship to those facing cognitive health challenges. This groundbreaking creation is set to transform the lives of millions struggling with dementia, mild cognitive impairment and various mental health issues. Jennie's inception stems from a personal tragedy experienced by Tombot CEO Tom Stevens. When his mother, Nancy, was diagnosed with Alzheimer's, the family had to make the heart-wrenching decision to rehome her beloved dog, Golden Bear.
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
- Health & Medicine > Therapeutic Area > Neurology (0.93)
Trust but Verify: Programmatic VLM Evaluation in the Wild
Prabhu, Viraj, Purushwalkam, Senthil, Yan, An, Xiong, Caiming, Xu, Ran
Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging, as it requires visually verifying each claim within the response. To construct PROVE, we provide a large language model (LLM) with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene graph to verify each QA pair. We thus construct a benchmark of 10.5k challenging but visually grounded QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose a programmatic evaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene-graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two. Vision-language models (VLMs) have emerged as an effective solution for generating responses to queries about visual content. This has led to a flurry of research on reliably benchmarking VLM performance (Liu et al., 2024a), by measuring not just the helpfulness but also the truthfulness of their responses. Existing discriminative benchmarks (Hu et al., 2023; Lovenia et al., 2023; Li et al., 2023) evaluate the model's responses to close-ended, existence-based queries ("Is there a man in this image?"). While discriminative benchmarks ease evaluation, they do not realistically simulate in-the-wild usage.
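The core idea of verifying a QA pair by executing a program over a scene graph can be illustrated with a toy example. The scene-graph schema, object names, and helper functions below are hypothetical stand-ins, not PROVE's actual representation.

```python
# Toy scene graph: objects with categories and attributes.
scene_graph = {
    "objects": {
        "man_1": {"category": "man", "attributes": ["standing"]},
        "dog_1": {"category": "dog", "attributes": ["brown"]},
    },
}

def exists(graph, category):
    """Verification program for 'Is there a <category> in the image?'"""
    return any(o["category"] == category for o in graph["objects"].values())

def verify_qa(graph, program, expected_answer):
    """Execute a generated program over the scene graph and check that
    its result matches the QA pair's expected answer."""
    return program(graph) == expected_answer

# The QA pair ("Is there a man?", yes) is grounded in the graph:
ok = verify_qa(scene_graph, lambda g: exists(g, "man"), True)
```

Because every answer is recomputed from the graph rather than trusted as written, only QA pairs whose claims are actually supported by the image representation survive into the benchmark.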
The 5 Best Prime Day Vacuum Deals We've Found (2024)
I have a perhaps inappropriate, anthropomorphic relationship with whatever robot vacuum is running in my house. No matter how much trouble they cause me--if they get trapped in the ledge by the fireplace or lost under the couch--I never forget that it's here to help me battle the chaotic mess that my two kids and two dogs perpetrate upon me daily. Have I convinced you that you need one, too? You're in luck because the Amazon Prime Day vacuum deals lineup includes five of my top picks. Whether you need an all-in-one cleaning station, a simple picker-upper after dinner, or one with an air freshener, we have you covered.
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
Dahary, Omer, Patashnik, Or, Aberman, Kfir, Cohen-Or, Daniel
Text-to-image diffusion models have an unprecedented ability to generate diverse and high-quality images. However, they often struggle to faithfully capture the intended semantics of complex input prompts that include multiple subjects. Recently, numerous layout-to-image extensions have been introduced to improve user control, aiming to localize subjects represented by specific tokens. Yet, these methods often produce semantically inaccurate images, especially when dealing with multiple semantically or visually similar subjects. In this work, we study and analyze the causes of these limitations. Our exploration reveals that the primary issue stems from inadvertent semantic leakage between subjects in the denoising process. This leakage is attributed to the diffusion model's attention layers, which tend to blend the visual features of different subjects. To address these issues, we introduce Bounded Attention, a training-free method for bounding the information flow in the sampling process. Bounded Attention prevents detrimental leakage among subjects and enables guiding the generation to promote each subject's individuality, even with complex multi-subject conditioning. Through extensive experimentation, we demonstrate that our method empowers the generation of multiple subjects that better align with given prompts and layouts.
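The bounding idea, restricting attention so tokens belonging to one subject cannot blend with tokens of another, can be sketched numerically. The layout below (flat token list with per-token subject ids, dense softmax) is a simplification for illustration, not the paper's actual attention implementation.

```python
import numpy as np

def bounded_attention(scores, subject_ids):
    """Mask attention logits so a token assigned to one subject cannot
    attend to a token of a different subject; background tokens
    (id = -1) remain visible to everyone. Training-free: the mask is
    applied at sampling time, before the softmax."""
    n = len(subject_ids)
    masked = scores.astype(float).copy()
    for q in range(n):
        for k in range(n):
            cross_subject = (
                subject_ids[q] != subject_ids[k]
                and subject_ids[q] != -1
                and subject_ids[k] != -1
            )
            if cross_subject:
                masked[q, k] = -np.inf
    # Row-wise softmax over keys.
    w = np.exp(masked - masked.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

# Two subject regions (ids 0 and 1) plus one background token (-1).
ids = [0, 0, 1, -1]
attn = bounded_attention(np.zeros((4, 4)), ids)
```

After masking, each subject's attention weights are renormalized over its own tokens and the background, so no visual features leak across subject boundaries, which is the failure mode the paper attributes to semantically similar subjects.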
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Luwu Dynamics XGO-Mini2 Review: Programmable Robotic Rover
The XGO is a lap-size robot dog, marketed as "a metal pet on your desk," but it's primarily sold as a learning tool for programmers with an interest in machine vision and robotic automation. Robot pet fans should know, however, that this metallic mutt has more in common with Boston Dynamics' ominously-styled Spot than with Sony's consciously cute Aibo, with a remarkably well-made and solidly engineered metal body. Luwu Dynamics is clear that the XGO-Mini2 is more of a tool than a companion. Also, at $849, it is much more affordable and considerably more open to tinkering than Sony's $2,900 robot pet and more than $73,000 cheaper than Boston Dynamics' robotic quadruped. The XGO range, as it's sold, is fundamentally a robot body peripheral for a Raspberry Pi compute module.
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Chen, Jun, Zhu, Deyao, Haydarov, Kilichbek, Li, Xiang, Elhoseiny, Mohamed
Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Although there have been recent advances, generating detailed and enriched video descriptions continues to be a substantial challenge. In this work, we introduce Video ChatCaptioner, an approach for creating more comprehensive spatiotemporal video descriptions. Our method employs a ChatGPT model as a controller, specifically designed to select frames for posing video content-driven questions. Subsequently, BLIP-2 is utilized to answer these visual queries. This question-answer framework effectively uncovers intricate video details and shows promise as a method for enhancing video content. Following multiple conversational rounds, ChatGPT can summarize the enriched video content based on the previous conversations. In human evaluation experiments, we found that 62.5% of participants agree that Video ChatCaptioner can cover more visual information compared to ground-truth captions.
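The controller/answerer interaction described above can be sketched as a simple loop. The two stub functions below stand in for ChatGPT (question asker) and BLIP-2 (visual answerer); their names, signatures, and outputs are hypothetical placeholders, not the paper's actual prompts or APIs.

```python
def chat_captioner(frames, ask_question, answer_question, rounds=3):
    """Controller/answerer loop: a language model picks a frame and
    poses a content-driven question, a VQA model answers it, and the
    growing dialogue conditions the next question. The final history
    would then be summarized into an enriched caption."""
    history = []
    for _ in range(rounds):
        frame_idx, question = ask_question(history, len(frames))
        answer = answer_question(frames[frame_idx], question)
        history.append((frame_idx, question, answer))
    return history

# Stub models for illustration only:
def ask_question(history, n_frames):
    i = len(history) % n_frames      # cycle through frames
    return i, f"What is happening in frame {i}?"

def answer_question(frame, question):
    return f"A description of {frame}."

history = chat_captioner(["f0", "f1"], ask_question, answer_question, rounds=2)
```

The design point is that the controller sees the full dialogue history, so each new question can probe details the previous answers left uncovered, rather than asking a fixed list of questions.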
Explainable Verbal Reasoner Plus (EVR+): A Natural Language Reasoning Framework that Supports Diverse Compositional Reasoning
Liang, Zhengzhong, Zhang, Zeyu, Bethard, Steven, Surdeanu, Mihai
Language models have been successfully applied to a variety of reasoning tasks in NLP, yet they still struggle with compositional generalization. In this paper we present Explainable Verbal Reasoner Plus (EVR+), a reasoning framework that enhances language models' compositional reasoning ability by (1) allowing the model to explicitly generate and execute symbolic operators, and (2) allowing the model to decompose a complex task into several simpler ones in a flexible manner. Compared with its predecessor Explainable Verbal Reasoner (EVR) and other previous approaches adopting similar ideas, our framework supports more diverse types of reasoning, such as nested loops and different types of recursion. To evaluate our reasoning framework, we build a synthetic dataset with five tasks that require compositional reasoning. Results show that our reasoning framework can enhance a fine-tuned language model's compositional generalization performance on the five tasks. We also discuss the possibility of, and the challenges in, combining our reasoning framework with a few-shot prompted language model.
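The two mechanisms above, generating symbolic operators that are executed outside the model and decomposing a task into simpler calls, can be illustrated with a tiny interpreter. The operator names and program format below are invented for this sketch and are not EVR+'s actual operator set.

```python
def execute(program, subroutines):
    """Tiny interpreter: each step is either a symbolic operator the
    framework executes directly ('set', 'add'), or a 'call' that
    delegates to a simpler sub-task, mirroring task decomposition."""
    env = {}
    for op, *args in program:
        if op == "set":
            name, value = args
            env[name] = value
        elif op == "add":
            name, a, b = args
            env[name] = env[a] + env[b]
        elif op == "call":
            name, sub, arg = args
            env[name] = subroutines[sub](env[arg])
        else:
            raise ValueError(f"unknown operator: {op}")
    return env

# Decompose "double x, then add y" into explicit, executable steps:
env = execute(
    [("set", "x", 3), ("set", "y", 4),
     ("call", "x2", "double", "x"),
     ("add", "out", "x2", "y")],
    {"double": lambda v: 2 * v},
)
# env["out"] == 10
```

Offloading the arithmetic and control flow to an executor means the language model only has to emit the right operators, which is exactly where compositional generalization tends to break down when the model must carry out every step in text.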
- North America > United States > Arizona > Pima County > Tucson (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Asia > China > Hong Kong (0.04)
What Are Word and Sentence Embeddings?
They are the basic building block of most language models. This article's title and TL;DR have been generated with Cohere. Get started with text generation. In old futuristic movies, such as 2001: A Space Odyssey, the main computer (HAL) was able to talk to humans and understand what they said with great ease. At the time, getting computers to understand and produce language seemed like an impossible task, but the latest large language models (LLMs) are able to do this in a way that makes it almost impossible for a human to tell if they are talking to another human, or to a computer.
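An embedding represents a word or sentence as a vector of numbers, placed so that similar meanings end up close together. A standard way to compare two embeddings is cosine similarity; the toy 3-dimensional vectors below are made up for illustration (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors:
    close to 1.0 for similar meanings, near 0.0 for unrelated ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy word embeddings: "dog" and "puppy" point in nearly the same
# direction, while "bank" points elsewhere.
emb = {
    "dog":   [0.90, 0.10, 0.00],
    "puppy": [0.85, 0.20, 0.05],
    "bank":  [0.00, 0.10, 0.95],
}

sim_related = cosine_similarity(emb["dog"], emb["puppy"])
sim_unrelated = cosine_similarity(emb["dog"], emb["bank"])
```

This geometric view is what lets language models work with meaning numerically: "nearby vector" becomes a computable stand-in for "similar meaning".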