d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Zhao, Siyan, Gupta, Devaansh, Zheng, Qinqing, Grover, Aditya
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient-based RL algorithm called diffu-GRPO, the first integration of policy gradient methods into masked dLLMs. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and planning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM. Our code is released at https://dllm-reasoning.github.io/.
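The abstract's diffu-GRPO is described as critic-free and policy-gradient-based, in the GRPO family. A minimal sketch of the group-relative advantage that makes such algorithms critic-free is below; the function name and details are illustrative assumptions, not the authors' implementation of diffu-GRPO itself.

```python
# Hypothetical sketch: GRPO-style group-relative advantages. Each completion's
# reward is normalized against the other completions sampled for the same
# prompt, replacing a learned value-function (critic) baseline.

def group_relative_advantages(rewards):
    """Return per-completion advantages normalized within the sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        # All completions tied: no relative signal, hence zero gradient.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Each completion's log-probability gradient would then be weighted by its advantage, so above-average completions in the group are reinforced and below-average ones suppressed.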
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Payoungkhamdee, Patomporn, Tuchinda, Pume, Baek, Jinheon, Cahyawijaya, Samuel, Udomcharoenchaikit, Can, Manakul, Potsawee, Limkonchotiwat, Peerat, Chuangsuwanich, Ekapol, Nutanong, Sarana
Multi-step reasoning is essential for large language models (LLMs), yet multilingual performance remains challenging. While Chain-of-Thought (CoT) prompting improves reasoning, it struggles with non-English languages due to the entanglement of reasoning and execution. Program-of-Thought (PoT) prompting separates reasoning from execution, offering a promising alternative but shifting the challenge to generating programs from non-English questions. We propose a framework to evaluate PoT by separating multilingual reasoning from code execution to examine (i) the impact of fine-tuning on question-reasoning alignment and (ii) how reasoning quality affects answer correctness. Our findings demonstrate that PoT fine-tuning substantially enhances multilingual reasoning, outperforming CoT fine-tuned models. We further demonstrate a strong correlation between reasoning quality (measured through code quality) and answer accuracy, highlighting its potential as a test-time performance improvement heuristic.
- Asia > Thailand > Bangkok > Bangkok (0.05)
- Asia > Singapore (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- (3 more...)
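The PoT setup in the abstract above separates reasoning (the model writes a program) from execution (an interpreter runs it). A hedged toy illustration of that split, where the "generated" program is an invented stand-in for model output:

```python
# Illustrative Program-of-Thought execution step. The sandbox here is a toy
# (empty __builtins__), not production-safe isolation.

def execute_pot_program(program: str):
    """Run a model-generated program and return its `answer` variable."""
    namespace = {}
    exec(program, {"__builtins__": {}}, namespace)
    return namespace.get("answer")

# Invented example of what a model might emit for: "A shop sells 12 apples
# a day for 7 days. How many apples in total?"
generated = "apples_per_day = 12\ndays = 7\nanswer = apples_per_day * days"
```

Because the arithmetic is delegated to the interpreter, the model's burden in a non-English setting reduces to mapping the question into correct code, which is exactly the alignment the paper evaluates.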
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement
Xie, Yuxi, Goyal, Anirudh, Wu, Xiaobao, Yin, Xunjian, Xu, Xiao, Kan, Min-Yen, Pan, Liangming, Wang, William Yang
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. However, existing approaches typically implement iterative refinement at the application or prompting level, relying on autoregressive (AR) modeling. The sequential token generation in AR models can lead to high inference latency. To overcome these challenges, we propose Context-Wise Order-Agnostic Language Modeling (COrAL), which incorporates iterative refinement directly into the LLM architecture while maintaining computational efficiency. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally during the generation process. Leveraging the order-agnostic nature of COrAL, we introduce sliding blockwise order-agnostic decoding, which performs multi-token forward prediction and backward reconstruction within context windows. This allows the model to iteratively refine its outputs in parallel in the sliding block, effectively capturing diverse dependencies without the high inference cost of sequential generation. Empirical evaluations on reasoning tasks demonstrate that COrAL improves both performance and inference speed, achieving absolute accuracy gains of $4.6\%$ on GSM8K and $4.0\%$ on LogiQA, along with inference speedups of up to $3.9\times$ over next-token baselines. Preliminary results on code generation indicate a drop in pass rates due to inconsistencies in order-agnostic outputs, highlighting the inherent quality--speed trade-off. Our code is publicly available at https://github.com/YuxiXie/COrAL.
- Europe > Austria > Vienna (0.15)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- (20 more...)
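The COrAL abstract describes decoding over overlapping sliding blocks, with forward prediction and backward reconstruction inside each window. A schematic sketch of such a window schedule follows; the block size, stride, and function name are illustrative assumptions, not the paper's actual hyperparameters or code.

```python
# Hypothetical sliding-block schedule: overlapping (start, end) windows over a
# sequence. Within each window a COrAL-style model would predict later tokens
# and re-check (refine) earlier ones in parallel.

def sliding_blocks(seq_len: int, block: int, stride: int):
    """Yield (start, end) index pairs for overlapping refinement windows."""
    start = 0
    while start < seq_len:
        yield (start, min(start + block, seq_len))
        start += stride
```

With a stride smaller than the block size, consecutive windows overlap, so each position is visited more than once and can be revised after its initial prediction.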
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning
Xie, Yuxi, Goyal, Anirudh, Zheng, Wenyue, Kan, Min-Yen, Lillicrap, Timothy P., Kawaguchi, Kenji, Shieh, Michael
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process inspired by the successful strategy employed by AlphaZero. Our work leverages Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. To enhance consistency in intermediate steps, we combine outcome validation and stepwise self-evaluation, continually updating the quality assessment of newly generated data. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data. Theoretical analysis reveals the importance of using on-policy sampled data for successful self-improving. Extensive evaluations on various arithmetic and commonsense reasoning tasks demonstrate remarkable performance improvements over existing models. For instance, our approach outperforms the Mistral-7B Supervised Fine-Tuning (SFT) baseline on GSM8K, MATH, and ARC-C, with substantial increases in accuracy to $81.8\%$ (+$5.9\%$), $34.7\%$ (+$5.8\%$), and $76.4\%$ (+$15.8\%$), respectively. Additionally, our research delves into the training and inference compute tradeoff, providing insights into how our method effectively maximizes performance gains. Our code is publicly available at https://github.com/YuxiXie/MCTS-DPO.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Singapore (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (10 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
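The MCTS paper above optimizes step-level preference pairs with Direct Preference Optimization (DPO). As a reference point, here is the standard DPO loss on a single (chosen, rejected) pair over scalar log-probabilities; the beta value and variable names are generic, not the paper's exact configuration.

```python
# Standard DPO loss: -log sigmoid(beta * ((logp_c - ref_c) - (logp_r - ref_r))),
# where ref_* are log-probs under a frozen reference policy.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Loss is small when the policy prefers the chosen response more than the
    reference does, and large when it prefers the rejected one."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

In the paper's setting, the (chosen, rejected) pairs come from MCTS rollouts scored by outcome validation and stepwise self-evaluation, rather than from human annotation.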
Tab-CoT: Zero-shot Tabular Chain of Thought
Chain-of-thought (CoT) prompting methods have been successful in various natural language processing (NLP) tasks thanks to their ability to unveil the underlying complex reasoning processes. Such reasoning processes typically exhibit implicitly structured steps. Recent efforts have also begun investigating methods to encourage more explicitly structured reasoning procedures to be captured. In this work, we propose Tab-CoT, a novel tabular-format CoT prompting method, which allows the complex reasoning process to be explicitly modelled in a highly structured manner. Despite its simplicity, we show that our approach is capable of performing reasoning across multiple dimensions (i.e., both rows and columns). We demonstrate our approach's strong zero-shot and few-shot capabilities through extensive experiments on a range of reasoning tasks.
- Asia > Singapore (0.04)
- North America > United States > Pennsylvania (0.04)
- North America > United States > California > Los Angeles County > Beverly Hills (0.04)
- (2 more...)
- Research Report (1.00)
- Workflow (0.68)
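Tab-CoT, described above, prompts the model to lay out its reasoning as a table rather than free-form text. A hedged sketch of how such a prompt might be assembled; the column header below follows a common step/subquestion/process/result layout and is an assumption, not necessarily the paper's exact template.

```python
# Hypothetical Tab-CoT-style prompt builder: the question is followed by a
# table header whose columns structure the model's step-by-step reasoning.

def tab_cot_prompt(question: str,
                   columns=("step", "subquestion", "process", "result")):
    """Return a zero-shot prompt asking the model to fill a reasoning table."""
    header = "|" + "|".join(columns) + "|"
    return f"{question}\n{header}"
```

The model then continues the table row by row, so each reasoning step is forced into the same explicit column structure, which is what enables reasoning "across both rows and columns."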
30 Great Deals at Best Buy, Target, and Other Amazon Prime Day Rivals (Updated)
Prime Day is nearly over, and while Amazon still has plenty of discounts, so do its competitors. We've gathered up corresponding deals from Walmart, Target, Best Buy, and other stores. You won't need a membership to shop these sales, but you should keep in mind that this is just the start of the holiday shopping season. Black Friday and Cyber Monday are just over a month away, and we'll be covering those sales, too. Note: We strike through items that sell out or rise in price as we update this guide.
- Information Technology > Artificial Intelligence (0.54)
- Information Technology > Communications > Mobile (0.31)
A new fleet of autonomous robots is now making one of the world's oldest foods
In the beginning, archaeologists believe, the first breads were created using some of the most rudimentary technologies in human history: fire and stone. In the region that now encompasses Jordan, one of the world's most ancient examples -- a flatbread vaguely resembling pita and made from wild cereal grains and water -- was cooked in large fireplaces using flat basalt stones, according to Reuters. The taste is "gritty and salty," Amaia Arranz-Otaegui, a University of Copenhagen postdoctoral researcher in archaeobotany, told the news service. "But it is a bit sweet, as well." More than 10,000 years later, bread has clearly evolved but, perhaps, not as dramatically as the technology being used to bake it.
- Europe > Denmark > Capital Region > Copenhagen (0.25)
- Asia > Middle East > Jordan (0.25)
- North America > United States > Nevada > Clark County > Las Vegas (0.05)
- North America > United States > Colorado (0.05)
Meet the BreadBot: Autonomous bread-making robot bakes 10 loaves every hour
It could be the best thing since sliced bread. A family-owned baking business is attempting to disrupt how you get your next loaf at the grocery store with the first fully automated bread-making machine. Called the BreadBot, the machine, which mixes, kneads and bakes bread in just 90 minutes, was unveiled by Washington-based Wilkinson Baking Company at the Consumer Electronics Show in Las Vegas. Up to 10 loaves are ready to pick up from a vending machine every hour, and its creators say the BreadBot does as good a job as a human baker - so good that the machines are expected to land in major grocery stores soon. BreadBot can make just about any kind of loaf you want, including whole wheat, nine grain, honey oat and rye.
- Retail (0.62)
- Consumer Products & Services > Food, Beverage, Tobacco & Cannabis (0.62)
- Semiconductors & Electronics (0.59)
How To Score A Summer Internship At SpaceX
But this year's going to be a little different. The aerospace engineering major is headed to SpaceX headquarters outside Los Angeles for her first summer internship. Andrews is SpaceX's 2017 'Brooke Owens Fellow' – one of 36 undergraduate women working in paid positions at aviation and space companies across the country. Other young ladies in the inaugural class of fellows will join up with brands including Orbital ATK, Blue Origin, and Virgin Orbit. Here are a few of the Georgia Tech junior's hottest tips on how to get in with the Mars-bound crew at SpaceX – the company promising to start up its space tourism program in 2018: Andrews started her engineering career early.
- Aerospace & Defense (1.00)
- Transportation > Air (0.59)