Technology
Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning
Recent advances in Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits its broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.
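To make the consistency and volatility intuition concrete, here is a toy Python sketch. It is not CoVo's actual reward: the abstract does not give the formulas, so the two definitions below, the function names, and the input layout (one row of candidate-answer likelihoods per intermediate reasoning state) are all illustrative assumptions. The vector-space aggregation and curiosity bonus are omitted.

```python
from typing import List


def _argmax(row: List[float]) -> int:
    """Index of the largest value in a row (first index wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def consistency(likelihoods: List[List[float]], own: int) -> float:
    """ASSUMED definition: fraction of intermediate states whose most
    likely candidate answer is the trajectory's own final answer."""
    hits = sum(1 for row in likelihoods if _argmax(row) == own)
    return hits / len(likelihoods)


def volatility(likelihoods: List[List[float]], own: int) -> float:
    """ASSUMED definition: normalized 1-based position of the last state
    that favors some other candidate; 0.0 if no state ever deviates."""
    last_deviation = 0
    for step, row in enumerate(likelihoods, start=1):
        if _argmax(row) != own:
            last_deviation = step
    return last_deviation / len(likelihoods)


# Hypothetical likelihoods over two candidate answers at four states.
# Trajectory A ends on candidate 0 and settles on it early.
traj_a = [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1], [0.95, 0.05]]
# Trajectory B also ends on candidate 0 but wavers until late.
traj_b = [[0.9, 0.1], [0.3, 0.7], [0.4, 0.6], [0.55, 0.45]]

for name, traj in (("A", traj_a), ("B", traj_b)):
    print(name, consistency(traj, own=0), volatility(traj, own=0))
```

Under these assumed definitions, trajectory A scores higher consistency and lower volatility than trajectory B, which is the pattern the abstract associates with correct responses.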
BayeSQP: Bayesian Optimization through Sequential Quadratic Programming
We introduce BayeSQP, a novel algorithm for general black-box optimization that merges the structure of sequential quadratic programming with concepts from Bayesian optimization. BayeSQP employs second-order Gaussian process surrogates for both the objective and constraints to jointly model the function values, gradients, and Hessian from only zero-order information. At each iteration, a local subproblem is constructed using the GP posterior estimates and solved to obtain a search direction. Crucially, the formulation of the subproblem explicitly incorporates uncertainty in both the function and derivative estimates, resulting in a tractable second-order cone program for high probability improvements under model uncertainty. A subsequent one-dimensional line search via constrained Thompson sampling selects the next evaluation point. Empirical results show that BayeSQP outperforms state-of-the-art methods in specific high-dimensional settings. Our algorithm offers a principled and flexible framework that bridges classical optimization techniques with modern approaches to black-box optimization.
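For orientation, the classical (deterministic) SQP subproblem that this line of work builds on can be written as follows. This is the textbook form, not BayeSQP's uncertainty-aware subproblem, which the abstract does not spell out. The inequality sign convention and the symbols x_k (current iterate), d (search direction), H_k (a Hessian approximation), and c_i (constraints) are assumptions for illustration.

```latex
\begin{aligned}
\min_{d \in \mathbb{R}^n} \quad & \nabla f(x_k)^\top d + \tfrac{1}{2}\, d^\top H_k\, d \\
\text{s.t.} \quad & c_i(x_k) + \nabla c_i(x_k)^\top d \ge 0, \quad i = 1, \dots, m
\end{aligned}
```

Per the abstract, BayeSQP replaces the exact values, gradients, and Hessian in such a subproblem with GP posterior estimates and accounts for their uncertainty, which is what turns the quadratic program into a second-order cone program.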
Prior-Guided Diffusion Planning for Offline Reinforcement Learning
Diffusion models have recently gained prominence in offline reinforcement learning due to their ability to effectively learn high-performing, generalizable policies from static datasets. Diffusion-based planners facilitate long-horizon decision-making by generating high-quality trajectories through iterative denoising, guided by return-maximizing objectives. However, existing guided sampling strategies such as Classifier Guidance, Classifier-Free Guidance, and Monte Carlo Sample Selection either produce suboptimal multi-modal actions, struggle with distributional drift, or incur prohibitive inference-time costs. To address these challenges, we propose Prior Guidance (PG), a novel guided sampling framework that replaces the standard Gaussian prior of a behavior-cloned diffusion model with a learnable distribution, optimized via a behavior-regularized objective. PG directly generates high-value trajectories without costly reward optimization of the diffusion model itself, and eliminates the need to sample multiple candidates at inference for sample selection. We present an efficient training strategy that applies behavior regularization in latent space, and empirically demonstrate that PG outperforms state-of-the-art diffusion policies and planners across diverse long-horizon offline RL benchmarks. Our code is available at https://github.com/ku-dmlab/PG.
FlowerTune: A Cross-Domain Benchmark for Federated Fine-Tuning of Large Language Models
Large Language Models (LLMs) have achieved state-of-the-art results across diverse domains, yet their development remains reliant on vast amounts of publicly available data, raising concerns about data scarcity and the lack of access to domain-specific, sensitive information. Federated Learning (FL) presents a compelling framework to address these challenges by enabling decentralized fine-tuning of pre-trained LLMs without sharing raw data. However, the compatibility and performance of pre-trained LLMs in FL settings remain largely underexplored. We introduce the FlowerTune LLM Leaderboard, a first-of-its-kind benchmarking suite designed to evaluate federated fine-tuning of LLMs across four diverse domains: general NLP, finance, medical, and coding. Each domain includes federated instruction-tuning datasets and domain-specific evaluation metrics. Our results, obtained through a collaborative, open-source and community-driven approach, provide the first comprehensive comparison across 26 pre-trained LLMs with different aggregation and fine-tuning strategies under federated settings, offering actionable insights into model performance, resource constraints, and domain adaptation. This work lays the foundation for developing privacy-preserving, domain-specialized LLMs for real-world applications.
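As a rough illustration of what an aggregation strategy does in federated fine-tuning, the sketch below implements plain federated averaging (FedAvg) in pure Python. The abstract does not say which aggregation strategies the leaderboard compares, so FedAvg is an assumed example, and the function name, parameter layout, and toy values are hypothetical.

```python
from typing import Dict, List


def fedavg(
    client_weights: List[Dict[str, List[float]]],
    num_examples: List[int],
) -> Dict[str, List[float]]:
    """Average client parameters, weighting each client by its number of
    local training examples. Only parameters are shared, never raw data."""
    if len(client_weights) != len(num_examples):
        raise ValueError("one example count is required per client")
    total = sum(num_examples)
    if total <= 0:
        raise ValueError("total number of examples must be positive")

    aggregated: Dict[str, List[float]] = {}
    for name, first in client_weights[0].items():
        aggregated[name] = [
            sum(w[name][i] * n for w, n in zip(client_weights, num_examples)) / total
            for i in range(len(first))
        ]
    return aggregated


# Two hypothetical clients sharing one small parameter vector
# (for example, adapter weights produced by local fine-tuning).
clients = [{"adapter": [1.0, 2.0]}, {"adapter": [3.0, 4.0]}]
global_update = fedavg(clients, num_examples=[1, 3])
print(global_update)
```

The client with three times as many examples pulls the average three times as hard, which is the basic design choice that alternative aggregation strategies modify.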
Object State Recognition (figure: Initial State, Transitioning State, End State; example LLM prompt: "Please provide the initial, transitioning, and end states for slicing a lemon")
Recognizing the physical states of objects and their transformations within videos is crucial for structured video understanding and enabling robust real-world applications, such as robotic manipulation. However, pretrained vision-language models often struggle to capture these nuanced dynamics and their temporal context, and specialized object state recognition frameworks may not generalize to unseen actions or objects. We introduce SAGE (State-Action Graph Embeddings), a novel framework that offers a unified model of physical state transitions by decomposing states into fine-grained, language-described visual concepts that are sharable across different objects and actions. SAGE initially leverages Large Language Models to construct a State-Action Graph, which is then multimodally refined using Vision-Language Models. Extensive experiments show that our method significantly outperforms baselines and generalizes effectively to unseen objects and actions in open-world settings. SAGE improves on the prior state of the art by as much as 14.6% on novel state recognition while using less than 5% of its inference time.
Ranking-based Preference Optimization for Diffusion Models from Implicit User Feedback
Direct preference optimization (DPO) methods have shown strong potential in aligning text-to-image diffusion models with human preferences by training on paired comparisons. These methods improve training stability by avoiding the REINFORCE algorithm but still struggle with challenges such as accurately estimating image probabilities due to the non-linear nature of the sigmoid function and the limited diversity of offline datasets. In this paper, we introduce Diffusion Denoising Ranking Optimization (Diffusion-DRO), a new preference learning framework grounded in inverse reinforcement learning. Diffusion-DRO removes the dependency on a reward model by casting preference learning as a ranking problem, thereby simplifying the training objective into a denoising formulation and overcoming the non-linear estimation issues found in prior methods. Moreover, Diffusion-DRO uniquely integrates offline expert demonstrations with online policy-generated negative samples, enabling it to effectively capture human preferences while addressing the limitations of offline data. Comprehensive experiments show that Diffusion-DRO delivers improved generation quality across a range of challenging and unseen prompts, outperforming state-of-the-art baselines in both quantitative metrics and user studies.
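For context on the sigmoid non-linearity mentioned above, the standard DPO objective (as originally formulated for language models) is shown below. It is background only and is not the Diffusion-DRO objective, which the abstract does not give. Here x is the prompt, y_w and y_l are the preferred and rejected samples, \pi_\theta is the model being trained, \pi_{\mathrm{ref}} is a frozen reference model, and \beta is a temperature-like hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

The likelihood ratios sit inside the sigmoid \sigma, so any error in estimating image probabilities passes through a non-linear function, which is the estimation difficulty the abstract attributes to prior DPO-style methods.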
Israel launches fresh strikes on Lebanon despite Trump criticism
Israeli forces have carried out new strikes in southern Lebanon, state media say, despite renewed criticism from US President Donald Trump of Israel's actions in the country. Israeli drone strikes injured several people in Mansouri and Aaziyyeh on Wednesday, while jets attacked Nabatieh al-Fawqa and Kfar Tebnit, Lebanon's National News Agency reported. Israel's military has not commented, but it did say five soldiers were injured in a drone attack in Lebanon by the Iran-backed armed group Hezbollah. Mediator Pakistan has said the deal between the US and Iran to end the war includes Lebanon. On Tuesday, Trump said Israel's prime minister needed to be more responsible with respect to Lebanon.
Will it take a 'Chernobyl-scale disaster' for us to regulate cyber weapons of mass destruction? Stuart Russell
'The CEOs are telling us, "We're on track to create superhuman intelligence, which has a good chance of causing human extinction."' The AI company Anthropic has been making major headlines recently. Its trillion-dollar IPO plan and its blood feud with secretary of defense Pete Hegseth have attracted much attention, but two other events may be even more consequential.
Interactive. Violent. Gross. Inside Fishtank, the Unhinged Future of Reality TV
WIRED goes on location--and on camera--with the cult hit. On March 16, 2026, at 5:45 pm in a leafy suburb of Atlanta called Sandy Springs, police pound on the door of a neglected French Country-style mansion, rifles at the ready, bodycams rolling. Minutes earlier, a distress call came from someone claiming to be hiding from a gunman in the mansion's downstairs bathroom. The dispatcher heard a gunshot ring out in the distance, then the line disconnected. "Open the door!" an officer yells. A calm young man with a mullet and woolly eyebrows steps out, hands raised. The police ask him who else is in the house. "Just my friends," he replies, as seven other young people, men and women, silently file out behind him, less evidently relaxed. They remain outside while two officers search the house. Inside the mansion there are no immediate signs of a massacre, but the decor alone arouses suspicion. All of the windows are frosted over, so only a chilly light leaks in. The place is a mess, and the walls are adorned with lurid, seemingly AI-generated art: a frowning baby holding an assault rifle, a rubber ducky bobbing in a mug of what looks like black coffee, a lidless and levitating eyeball crying into a martini glass. The rooms are painted primary colors, grass green and cherry red, like a kindergarten class. A vape dangles from a doorframe by a chain, suspended at mouth level. The pantry is practically empty. The bedroom is a dormitory featuring seven identical twin beds. No one is hiding in the bathroom. The call, it seems, was a prank. The police return to the driveway and ask, "What is it that you guys are doing here?" "We're just livestreaming," says a man in a camo hat named Matt. "You guys don't have any firearms or anything inside the house?" There are guns in the house, Matt says, for self-defense. Fans of their livestream can be obsessive, he explains, and tend to have perverse ideas about jokes. The officer asks to see their weapons, and they go downstairs. 
The room is cluttered with ergonomic swivel chairs, desks strewn with takeout containers and energy drinks, two flatscreen TVs, and a dozen computer monitors.