- North America > United States > New York (0.05)
- Asia > Middle East > Jordan (0.04)
Near-Optimal Goal-Oriented Reinforcement Learning in Non-Stationary Environments
The different roles of c and P in this lower bound inspire us to design algorithms that estimate costs and transitions separately. Specifically, assuming the knowledge of c and P, we develop a simple but sub-optimal algorithm and another more involved minimax optimal algorithm (up to logarithmic terms). These algorithms combine the ideas of finite-horizon approximation [Chen et al., 2022a], special Bernstein-style bonuses of the MVP algorithm [Zhang et al., 2020], adaptive confidence widening [Wei and Luo, 2021], as well as some new techniques such as properly penalizing long-horizon policies. Finally, when c and P are unknown, we develop a variant of the MASTER algorithm [Wei and Luo, 2021] and integrate the aforementioned ideas into it to achieve O(min{B⋆S
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > Barbados (0.04)
- Asia > Middle East > Jordan (0.04)
c74214a3877c4d8297ac96217d5189b7-Paper.pdf
However, the resulting methods often suffer from high computational complexity, which has reduced their practical applicability. For example, in the case of multiclass logistic regression, the aggregating forecaster (Foster et al. (2018)) achieves a regret of O(log(Bn)) whereas Online Newton Step achieves O(e^B log(n)), obtaining a double exponential gain in B (a bound on the norm of comparative functions).
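As a sanity check on the gap just described, one can plug moderate values of B and n into the two rates. This is a sketch for illustration only: all constant factors are assumed to be 1, which the O-notation does not actually license.

```python
import math

def aggregating_bound(B, n):
    # O(log(B*n)) regret rate of the aggregating forecaster (constants assumed 1)
    return math.log(B * n)

def ons_bound(B, n):
    # O(e^B * log(n)) regret rate of Online Newton Step (constants assumed 1)
    return math.exp(B) * math.log(n)

# The gap widens rapidly in B: already at B = 10 the ONS rate dwarfs
# the logarithmic one for any reasonable n.
for B in (1, 5, 10):
    print(B, aggregating_bound(B, 10**6), ons_bound(B, 10**6))
```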
- Europe > France > Île-de-France > Paris > Paris (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
Understanding Global Feature Contributions With Additive Importance Measures
Most recent research has addressed this by focusing on local interpretability, which explains a model's individual predictions (e.g., the role of each feature in a patient's diagnosis) [25, 30, 34, 38]. Two special cases are S = ∅ and S = D, which respectively correspond to the mean prediction f_∅(x_∅) = E[f(X)] and the full model prediction f_D(x_D) = f(x).
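The restricted prediction behind these two special cases can be sketched with a marginal-expectation estimate over a background sample. The model f, its weights, and the data below are all hypothetical, chosen only to show that S = ∅ recovers the mean prediction E[f(X)] and S = D recovers the full prediction f(x):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))     # hypothetical background data for the expectation
w = np.array([1.0, -2.0, 0.5])     # hypothetical linear model weights

def f(x):
    return x @ w                   # the model being explained

def f_S(x, S):
    """Restricted prediction f_S(x): fix the features in S to x's values
    and average the model over the background sample for the rest
    (marginal-expectation form; a sketch, not the paper's exact estimator)."""
    Xs = X.copy()
    Xs[:, S] = x[S]                # condition on the subset S
    return f(Xs).mean()

x = np.array([0.3, 1.0, -0.5])
print(f_S(x, []))                  # S = empty set -> mean prediction E[f(X)]
print(f_S(x, [0, 1, 2]), f(x))     # S = D -> full model prediction f(x)
```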
- North America > United States > Washington > King County > Seattle (0.05)
- North America > United States > Washington > King County > Redmond (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > District of Columbia > Washington (0.05)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > Germany (0.04)
3e6260b81898beacda3d16db379ed329-Supplemental.pdf
Moreover, we set the initial distribution ξ₁ to be uniform over S. As mentioned in the discussion following Theorem 4.1, it holds that D_VA ≤ D_FQI. These findings also shed light on the minimax optimality of the OPE problem. The bound, ∑_{h=1}^H ‖v_h‖_{Λ_h⁻¹}, is tighter. Here, taking the maximum with 1 is to deal with the situation where V̂_h V̂^π_{h+1}(·,·) is close to zero or negative, and the second 1 is to account for the variance of the rewards.
The LoCA Regret: A Consistent Metric to Evaluate Model-Based Behavior in Reinforcement Learning -- Supplementary Material -- A Tabular Experiments
For all tabular experiments, we used ε-greedy exploration with ε = 0.1. Furthermore, during pretraining and training, we used a maximum episode length of 100. For evaluation, we set ε = 0, and ran 10 evaluation episodes. We used a fixed step-size α for all tabular experiments. Therefore, there is stochasticity in the update target even in deterministic environments due to exploration of the behavior policy.
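The action-selection rule in this setting (ε-greedy with ε = 0.1 during training, ε = 0 at evaluation) can be sketched as follows; the Q-values below are placeholders, not from the experiments:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a uniformly random action,
    otherwise pick the greedy (argmax-Q) action.
    epsilon = 0.1 matches the training setting above; epsilon = 0
    recovers the purely greedy evaluation policy."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q = [0.2, 0.9, 0.1]                    # placeholder Q-values for one state
print(epsilon_greedy(q, 0.0))          # evaluation: always the greedy action
print(epsilon_greedy(q, 0.1))          # training: occasionally explores
```

Because the behavior policy explores with probability ε, the bootstrapped update target varies across visits to the same state even when the environment itself is deterministic, which is the stochasticity the paragraph above refers to.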
VeriCoT: Neuro-symbolic Chain-of-Thought Validation via Logical Consistency Checks
Feng, Yu, Weir, Nathaniel, Bostrom, Kaj, Bayless, Sam, Cassel, Darion, Chaudhary, Sapana, Kiesl-Reiter, Benjamin, Rangwala, Huzefa
LLMs can perform multi-step reasoning through Chain-of-Thought (CoT), but they cannot reliably verify their own logic. Even when they reach correct answers, the underlying reasoning may be flawed, undermining trust in high-stakes scenarios. To mitigate this issue, we introduce VeriCoT, a neuro-symbolic method that extracts and verifies formal logical arguments from CoT reasoning. VeriCoT formalizes each CoT reasoning step into first-order logic and identifies premises that ground the argument in source context, commonsense knowledge, or prior reasoning steps. The symbolic representation enables automated solvers to verify logical validity while the NL premises allow humans and systems to identify ungrounded or fallacious reasoning steps. Experiments on the ProofWriter, LegalBench, and BioASQ datasets show VeriCoT effectively identifies flawed reasoning, and serves as a strong predictor of final answer correctness. We also leverage VeriCoT's verification signal for (1) inference-time self-reflection, (2) supervised fine-tuning (SFT) on VeriCoT-distilled datasets and (3) preference fine-tuning (PFT) with direct preference optimization (DPO) using verification-based pairwise rewards, further improving reasoning validity and accuracy.
- Europe > Austria > Vienna (0.14)
- Europe > Switzerland (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- (7 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
- Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (0.75)
- Information Technology > Artificial Intelligence > Natural Language > Generation (0.67)
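The solver step at the core of a pipeline like VeriCoT's can be illustrated with a toy stand-in: a brute-force propositional entailment check in place of a first-order solver. The encoding of the reasoning step below is entirely hypothetical (invented variable names and formulas), meant only to show how a symbolic check flags whether a conclusion follows from its premises:

```python
from itertools import product

def entails(premises, conclusion, variables):
    """Brute-force propositional entailment: the step is valid iff the
    conclusion holds in every truth assignment satisfying all premises.
    (A toy stand-in for a first-order solver; formulas are Python
    lambdas over a dict mapping variable names to booleans.)"""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False               # counter-model found: step is unsound
    return True

# Hypothetical encoding of one reasoning step:
#   premise 1: "if the contract is signed, it is binding"  (signed -> binding)
#   premise 2: "the contract is signed"
premises = [lambda e: (not e["signed"]) or e["binding"],
            lambda e: e["signed"]]
conclusion = lambda e: e["binding"]
print(entails(premises, conclusion, ["signed", "binding"]))  # a valid step
```

A conclusion that does not follow (e.g., "the contract is not binding") would fail the check, which is the kind of signal the abstract describes using for self-reflection and fine-tuning rewards.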