lateral
- Education (0.68)
- Health & Medicine (0.46)
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player's predictions align with the reference one.
- Education (0.68)
- Health & Medicine (0.46)
Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates
Modern deployments increasingly allocate large test-time compute (thousands of tokens or many node expansions) to boost reliability. Under such budgets, standard Tree-of-Thoughts-style search exhibits two pathologies: breadth saturation (additional samples mostly produce near-duplicates, so width stops growing) and depth myopia (noisy short-horizon utilities prune branches whose payoff appears after a few more steps). We propose Lateral Tree-of-Thoughts (LToT), a drop-in controller that separates utility from logical consistency and treats low-utility but consistent candidates as assets rather than waste. The frontier is split into mainlines (high-utility candidates used for exploitation) and laterals (consistent, initially low-utility candidates that receive short, cheap probes before judgment). LToT explores laterals via Lateral Racing with Short-Circuit (LR--SC): a capped successive-halving race that spreads tiny probes across a very wide lateral set, uses width-aware thresholds with repeat-to-confirm, and immediately promotes a branch once its envelope clears the mainline bar; mainlines are kept intentionally narrow so surplus compute is invested where width is cheap. We prove a pseudolinear lateral cost $Θ(N_0 \log_η N_0)$ with logarithmically many rungs (initial lateral width $N_0$; culling factor $η>1$), in contrast to the exponential growth of uncapped mainlines. Empirical evaluations on benchmark tasks are in preparation and will be added in a future revision. In short, LToT turns large test-time budgets into principled diversity while preserving promotion discipline, mitigating saturation and myopia without inflating compute.
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- Europe > United Kingdom > England > Greater London > London (0.04)
FlowDrive: moderated flow matching with data balancing for trajectory planning
Wang, Lingguang, Taş, Ömer Şahin, Steiner, Marlon, Stiller, Christoph
Learning-based planners are sensitive to the long-tailed distribution of driving data. Common maneuvers dominate datasets, while dangerous or rare scenarios are sparse. This imbalance can bias models toward the frequent cases and degrade performance on critical scenarios. To tackle this problem, we compare balancing strategies for sampling training data and find reweighting by trajectory pattern an effective approach. We then present FlowDrive, a flow-matching trajectory planner that learns a conditional rectified flow to map noise directly to trajectory distributions with few flow-matching steps. We further introduce moderated, in-the-loop guidance that injects small perturbation between flow steps to systematically increase trajectory diversity while remaining scene-consistent. On nuPlan and the interaction-focused interPlan benchmarks, FlowDrive achieves state-of-the-art results among learning-based planners and approaches methods with rule-based refinements. After adding moderated guidance and light post-processing (FlowDrive*), it achieves overall state-of-the-art performance across nearly all benchmark splits.
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
- Asia > Japan > Honshū > Chūbu > Ishikawa Prefecture > Kanazawa (0.04)
- Information Technology > Artificial Intelligence > Robots (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.35)
Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks
Dong, Wenhan, Hu, Tianyi, Zheng, Jingyi, Sun, Zhen, Zhao, Yuemeng, Liu, Yule, He, Xinlei, Huang, Xinyi
Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Asia > China > Hong Kong (0.04)
- Asia > China > Guangdong Province > Guangzhou (0.04)
Solving Situation Puzzles with Large Language Model and External Reformulation
Li, Kun, Chen, Xinwei, Song, Tianyou, Zhou, Chengrui, Liu, Zhuoran, Zhang, Zhenyan, Guo, Jiangjian, Shan, Qing
In recent years, large language models (LLMs) have shown an impressive ability to perform arithmetic and symbolic reasoning tasks. However, we found that LLMs (e.g., ChatGPT) cannot perform well on reasoning that requires multiple rounds of dialogue, especially when solving situation puzzles. Specifically, LLMs intend to ask very detailed questions focusing on a specific aspect or same/similar questions after several rounds of Q&As. To help LLMs get out of the above dilemma, we propose a novel external reformulation methodology, where the situation puzzle will be reformulated after several rounds of Q&A or when the LLMs raise an incorrect guess. Experiments show superior performance (e.g., win rate, number of question/guess attempts) of our method than directly using LLMs for solving situation puzzles, highlighting the potential of strategic problem reformulation to enhance the reasoning capabilities of LLMs in complex interactive scenarios.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Illinois > Champaign County > Champaign (0.04)
- South America > Peru (0.04)
- (8 more...)
Investigating the effect of CPT in lateral spreading prediction using Explainable AI
Hsiao, Cheng-Hsi, Rathje, Ellen, Kumar, Krishna
This study proposes an autoencoder approach to extract latent features from cone penetration test profiles to evaluate the potential of incorporating CPT data in an AI model. We employ autoencoders to compress 200 CPT profiles of soil behavior type index (Ic) and normalized cone resistance (qc1Ncs) into ten latent features while preserving critical information. We then utilize the extracted latent features with site parameters to train XGBoost models for predicting lateral spreading occurrences in the 2011 Christchurch earthquake. Models using the latent CPT features outperformed models with conventional CPT metrics or no CPT data, achieving over 83% accuracy. Explainable AI revealed the most crucial latent feature corresponding to soil behavior between 1-3 meter depths, highlighting this depth range's criticality for liquefaction evaluation. The autoencoder approach provides an automated technique for condensing CPT profiles into informative latent features for machine-learning liquefaction models.
- North America > United States > Texas > Travis County > Austin (0.15)
- Oceania > New Zealand (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.81)
- Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.61)
- Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.61)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.50)
Thinking Fast and Laterally: Multi-Agentic Approach for Reasoning about Uncertain Emerging Events
Dernbach, Stefan, Michel, Alejandro, Agarwal, Khushbu, Brissette, Christopher, Gupta, Geetika, Choudhury, Sutanay
This paper introduces lateral thinking to implement System-2 reasoning capabilities in AI systems, focusing on anticipatory and causal reasoning under uncertainty. We present a framework for systematic generation and modeling of lateral thinking queries and evaluation datasets. We introduce Streaming Agentic Lateral Thinking (SALT), a multi-agent framework designed to process complex, low-specificity queries in streaming data environments. SALT implements lateral thinking-inspired System-2 reasoning through a dynamic communication structure between specialized agents. Our key insight is that lateral information flow across long-distance agent interactions, combined with fine-grained belief management, yields richer information contexts and enhanced reasoning. Preliminary quantitative and qualitative evaluations indicate SALT's potential to outperform single-agent systems in handling complex lateral reasoning tasks in a streaming environment.
- North America > United States (0.28)
- Asia > Taiwan (0.06)
- Asia > China (0.05)
- (6 more...)
- Energy (1.00)
- Banking & Finance (1.00)
- Government > Commerce (0.93)
- (2 more...)
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
Chen, Qi, Zhang, Bowen, Wang, Gang, Wu, Qi
While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player's predictions align with the reference one. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement-similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available at: https://github.com/chenqi008/LateralThinking.