AITopics | lateral

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

Neural Information Processing SystemsMar-21-2026, 14:58:22 GMT

While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player's predictions align with the reference one.

artificial intelligence, large language model, natural language, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Add feedback

914b372356d58d9e9357b29332cb8fdc-Paper-Conference.pdf

Neural Information Processing SystemsFeb-16-2026, 15:53:53 GMT

large language model, machine learning, puzzle, (18 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Education (0.68)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

Add feedback

914b372356d58d9e9357b29332cb8fdc-Paper-Conference.pdf

Neural Information Processing SystemsOct-10-2025, 09:38:56 GMT

benchmark, lateral, puzzle, (15 more...)

Neural Information Processing Systems

Genre: Research Report > Experimental Study (1.00)

Industry:

Education (0.68)
Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.78)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.49)

Add feedback

Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates

Madahar, Abhinav

arXiv.org Artificial IntelligenceOct-3-2025

Modern deployments increasingly allocate large test-time compute (thousands of tokens or many node expansions) to boost reliability. Under such budgets, standard Tree-of-Thoughts-style search exhibits two pathologies: breadth saturation (additional samples mostly produce near-duplicates, so width stops growing) and depth myopia (noisy short-horizon utilities prune branches whose payoff appears after a few more steps). We propose Lateral Tree-of-Thoughts (LToT), a drop-in controller that separates utility from logical consistency and treats low-utility but consistent candidates as assets rather than waste. The frontier is split into mainlines (high-utility candidates used for exploitation) and laterals (consistent, initially low-utility candidates that receive short, cheap probes before judgment). LToT explores laterals via Lateral Racing with Short-Circuit (LR--SC): a capped successive-halving race that spreads tiny probes across a very wide lateral set, uses width-aware thresholds with repeat-to-confirm, and immediately promotes a branch once its envelope clears the mainline bar; mainlines are kept intentionally narrow so surplus compute is invested where width is cheap. We prove a pseudolinear lateral cost $Θ(N_0 \log_η N_0)$ with logarithmically many rungs (initial lateral width $N_0$; culling factor $η>1$), in contrast to the exponential growth of uncapped mainlines. Empirical evaluations on benchmark tasks are in preparation and will be added in a future revision. In short, LToT turns large test-time budgets into principled diversity while preserving promotion discipline, mitigating saturation and myopia without inflating compute.

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.015

Genre: Research Report (1.00)

Industry: Health & Medicine (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

FlowDrive: moderated flow matching with data balancing for trajectory planning

Wang, Lingguang, Taş, Ömer Şahin, Steiner, Marlon, Stiller, Christoph

arXiv.org Artificial IntelligenceSep-29-2025

Learning-based planners are sensitive to the long-tailed distribution of driving data. Common maneuvers dominate datasets, while dangerous or rare scenarios are sparse. This imbalance can bias models toward the frequent cases and degrade performance on critical scenarios. To tackle this problem, we compare balancing strategies for sampling training data and find reweighting by trajectory pattern an effective approach. We then present FlowDrive, a flow-matching trajectory planner that learns a conditional rectified flow to map noise directly to trajectory distributions with few flow-matching steps. We further introduce moderated, in-the-loop guidance that injects small perturbation between flow steps to systematically increase trajectory diversity while remaining scene-consistent. On nuPlan and the interaction-focused interPlan benchmarks, FlowDrive achieves state-of-the-art results among learning-based planners and approaches methods with rule-based refinements. After adding moderated guidance and light post-processing (FlowDrive*), it achieves overall state-of-the-art performance across nearly all benchmark splits.

large language model, machine learning, trajectory, (18 more...)

arXiv.org Artificial Intelligence

2509.21961

Genre: Research Report (0.65)

Industry: Transportation > Ground > Road (0.30)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Rule-Based Reasoning (0.35)

Add feedback

Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks

Dong, Wenhan, Hu, Tianyi, Zheng, Jingyi, Sun, Zhen, Zhao, Yuemeng, Liu, Yule, He, Xinlei, Huang, Xinyi

arXiv.org Artificial IntelligenceJun-2-2025

Multi-round incomplete information tasks are crucial for evaluating the lateral thinking capabilities of large language models (LLMs). Currently, research primarily relies on multiple benchmarks and automated evaluation metrics to assess these abilities. However, our study reveals novel insights into the limitations of existing methods, as they often yield misleading results that fail to uncover key issues, such as shortcut-taking behaviors, rigid patterns, and premature task termination. These issues obscure the true reasoning capabilities of LLMs and undermine the reliability of evaluations. To address these limitations, we propose a refined set of evaluation standards, including inspection of reasoning paths, diversified assessment metrics, and comparative analyses with human performance.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.23843

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)

Add feedback

Solving Situation Puzzles with Large Language Model and External Reformulation

Li, Kun, Chen, Xinwei, Song, Tianyou, Zhou, Chengrui, Liu, Zhuoran, Zhang, Zhenyan, Guo, Jiangjian, Shan, Qing

arXiv.org Artificial IntelligenceMar-24-2025

In recent years, large language models (LLMs) have shown an impressive ability to perform arithmetic and symbolic reasoning tasks. However, we found that LLMs (e.g., ChatGPT) cannot perform well on reasoning that requires multiple rounds of dialogue, especially when solving situation puzzles. Specifically, LLMs intend to ask very detailed questions focusing on a specific aspect or same/similar questions after several rounds of Q&As. To help LLMs get out of the above dilemma, we propose a novel external reformulation methodology, where the situation puzzle will be reformulated after several rounds of Q&A or when the LLMs raise an incorrect guess. Experiments show superior performance (e.g., win rate, number of question/guess attempts) of our method than directly using LLMs for solving situation puzzles, highlighting the potential of strategic problem reformulation to enhance the reasoning capabilities of LLMs in complex interactive scenarios.

large language model, machine learning, puzzle, (17 more...)

arXiv.org Artificial Intelligence

2503.18394

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > United States > Illinois > Champaign County > Champaign (0.04)
South America > Peru (0.04)
(8 more...)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Investigating the effect of CPT in lateral spreading prediction using Explainable AI

Hsiao, Cheng-Hsi, Rathje, Ellen, Kumar, Krishna

arXiv.org Artificial IntelligenceMar-17-2025

This study proposes an autoencoder approach to extract latent features from cone penetration test profiles to evaluate the potential of incorporating CPT data in an AI model. We employ autoencoders to compress 200 CPT profiles of soil behavior type index (Ic) and normalized cone resistance (qc1Ncs) into ten latent features while preserving critical information. We then utilize the extracted latent features with site parameters to train XGBoost models for predicting lateral spreading occurrences in the 2011 Christchurch earthquake. Models using the latent CPT features outperformed models with conventional CPT metrics or no CPT data, achieving over 83% accuracy. Explainable AI revealed the most crucial latent feature corresponding to soil behavior between 1-3 meter depths, highlighting this depth range's criticality for liquefaction evaluation. The autoencoder approach provides an automated technique for condensing CPT profiles into informative latent features for machine-learning liquefaction models.

artificial intelligence, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2503.13389

Country:

North America > United States > Texas > Travis County > Austin (0.15)
Oceania > New Zealand (0.04)

Genre: Research Report > New Finding (0.95)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.81)
Information Technology > Artificial Intelligence > Natural Language > Explanation & Argumentation (0.61)
Information Technology > Artificial Intelligence > Issues > Social & Ethical Issues (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.50)

Add feedback

Thinking Fast and Laterally: Multi-Agentic Approach for Reasoning about Uncertain Emerging Events

Dernbach, Stefan, Michel, Alejandro, Agarwal, Khushbu, Brissette, Christopher, Gupta, Geetika, Choudhury, Sutanay

arXiv.org Artificial IntelligenceDec-10-2024

This paper introduces lateral thinking to implement System-2 reasoning capabilities in AI systems, focusing on anticipatory and causal reasoning under uncertainty. We present a framework for systematic generation and modeling of lateral thinking queries and evaluation datasets. We introduce Streaming Agentic Lateral Thinking (SALT), a multi-agent framework designed to process complex, low-specificity queries in streaming data environments. SALT implements lateral thinking-inspired System-2 reasoning through a dynamic communication structure between specialized agents. Our key insight is that lateral information flow across long-distance agent interactions, combined with fine-grained belief management, yields richer information contexts and enhanced reasoning. Preliminary quantitative and qualitative evaluations indicate SALT's potential to outperform single-agent systems in handling complex lateral reasoning tasks in a streaming environment.

agent, query, reasoning, (16 more...)

arXiv.org Artificial Intelligence

2412.07977

Country:

North America > United States (0.28)
Asia > Taiwan (0.06)
Asia > China (0.05)
(6 more...)

Genre: Research Report (0.82)

Industry:

Energy (1.00)
Banking & Finance (1.00)
Government > Commerce (0.93)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.46)

Add feedback

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

Chen, Qi, Zhang, Bowen, Wang, Gang, Wu, Qi

arXiv.org Artificial IntelligenceOct-9-2024

While advancements in NLP have significantly improved the performance of Large Language Models (LLMs) on tasks requiring vertical thinking, their lateral thinking capabilities remain under-explored and challenging to measure due to the complexity of assessing creative thought processes and the scarcity of relevant data. To address these challenges, we introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit LAteral Thinking of LLMs. This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation, which often necessitates a stronger evaluation model. This framework simulates an interactive game where the model (player) asks the evaluation model (judge) questions about an incomplete story to infer the full scenario. The judge answers based on a detailed reference scenario or evaluates if the player's predictions align with the reference one. This approach lessens dependence on more robust evaluation models, enabling the assessment of state-of-the-art LLMs. The experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy, achieving over 80% agreement-similar to the agreement levels among humans. Furthermore, applying data and reasoning processes from our benchmark to other lateral thinking-related benchmarks, e.g., RiddleSense and BrainTeaser, leads to performance enhancements. This suggests that our benchmark effectively evaluates and elicits the lateral thinking abilities of LLMs. Code is available at: https://github.com/chenqi008/LateralThinking.

benchmark, llm, puzzle, (14 more...)

arXiv.org Artificial Intelligence

2410.06733

Genre: Research Report (1.00)

Industry: Leisure & Entertainment > Games > Computer Games (0.34)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Add feedback

Filters

Collaborating Authors

lateral

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles

914b372356d58d9e9357b29332cb8fdc-Paper-Conference.pdf

914b372356d58d9e9357b29332cb8fdc-Paper-Conference.pdf

Lateral Tree-of-Thoughts Surpasses ToT by Incorporating Logically-Consistent, Low-Utility Candidates

FlowDrive: moderated flow matching with data balancing for trajectory planning

Evaluation Hallucination in Multi-Round Incomplete Information Lateral-Driven Reasoning Tasks

Solving Situation Puzzles with Large Language Model and External Reformulation

Investigating the effect of CPT in lateral spreading prediction using Explainable AI

Thinking Fast and Laterally: Multi-Agentic Approach for Reasoning about Uncertain Emerging Events

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles