AITopics | riddle

riddle

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Retaining Knowledge for Learning with Dynamic Definition

Neural Information Processing Systems, Mar-19-2026, 05:42:52 GMT

Machine learning models are often deployed in settings where they must be constantly updated in response to changes in class definitions while retaining high accuracy on previously learned definitions. A classical use case is fraud detection, where new fraud schemes come one after another. While such an update can be accomplished by re-training on the complete data, the process is inefficient and prevents real-time and on-device learning. On the other hand, efficient methods that incrementally learn from new data often result in the forgetting of previously learned knowledge. We define this problem as Learning with Dynamic Definition (LDD) and demonstrate that popular models, such as the Vision Transformer and RoBERTa, exhibit substantial forgetting of past definitions. We present the first practical and provable solution to LDD. Our proposal is a hash-based sparsity model, RIDDLE, that handles evolving definitions by associating samples only with relevant parameters. We prove that our model is a universal function approximator and theoretically bound the knowledge lost during the update process. On practical tasks with evolving class definitions in vision and natural language processing, RIDDLE outperforms baselines by up to 30% on the original dataset while providing competitive accuracy on the update dataset.

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models

Neural Information Processing Systems, Feb-18-2026, 19:28:39 GMT

Imagine observing someone scratching their arm; to understand why, additional context would be necessary. However, spotting a mosquito nearby would immediately offer a likely explanation for the person's discomfort, thereby alleviating

large language model, machine learning, natural language, (17 more...)

Neural Information Processing Systems

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
North America > United States > New York (0.04)
North America > United States > California > Los Angeles County > Long Beach (0.04)
(4 more...)

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

Sefat, Abdullah Al

arXiv.org Artificial Intelligence, Nov-27-2025

Large language models excel on broad multilingual benchmarks but have yet to be evaluated extensively on figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resource language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.20399

Country:

Europe (1.00)
Asia (0.93)
North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Industry: Education (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.72)

The Riddle of Reflection: Evaluating Reasoning and Self-Awareness in Multilingual LLMs using Indian Riddles

M, Abhinav P, Saxena, Ojasva, C, Oswald, Krishnamurthy, Parameswari

arXiv.org Artificial Intelligence, Nov-5-2025

The extent to which large language models (LLMs) can perform culturally grounded reasoning across non-English languages remains underexplored. This paper examines the reasoning and self-assessment abilities of LLMs across seven major Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Tamil, and Telugu. We introduce a multilingual riddle dataset combining traditional riddles with context-reconstructed variants and evaluate five LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, Mistral-Saba, LLaMA 4 Scout, and LLaMA 4 Maverick) under seven prompting strategies. In the first stage, we assess riddle-solving performance and find that while Gemini 2.5 Pro performs best overall, few-shot methods yield only marginal gains, and accuracy varies notably across languages. In the second stage, we conduct a self-evaluation experiment to measure reasoning consistency. The results reveal a key finding: a model's initial accuracy is inversely correlated with its ability to identify its own mistakes. Top-performing models such as Gemini 2.5 Pro are overconfident (4.34% True Negative Rate), whereas lower-performing models like LLaMA 4 Scout are substantially more self-aware (42.09% True Negative Rate). These results point to clear gaps in multilingual reasoning and highlight the need for models that not only reason effectively but also recognize their own limitations.
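The True Negative Rate quoted in this abstract can be illustrated with a short sketch. It assumes (this reading is not confirmed by the abstract) that a "negative" is an answer that was actually wrong, and a "true negative" is a wrong answer that the model itself flags as wrong during self-evaluation. The function name and the sample data are hypothetical.

```python
from typing import List, Optional


def true_negative_rate(was_correct: List[bool],
                       judged_correct: List[bool]) -> Optional[float]:
    """Share of actually-wrong answers that the model also judged wrong.

    Assumption: was_correct[i] is the ground-truth outcome for answer i and
    judged_correct[i] is the model's own verdict on that same answer.
    Returns None when there are no wrong answers to evaluate.
    """
    # Keep only the self-judgments made on answers that were really wrong.
    verdicts_on_wrong = [j for c, j in zip(was_correct, judged_correct) if not c]
    if not verdicts_on_wrong:
        return None
    true_negatives = sum(1 for j in verdicts_on_wrong if not j)
    return true_negatives / len(verdicts_on_wrong)


# Hypothetical data: four wrong answers, only one of which the model admits.
rate = true_negative_rate(
    [True, False, False, False, False],
    [True, True, True, False, True],
)
```

Under this reading, a low rate means the model rarely recognizes its own errors, which is the overconfidence pattern the abstract describes.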

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2511.0096

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Don't throw the baby out with the bathwater: How and why deep learning for ARC

Cole, Jack, Osman, Mohamed

arXiv.org Artificial Intelligence, Nov-3-2025

The Abstraction and Reasoning Corpus (ARC-AGI) presents a formidable challenge for AI systems. Despite the typically low performance on ARC, the deep learning paradigm remains the most effective known strategy for generating skillful (state-of-the-art) neural networks (NN) across varied modalities and tasks in vision, language, etc. The deep learning paradigm has proven to be able to train these skillful neural networks and learn the abstractions needed in these diverse domains. Our work doubles down on that and continues to leverage this paradigm by incorporating on-the-fly NN training at test time. We demonstrate that fully committing to deep learning's capacity to acquire novel abstractions yields state-of-the-art performance on ARC. Specifically, we treat both the neural network and the optimizer (rather than just a pre-trained network) as integral components of the inference process, fostering generalization to unseen tasks. Concretely, we propose a methodology for training on ARC, starting from pretrained LLMs, and enhancing their ARC reasoning. We also propose Test-Time Fine-Tuning (TTFT) and the Augment Inference Reverse-Augmentation and Vote (AIRV) as effective test-time techniques. We are the first to propose and show deep learning can be used effectively for ARC, showing boosts of up to 260% in accuracy with AIRV and a further 300% boost with TTFT. An early version of this approach secured first place in the 2023 ARCathon competition, while the final version achieved the current best score on the ARC private test-set (58%). Our findings highlight the key ingredients of a robust reasoning system in unfamiliar domains, underscoring the central mechanisms that improve broad perceptual reasoning.
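The AIRV idea named in this abstract (augment the input, run inference, reverse the augmentation, then vote) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: it assumes rotations by multiples of 90 degrees as the sole augmentation, a hypothetical `model` callable that maps a grid to a grid, and a simple majority vote.

```python
from collections import Counter
from typing import Callable, List

Grid = List[List[int]]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate(grid: Grid, k: int) -> Grid:
    """Rotate clockwise k times; negative k rotates counter-clockwise."""
    for _ in range(k % 4):
        grid = rotate90(grid)
    return grid


def airv_predict(model: Callable[[Grid], Grid], grid: Grid) -> Grid:
    """Augment, infer, reverse-augment, and vote (rotation-only sketch)."""
    votes: Counter = Counter()
    for k in range(4):
        augmented = rotate(grid, k)        # augment the input
        prediction = model(augmented)      # inference with the hypothetical model
        restored = rotate(prediction, -k)  # undo the augmentation on the output
        votes[tuple(map(tuple, restored))] += 1
    winner, _ = votes.most_common(1)[0]    # majority vote over restored outputs
    return [list(row) for row in winner]


def flaky_identity(g: Grid) -> Grid:
    """Toy model: copies its input but fails when the top-left cell is 3."""
    if g[0][0] == 3:
        return [[0] * len(g[0]) for _ in g]
    return [list(row) for row in g]


result = airv_predict(flaky_identity, [[1, 2], [3, 4]])
```

In the toy run, one of the four rotated views triggers the failure, and the other three agree after their rotations are undone, so the vote recovers the intended output.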

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2506.14276

Country: North America (0.28)

Genre: Research Report > New Finding (0.48)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

Mukhopadhyay, Souradeep, Baral, Rishabh, Mahajan, Nimeesh, Harish, Samhitha, RRV, Aswin, Parmar, Mihir, Nakamura, Mutsumi, Baral, Chitta

arXiv.org Artificial Intelligence, Oct-15-2025

Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles, but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we asked: Have LLMs addressed these issues? To what extent? How about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these things systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode, phantom recall, where models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift, highlighting the gap between linguistic fluency and logical understanding.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.11812

Genre: Research Report > New Finding (0.88)

Industry: Leisure & Entertainment > Games (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

fbf5efe979e6754dc06a0869233f2510-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Oct-10-2025, 22:20:04 GMT

benchmark, riddle, visual riddle, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > New York (0.14)
Asia > Thailand > Krabi > Krabi (0.04)
Oceania > New Zealand (0.04)
(10 more...)

Genre: Research Report > New Finding (0.93)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

Adaptive Originality Filtering: Rejection Based Prompting and RiddleScore for Culturally Grounded Multilingual Riddle Generation

Le, Duy, Ziti, Kent, Girard-Sun, Evan, Bouhaya, Bakr, O'Brien, Sean, Sharma, Vasu, Zhu, Kevin

arXiv.org Artificial Intelligence, Oct-10-2025

Language models are increasingly tested on multilingual creativity, demanding culturally grounded, abstract generations. Standard prompting methods often produce repetitive or shallow outputs. We introduce Adaptive Originality Filtering (AOF), a prompting strategy that enforces novelty and cultural fidelity via semantic rejection. To assess quality, we propose RiddleScore, a metric combining novelty, diversity, fluency, and answer alignment. AOF improves Distinct-2 (0.915 in Japanese), reduces Self-BLEU (0.177), and raises RiddleScore (up to +57.1% in Arabic). Human evaluations confirm fluency, creativity, and cultural fit gains. However, improvements vary: Arabic shows greater RiddleScore gains than Distinct-2; Japanese sees similar changes. Though focused on riddles, our method may apply to broader creative tasks. Overall, semantic filtering with composite evaluation offers a lightweight path to culturally rich generation without fine-tuning.
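For readers unfamiliar with Distinct-2, which this abstract reports, here is a minimal sketch assuming the commonly used definition: the number of unique bigrams divided by the total number of bigrams across a set of generations. The function name `distinct_n` is illustrative, and whitespace tokenization is an assumption that would not suit unsegmented languages such as Japanese without a proper tokenizer; the paper's exact procedure may differ.

```python
from typing import Iterable, List, Tuple


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all texts.

    Assumes whitespace tokenization; returns 0.0 when no n-grams exist.
    """
    ngrams: List[Tuple[str, ...]] = []
    for text in texts:
        tokens = text.split()
        # Slide a window of size n over the token list.
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Hypothetical generations: the bigram ("a", "b") repeats, so 3 of 4 are unique.
score = distinct_n(["a b c", "a b d"])
```

Higher values indicate less repetition across outputs, which is why the metric is used as a diversity signal alongside Self-BLEU.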

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2508.18709

Country:

Europe (0.45)
North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Education (0.46)
Media > News (0.46)
Information Technology > Security & Privacy (0.45)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

ICL Optimized Fragility

Wannaz, Serena Gomez

arXiv.org Artificial Intelligence, Oct-2-2025

ICL guides are known to improve task-specific performance, but their impact on cross-domain cognitive abilities remains unexplored. This study examines how ICL guides affect reasoning across different knowledge domains using six variants of the GPT-OSS:20b model: one baseline model and five ICL configurations (simple, chain-of-thought, random, appended text, and symbolic language). The models were subjected to 840 tests spanning general knowledge questions, logic riddles, and a mathematical olympiad problem. Statistical analysis (ANOVA) revealed significant behavioral modifications (p < 0.001) across ICL variants, demonstrating a phenomenon termed "optimized fragility." ICL models achieved 91%-99% accuracy on general knowledge tasks while showing degraded performance on complex reasoning problems, with accuracy dropping to 10-43% on riddles compared to 43% for the baseline model. Notably, no significant differences emerged on the olympiad problem (p=0.2173), suggesting that complex mathematical reasoning remains unaffected by ICL optimization. These findings indicate that ICL guides create systematic trade-offs between efficiency and reasoning flexibility, with important implications for LLM deployment and AI safety.

accuracy, artificial intelligence, machine learning, (16 more...)

arXiv.org Artificial Intelligence

2510.003

Country: North America > United States (0.15)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study > Negative Result (0.48)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Cognitive Science (0.67)
