The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

Roll, Nathan, Kries, Jill, Jin, Flora, Wang, Catherine, Finley, Ann Marie, Sumner, Meghan, Shain, Cory, Gwilliams, Laura

arXiv.org Artificial Intelligence

Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model-consensus agreement vs. 0.286 for human-human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
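The abstract reports inter-rater reliability via Cohen's kappa. As a toy illustration of that statistic (plain, unweighted kappa; the paper's prevalence-weighted variant adjusts for label imbalance), here is a minimal sketch with made-up rating data:

```python
from collections import Counter

def cohens_kappa(a, b):
    # Plain (unweighted) Cohen's kappa between two raters' label sequences:
    # chance-corrected agreement, (p_o - p_e) / (1 - p_e).
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement under independent raters with these marginals.
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings from two raters on an 8-item subtest (0-2 scale).
rater1 = [2, 2, 1, 0, 2, 1, 1, 0]
rater2 = [2, 1, 1, 0, 2, 2, 1, 0]
print(round(cohens_kappa(rater1, rater2), 3))  # → 0.619
```

The ratings here are invented for illustration; they are not TAB data.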


A Derivations of Variational Inference and ELBO; A.1 Derivation of the optimal q(·)

Neural Information Processing Systems

We expand Eq. 10 accordingly; there are three KL divergence terms in our training objective ELBO. For the Yelp Medium and Yelp Large datasets, we follow Guu et al. (2018) in using a three-layer attentional LSTM; skip connections are also used between adjacent LSTM layers. We apply the annealing and free-bits techniques of Li et al. (2019) to the KL term on the prototype variable. As in Section 4.3, we show more generated examples through interpolation on the MSCOCO dataset. Table 6: Qualitative examples from the MSCOCO dataset on interpolated sentence generation given the prototype.



Scaling can lead to compositional generalization

Redhardt, Florian, Akram, Yassir, Schug, Simon

arXiv.org Artificial Intelligence

Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even for the most capable models, there are still frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.
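As a toy illustration of the setting this abstract describes (not the paper's actual experiments), a compositional task family can be built from a small library of module functions: the number of distinct tasks grows combinatorially while the number of modules grows only linearly, which is what makes compositional generalization attractive. All names below are invented for the sketch:

```python
import itertools

# Hypothetical module library: each "task" composes K modules drawn from it,
# so the task space has |modules|**K members from only |modules| primitives.
modules = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
    "neg": lambda x: -x,
}

def run_task(task, x):
    # A task is a tuple of module names, applied left to right.
    for name in task:
        x = modules[name](x)
    return x

tasks = list(itertools.product(modules, repeat=2))
print(len(tasks))                    # 9 tasks from just 3 modules
print(run_task(("inc", "dbl"), 3))   # (3 + 1) * 2 = 8
```

A network that generalizes compositionally would, after training on a subset of such tasks, handle unseen module combinations; the paper's linear-decoding metric asks whether the task's constituent modules are recoverable from hidden activations.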





R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Zhou, Hengguang, Li, Xirui, Wang, Ruochen, Cheng, Minhao, Zhou, Tianyi, Hsieh, Cho-Jui

arXiv.org Artificial Intelligence

The recent DeepSeek-R1 demonstrated how reinforcement learning with a simple rule-based reward can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning using only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by ~2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations are: (1) applying RL to an instruct model often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective at eliciting reasoning capabilities.
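For readers unfamiliar with the "simple rule-based reward" idea, a minimal sketch in the R1 style is below. The tag format, weights, and scoring rules are assumptions for illustration, not the reward used in this report:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    # Hypothetical rule-based reward: a small bonus for following the
    # <think>...</think><answer>...</answer> format, plus a larger bonus
    # when the extracted answer exactly matches the gold answer.
    reward = 0.0
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.S):
        reward += 0.5  # format reward
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if m and m.group(1).strip() == gold_answer.strip():
        reward += 1.0  # accuracy reward
    return reward

resp = "<think>2 + 2 is 4</think><answer>4</answer>"
print(rule_based_reward(resp, "4"))  # → 1.5
```

No learned reward model is involved, which is the point: the reasoning behavior emerges from RL against these verifiable rules alone.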


Thorns and Algorithms: Navigating Generative AI Challenges Inspired by Giraffes and Acacias

Hussain, Waqar

arXiv.org Artificial Intelligence

The interplay between humans and Generative AI (Gen AI) draws an insightful parallel with the dynamic relationship between giraffes and acacias on the African Savannah. Just as giraffes navigate the acacia's thorny defenses to gain nourishment, humans engage with Gen AI, maneuvering through ethical and operational challenges to harness its benefits. This paper explores how, like young giraffes that are still mastering their environment, humans are in the early stages of adapting to and shaping Gen AI. It delves into the strategies humans are developing and refining to help mitigate risks such as bias, misinformation, and privacy breaches, strategies that in turn influence and shape Gen AI's evolution. While the giraffe-acacia analogy aptly frames human-AI relations, it contrasts nature's evolutionary perfection with the inherent flaws of human-made technology and the tendency of humans to misuse it, giving rise to many ethical dilemmas. Through the HHH framework we identify pathways to embed values of helpfulness, honesty, and harmlessness in AI development, fostering safety-aligned agents that resonate with human values. This narrative presents a cautiously optimistic view of human resilience and adaptability, illustrating our capacity to harness technologies and implement safeguards effectively, without succumbing to their perils. It emphasises a symbiotic relationship where humans and AI continually shape each other for mutual benefit.


Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

Jiang, Songtao, Zhang, Yan, Zhou, Chenyi, Jin, Yeying, Feng, Yang, Wu, Jian, Liu, Zuozhu

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks which demand fine-grained understanding of object identities, locations or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception. VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers. Our experiments with GPT-4V and Gemini Pro, on three benchmarks, i.e., MME, MMB and POPE, demonstrate significant improvements. Particularly, our method led to a score improvement of up to 183.5 for GPT-4V on MME and enhanced MMB performance by 8.17% for GPT-4V and 15.69% for Gemini Pro.
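The three-stage data flow described in this abstract (concept extraction from the question, detection-based visual prompting, final MLLM query) can be sketched as below. Every function here is a deterministic stand-in: a real system would call an MLLM and a detection model, and the stubs only make the pipeline's structure explicit:

```python
def extract_key_concepts(question: str) -> list[str]:
    # Stand-in for the text-prompted key-concept extraction step;
    # a real system would query an MLLM here.
    known = {"cat", "dog", "ball"}  # hypothetical concept vocabulary
    return [w.strip("?.") for w in question.lower().split()
            if w.strip("?.") in known]

def detect_and_mark(image: str, concepts: list[str]) -> str:
    # Stand-in for the detector that highlights the relevant objects
    # in the image as visual prompts (boxes, marks, etc.).
    return f"{image}+marks({','.join(concepts)})"

def answer(question: str, image: str) -> str:
    concepts = extract_key_concepts(question)
    marked = detect_and_mark(image, concepts)
    # Stand-in for the final MLLM call on the marked image + text prompt.
    return f"MLLM({marked!r}, {question!r})"

print(answer("Where is the cat?", "img.png"))
```

The design point is that the detector grounds the question's key objects before the MLLM answers, which is what targets the object-hallucination failure mode the abstract describes.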


VisPercep: A Vision-Language Approach to Enhance Visual Perception for People with Blindness and Low Vision

Hao, Yu, Yang, Fan, Huang, Hao, Yuan, Shuaihang, Rangan, Sundeep, Rizzo, John-Ross, Wang, Yao, Fang, Yi

arXiv.org Artificial Intelligence

People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to vision loss, pBLV have difficulty accessing and identifying potential tripping hazards on their own. In this paper, we present a pioneering approach that leverages a large vision-language model to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Our method begins by leveraging a large image tagging model (i.e., Recognize Anything (RAM)) to identify all common objects present in the captured images. The recognition results and user query are then integrated into a prompt, tailored specifically for pBLV using prompt engineering. By combining the prompt and input image, a large vision-language model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing the environmental objects and scenes relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method is able to recognize objects accurately and provide insightful descriptions and analysis of the environment for pBLV.