The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models

Roll, Nathan, Kries, Jill, Jin, Flora, Wang, Catherine, Finley, Ann Marie, Sumner, Meghan, Shain, Cory, Gwilliams, Laura

arXiv.org Artificial Intelligence

Large language models (LLMs) have emerged as a candidate "model organism" for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB's design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen's kappa = 0.255 for model-consensus agreement vs. 0.286 for human-human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
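The abstract reports inter-rater reliability via Cohen's kappa. As a toy illustration of that statistic (plain, unweighted kappa; the paper's prevalence-weighted variant adjusts for label imbalance), here is a minimal sketch with made-up rating data:

```python
from collections import Counter

def cohens_kappa(a, b):
    # Plain (unweighted) Cohen's kappa between two raters' label sequences:
    # chance-corrected agreement, (p_o - p_e) / (1 - p_e).
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement under independent raters with these marginals.
    expected = sum(ca[k] * cb.get(k, 0) for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical ratings from two raters on an 8-item subtest (0-2 scale).
rater1 = [2, 2, 1, 0, 2, 1, 1, 0]
rater2 = [2, 1, 1, 0, 2, 2, 1, 0]
print(round(cohens_kappa(rater1, rater2), 3))  # → 0.619
```

The ratings here are invented for illustration; they are not TAB data.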


A Derivations of Variational Inference and ELBO; A.1 Derivation of the optimal q(·)

Neural Information Processing Systems

We expand Eq. 10 accordingly; there are three KL divergence terms in our training objective ELBO. For the Yelp Medium and Yelp Large datasets, we follow Guu et al. (2018) in using a three-layer attentional LSTM; skip connections are also used between adjacent LSTM layers. We apply the annealing and free-bits techniques of Li et al. (2019) to the KL term on the prototype variable. As in Section 4.3, we show more generated examples through interpolation on the MSCOCO dataset. Table 6: Qualitative examples from the MSCOCO dataset on interpolated sentence generation given the prototype.



Scaling can lead to compositional generalization

Redhardt, Florian, Akram, Yassir, Schug, Simon

arXiv.org Artificial Intelligence

Can neural networks systematically capture discrete, compositional task structure despite their continuous, distributed nature? The impressive capabilities of large-scale neural networks suggest that the answer to this question is yes. However, even for the most capable models, there are still frequent failure cases that raise doubts about their compositionality. Here, we seek to understand what it takes for a standard neural network to generalize over tasks that share compositional structure. We find that simply scaling data and model size leads to compositional generalization. We show that this holds across different task encodings as long as the training distribution sufficiently covers the task space. In line with this finding, we prove that standard multilayer perceptrons can approximate a general class of compositional task families to arbitrary precision using only a linear number of neurons with respect to the number of task modules. Finally, we uncover that if networks successfully compositionally generalize, the constituents of a task can be linearly decoded from their hidden activations. We show that this metric correlates with failures of text-to-image generation models to compose known concepts.
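As a toy illustration of the setting this abstract describes (not the paper's actual experiments), a compositional task family can be built from a small library of module functions: the number of distinct tasks grows combinatorially while the number of modules grows only linearly, which is what makes compositional generalization attractive. All names below are invented for the sketch:

```python
import itertools

# Hypothetical module library: each "task" composes K modules drawn from it,
# so the task space has |modules|**K members from only |modules| primitives.
modules = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
    "neg": lambda x: -x,
}

def run_task(task, x):
    # A task is a tuple of module names, applied left to right.
    for name in task:
        x = modules[name](x)
    return x

tasks = list(itertools.product(modules, repeat=2))
print(len(tasks))                    # 9 tasks from just 3 modules
print(run_task(("inc", "dbl"), 3))   # (3 + 1) * 2 = 8
```

A network that generalizes compositionally would, after training on a subset of such tasks, handle unseen module combinations; the paper's linear-decoding metric asks whether the task's constituent modules are recoverable from hidden activations.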





R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Zhou, Hengguang, Li, Xirui, Wang, Ruochen, Cheng, Minhao, Zhou, Tianyi, Hsieh, Cho-Jui

arXiv.org Artificial Intelligence

The recent DeepSeek-R1 demonstrated how reinforcement learning with a simple rule-based reward can enable the autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning have often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning using only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately 30% and exceeding both SFT settings by ~2%. In addition, we share our failed attempts and insights from trying to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations are: (1) applying RL to an instruct model often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective at eliciting reasoning capabilities.
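For readers unfamiliar with the "simple rule-based reward" idea, a minimal sketch in the R1 style is below. The tag format, weights, and scoring rules are assumptions for illustration, not the reward used in this report:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    # Hypothetical rule-based reward: a small bonus for following the
    # <think>...</think><answer>...</answer> format, plus a larger bonus
    # when the extracted answer exactly matches the gold answer.
    reward = 0.0
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", response, re.S):
        reward += 0.5  # format reward
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if m and m.group(1).strip() == gold_answer.strip():
        reward += 1.0  # accuracy reward
    return reward

resp = "<think>2 + 2 is 4</think><answer>4</answer>"
print(rule_based_reward(resp, "4"))  # → 1.5
```

No learned reward model is involved, which is the point: the reasoning behavior emerges from RL against these verifiable rules alone.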


Thorns and Algorithms: Navigating Generative AI Challenges Inspired by Giraffes and Acacias

Hussain, Waqar

arXiv.org Artificial Intelligence

The interplay between humans and Generative AI (Gen AI) draws an insightful parallel with the dynamic relationship between giraffes and acacias on the African Savannah. Just as giraffes navigate the acacia's thorny defenses to gain nourishment, humans engage with Gen AI, maneuvering through ethical and operational challenges to harness its benefits. This paper explores how, like young giraffes that are still mastering their environment, humans are in the early stages of adapting to and shaping Gen AI. It delves into the strategies humans are developing and refining to help mitigate risks such as bias, misinformation, and privacy breaches, strategies that in turn influence and shape Gen AI's evolution. While the giraffe-acacia analogy aptly frames human-AI relations, it contrasts nature's evolutionary perfection with the inherent flaws of human-made technology and the tendency of humans to misuse it, giving rise to many ethical dilemmas. Through the HHH framework we identify pathways to embed values of helpfulness, honesty, and harmlessness in AI development, fostering safety-aligned agents that resonate with human values. This narrative presents a cautiously optimistic view of human resilience and adaptability, illustrating our capacity to harness technologies and implement safeguards effectively, without succumbing to their perils. It emphasises a symbiotic relationship where humans and AI continually shape each other for mutual benefit.


Joint Visual and Text Prompting for Improved Object-Centric Perception with Multimodal Large Language Models

Jiang, Songtao, Zhang, Yan, Zhou, Chenyi, Jin, Yeying, Feng, Yang, Wu, Jian, Liu, Zuozhu

arXiv.org Artificial Intelligence

Multimodal Large Language Models (MLLMs) such as GPT-4V and Gemini Pro face challenges in achieving human-level perception in Visual Question Answering (VQA), particularly in object-oriented perception tasks which demand fine-grained understanding of object identities, locations or attributes, as indicated by empirical findings. This is mainly due to their limited capability to effectively integrate complex visual cues with textual information and potential object hallucinations. In this paper, we present a novel approach, Joint Visual and Text Prompting (VTPrompt), that employs fine-grained visual information to enhance the capability of MLLMs in VQA, especially for object-oriented perception. VTPrompt merges visual and text prompts to extract key concepts from textual questions and employs a detection model to highlight relevant objects as visual prompts in images. The processed images alongside text prompts are subsequently fed into MLLMs to produce more accurate answers. Our experiments with GPT-4V and Gemini Pro, on three benchmarks, i.e., MME, MMB and POPE, demonstrate significant improvements. Particularly, our method led to a score improvement of up to 183.5 for GPT-4V on MME and enhanced MMB performance by 8.17% for GPT-4V and 15.69% for Gemini Pro.
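The three-stage data flow described in this abstract (concept extraction from the question, detection-based visual prompting, final MLLM query) can be sketched as below. Every function here is a deterministic stand-in: a real system would call an MLLM and a detection model, and the stubs only make the pipeline's structure explicit:

```python
def extract_key_concepts(question: str) -> list[str]:
    # Stand-in for the text-prompted key-concept extraction step;
    # a real system would query an MLLM here.
    known = {"cat", "dog", "ball"}  # hypothetical concept vocabulary
    return [w.strip("?.") for w in question.lower().split()
            if w.strip("?.") in known]

def detect_and_mark(image: str, concepts: list[str]) -> str:
    # Stand-in for the detector that highlights the relevant objects
    # in the image as visual prompts (boxes, marks, etc.).
    return f"{image}+marks({','.join(concepts)})"

def answer(question: str, image: str) -> str:
    concepts = extract_key_concepts(question)
    marked = detect_and_mark(image, concepts)
    # Stand-in for the final MLLM call on the marked image + text prompt.
    return f"MLLM({marked!r}, {question!r})"

print(answer("Where is the cat?", "img.png"))
```

The design point is that the detector grounds the question's key objects before the MLLM answers, which is what targets the object-hallucination failure mode the abstract describes.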


VisPercep: A Vision-Language Approach to Enhance Visual Perception for People with Blindness and Low Vision

Hao, Yu, Yang, Fan, Huang, Hao, Yuan, Shuaihang, Rangan, Sundeep, Rizzo, John-Ross, Wang, Yao, Fang, Yi

arXiv.org Artificial Intelligence

People with blindness and low vision (pBLV) encounter substantial challenges when it comes to comprehensive scene recognition and precise object identification in unfamiliar environments. Additionally, due to vision loss, pBLV have difficulty accessing and identifying potential tripping hazards on their own. In this paper, we present a pioneering approach that leverages a large vision-language model to enhance visual perception for pBLV, offering detailed and comprehensive descriptions of the surrounding environment and providing warnings about potential risks. Our method begins by leveraging a large image tagging model (i.e., Recognize Anything (RAM)) to identify all common objects present in the captured images. The recognition results and user query are then integrated into a prompt, tailored specifically for pBLV using prompt engineering. By combining the prompt and input image, a large vision-language model (i.e., InstructBLIP) generates detailed and comprehensive descriptions of the environment and identifies potential risks by analyzing the environmental objects and scenes relevant to the prompt. We evaluate our approach through experiments conducted on both indoor and outdoor datasets. Our results demonstrate that our method is able to recognize objects accurately and provide insightful descriptions and analysis of the environment for pBLV.