Commonsense Reasoning
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization
Foroutan, Negar, Meister, Clara, Paul, Debjit, Niklaus, Joel, Ahmadi, Sina, Bosselut, Antoine, Sennrich, Rico
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with
model on any particular supervised task). We compared with GPT-2 (345M) on the Winograd Schema Challenge
Interesting to see how well the proposed model would do under such zero-shot setup (i.e. GPT -2 accuracy is taken from their paper. The BERT paper reports that BooksCorpus and Wikipedia contain 0.8B and 2.5B words, respectively. For our processed data, BooksCorpus and Wikipedia contain 0.75B and 2B words, respectively. The implementation is the same as word embedding, i.e., a lookup "Segment 1", and "Segment 2") and feed it to model input, which indicates the segment of input tokens.
Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models
Luo, Wenjie, Li, Ruocheng, Zhu, Shanshan, Perry, Julian
--Despite significant advancements, current large language models (LLMs) and vision-language models (L VLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. T o address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances L VLMs' common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaV A-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source L VLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning. The remarkable advancements in large language models (LLMs) [1], [2] and vision-language models (L VLMs) have revolutionized various aspects of artificial intelligence, demonstrating unprecedented capabilities in understanding, generating, and processing information across modalities [3]. These models excel in tasks ranging from complex question answering to creative content generation, largely due to their extensive pre-training on vast amounts of data.
AutoMixer: Checkpoint Artifacts as Automatic Data Mixers
Chang, Ernie, Li, Yang, Huber, Patrick, Vogeti, Vish, Kant, David, Shi, Yangyang, Chandra, Vikas
In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities as the relationship between data and tasks is difficult to be modeled. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrated on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.
Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product
Albert, Paul, Zhang, Frederic Z., Saratchandran, Hemanth, Hengel, Anton van den, Abbasnejad, Ehsan
Parameter-efficient fine-tuning (PEFT) has become a standard approach for adapting large pre-trained models. Amongst PEFT methods, low-rank adaptation (LoRA) has achieved notable success. However, recent studies have highlighted its limitations compared against full-rank alternatives, particularly when applied to multimodal and large language models. In this work, we present a quantitative comparison amongst full-rank and low-rank PEFT methods using a synthetic matrix approximation benchmark with controlled spectral properties. Our results confirm that LoRA struggles to approximate matrices with relatively flat spectrums or high frequency components -- signs of high effective ranks. To this end, we introduce KRAdapter, a novel PEFT algorithm that leverages the Khatri-Rao product to produce weight updates, which, by construction, tends to produce matrix product with a high effective rank. We demonstrate performance gains with KRAdapter on vision-language models up to 1B parameters and on large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks. In addition, KRAdapter maintains the memory and compute efficiency of LoRA, making it a practical and robust alternative to fine-tune billion-scale parameter models.
A Human-in-the-loop Approach to Robot Action Replanning through LLM Common-Sense Reasoning
Merlo, Elena, Lagomarsino, Marta, Ajoudani, Arash
To facilitate the wider adoption of robotics, accessible programming tools are required for non-experts. Observational learning enables intuitive human skills transfer through hands-on demonstrations, but relying solely on visual input can be inefficient in terms of scalability and failure mitigation, especially when based on a single demonstration. This paper presents a human-in-the-loop method for enhancing the robot execution plan, automatically generated based on a single RGB video, with natural language input to a Large Language Model (LLM). By including user-specified goals or critical task aspects and exploiting the LLM common-sense reasoning, the system adjusts the vision-based plan to prevent potential failures and adapts it based on the received instructions. Experiments demonstrated the framework intuitiveness and effectiveness in correcting vision-derived errors and adapting plans without requiring additional demonstrations. Moreover, interactive plan refinement and hallucination corrections promoted system robustness.