Goto

Collaborating Authors

 Commonsense Reasoning


Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

arXiv.org Artificial Intelligence

Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.



model on any particular supervised task). We compared with GPT-2 (345M) on the Winograd Schema Challenge

Neural Information Processing Systems

Interesting to see how well the proposed model would do under such zero-shot setup (i.e. GPT -2 accuracy is taken from their paper. The BERT paper reports that BooksCorpus and Wikipedia contain 0.8B and 2.5B words, respectively. For our processed data, BooksCorpus and Wikipedia contain 0.75B and 2B words, respectively. The implementation is the same as word embedding, i.e., a lookup "Segment 1", and "Segment 2") and feed it to model input, which indicates the segment of input tokens.


A Additional prompt data details

Neural Information Processing Systems

Desination will be a red barn on the right 1. Continued on next page 18 Use Case Example rewrite Rewrite the following text to be more light-hearted: -- {very formal text} -- chat The following is a conversation with an AI assistant.




Coherent Multimodal Reasoning with Iterative Self-Evaluation for Vision-Language Models

arXiv.org Artificial Intelligence

--Despite significant advancements, current large language models (LLMs) and vision-language models (L VLMs) continue to struggle with complex, multi-step, cross-modal common sense reasoning tasks, often exhibiting a lack of "deliberative thinking." They tend to rely on superficial associations rather than deep, chained inference, particularly when integrating visual information with abstract concepts. T o address this, we propose the Coherent Multimodal Reasoning Framework (CMRF), a novel approach that enhances L VLMs' common sense reasoning capabilities through an iterative, self-evaluating inference mechanism. CMRF mimics human problem-solving by decomposing complex queries, generating step-by-step inferences, and self-correcting errors. Coupled with an Adaptive Iterative Refinement strategy, CMRF systematically refines its reasoning paths. Built upon LLaV A-1.6-34B and trained on a novel Multimodal Daily Activity Reasoning (MDAR) dataset, CMRF achieves state-of-the-art performance among open-source L VLMs on challenging benchmarks like VCR, A-OKVQA, and DailyLife-MRC. Extensive ablation studies and human evaluations confirm the critical contributions of each module and the effectiveness of iterative refinement in fostering more coherent and accurate reasoning. The remarkable advancements in large language models (LLMs) [1], [2] and vision-language models (L VLMs) have revolutionized various aspects of artificial intelligence, demonstrating unprecedented capabilities in understanding, generating, and processing information across modalities [3]. These models excel in tasks ranging from complex question answering to creative content generation, largely due to their extensive pre-training on vast amounts of data.


AutoMixer: Checkpoint Artifacts as Automatic Data Mixers

arXiv.org Artificial Intelligence

In language model training, it is desirable to equip models with capabilities from various tasks. However, it is not clear how to directly obtain the right data mixtures for these capabilities as the relationship between data and tasks is difficult to be modeled. In this work, we observe that checkpoint models exhibit emerging capabilities at different points in the training trajectory. Often, the training process saves checkpoints as artifacts that are under-utilized as a source of in-training data signals. We identify these artifact models based on their respective capabilities on the benchmarks and leverage them as data mixers by using their aggregated first-order influence approximation over source data. We demonstrated on eight reasoning benchmarks that the proposed framework shows significant improvements in the pretraining setting, with performance improvements of up to 1.93%. Overall, this shows the potential of checkpoint models to enhance data quality and optimize data mixtures.


Towards Higher Effective Rank in Parameter-efficient Fine-tuning using Khatri--Rao Product

arXiv.org Artificial Intelligence

Parameter-efficient fine-tuning (PEFT) has become a standard approach for adapting large pre-trained models. Amongst PEFT methods, low-rank adaptation (LoRA) has achieved notable success. However, recent studies have highlighted its limitations compared against full-rank alternatives, particularly when applied to multimodal and large language models. In this work, we present a quantitative comparison amongst full-rank and low-rank PEFT methods using a synthetic matrix approximation benchmark with controlled spectral properties. Our results confirm that LoRA struggles to approximate matrices with relatively flat spectrums or high frequency components -- signs of high effective ranks. To this end, we introduce KRAdapter, a novel PEFT algorithm that leverages the Khatri-Rao product to produce weight updates, which, by construction, tends to produce matrix product with a high effective rank. We demonstrate performance gains with KRAdapter on vision-language models up to 1B parameters and on large language models up to 8B parameters, particularly on unseen common-sense reasoning tasks. In addition, KRAdapter maintains the memory and compute efficiency of LoRA, making it a practical and robust alternative to fine-tune billion-scale parameter models.


A Human-in-the-loop Approach to Robot Action Replanning through LLM Common-Sense Reasoning

arXiv.org Artificial Intelligence

To facilitate the wider adoption of robotics, accessible programming tools are required for non-experts. Observational learning enables intuitive human skills transfer through hands-on demonstrations, but relying solely on visual input can be inefficient in terms of scalability and failure mitigation, especially when based on a single demonstration. This paper presents a human-in-the-loop method for enhancing the robot execution plan, automatically generated based on a single RGB video, with natural language input to a Large Language Model (LLM). By including user-specified goals or critical task aspects and exploiting the LLM common-sense reasoning, the system adjusts the vision-based plan to prevent potential failures and adapts it based on the received instructions. Experiments demonstrated the framework intuitiveness and effectiveness in correcting vision-derived errors and adapting plans without requiring additional demonstrations. Moreover, interactive plan refinement and hallucination corrections promoted system robustness.