pre-training data
Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation
Vision-Language Pre-training has demonstrated its remarkable zero-shot recognition ability and potential to learn generalizable visual representations from language supervision. Taking a step ahead, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state-of-the-art suffers from clear semantic gaps between visual and textual modality: plenty of visual concepts appeared in images are missing in their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense predictions due to insufficient visual concepts captured in textual representations. To close such semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zeroshot transfer performance and greatly boosts language-supervised segmentation baseline by a large margin, suggesting the value of bridging semantic gap in pretraining data.
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis
Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is likely to be approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives, while less influential factors consist of dataset quality and model FLOPs. Finally, we explore whether widely used regularization can alleviate multi-epoch degradation. Most regularization techniques do not yield significant improvements, except for dropout, which demonstrates remarkable effectiveness but requires careful tuning when scaling up the model size. Additionally, we discover that leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.
Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data.In this study, we introduce \textbf{Cross-Care}, the first benchmark framework dedicated to assessing biases and real world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups.We systematically evaluate how demographic biases embedded in pre-training corpora like $ThePile$ influence the outputs of LLMs.We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups.Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs.Furthermore, we observe that various alignment methods minimally resolve inconsistencies in the models' representation of disease prevalence across different languages.For further exploration and analysis, we make all data and a data visualization tool available at: \url{www.crosscare.net}.
Scaling HuBERT for African Languages: From Base to Large and XL
Caubriรจre, Antoine, Gauthier, Elodie
Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see https://huggingface.co/collections/Orange/african-speech-foundation-models. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models
Xiang, Yang, Ji, Yixin, Li, Juntao, Zhang, Min
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs. However, existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.
Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM
AlOtaibi, Areej, Alyahya, Lina, Alshabanah, Raghad, Alfawzan, Shahad, Alarefei, Shuruq, Alsabti, Reem, Alsubaie, Nouf, Alhuzaymi, Abdulaziz, Alkhelb, Lujain, Alsayari, Majd, Alahmed, Waad, Talabay, Omar, Alowibdi, Jalal, Alelyani, Salem, Bibi, Adel
Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.