AITopics | pre-training data

Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

Neural Information Processing Systems | Apr-29-2026, 23:19:24 GMT

Vision-Language Pre-training has demonstrated remarkable zero-shot recognition ability and the potential to learn generalizable visual representations from language supervision. Taking a step further, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, the state of the art suffers from clear semantic gaps between the visual and textual modalities: many visual concepts that appear in images are missing from their paired captions. Such semantic misalignment circulates in pre-training, leading to inferior zero-shot performance in dense prediction because the textual representations capture too few visual concepts. To close this semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually-matched concepts with our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin, suggesting the value of bridging the semantic gap in pre-training data.

large language model, machine learning, segmentation, (17 more...)

Neural Information Processing Systems

Country: Asia > Middle East > Israel (0.15)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.57)

b9e472cd579c83e2f6aa3459f46aac28-Paper-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 19:07:13 GMT

dropout, large language model, machine learning, (17 more...)

Neural Information Processing Systems

Country:

Asia > Singapore (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
Europe > Switzerland > Zürich > Zürich (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Energy (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.99)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Improved Fine-Tuning by Better Leveraging Pre-Training Data

Neural Information Processing Systems | Feb-12-2026, 02:15:55 GMT

As a dominant paradigm, fine-tuning a pre-trained model on the target data is widely used in many deep learning applications, especially for small data sets.

artificial intelligence, machine learning, pre-training data, (17 more...)

Neural Information Processing Systems

Country: Asia > China (0.04)

Genre: Research Report (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

training

Neural Information Processing Systems | Feb-8-2026, 16:46:46 GMT

Traditional approaches focus on aligning models during the instruction tuning or reinforcement learning stages, referred to in this paper as 'post alignment'.

justification, large language model, machine learning, (20 more...)

Neural Information Processing Systems

Country:

Asia > China > Guangdong Province > Shenzhen (0.05)
Asia > Thailand > Bangkok > Bangkok (0.04)
Asia > China > Jiangsu Province > Changzhou (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.50)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis

Neural Information Processing Systems | Dec-26-2025, 15:40:02 GMT

Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is likely to be approaching its scaling limit for LLMs. To further enhance LLMs, a straightforward approach is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects under this approach. First, we explore the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting, leading to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives, while less influential factors consist of dataset quality and model FLOPs. Finally, we explore whether widely used regularization can alleviate multi-epoch degradation. Most regularization techniques do not yield significant improvements, except for dropout, which demonstrates remarkable effectiveness but requires careful tuning when scaling up the model size. Additionally, we discover that leveraging mixture-of-experts (MoE) enables cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.

multi-epoch degradation, name change, scaling llm, (6 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias

Neural Information Processing Systems | Dec-24-2025, 14:32:36 GMT

Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real-world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora like The Pile influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods minimally resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.

artificial intelligence, large language model, natural language, (10 more...)

Neural Information Processing Systems

Genre: Research Report > New Finding (0.60)

Industry: Health & Medicine (0.76)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Scaling HuBERT for African Languages: From Base to Large and XL

Caubrière, Antoine, Gauthier, Elodie

arXiv.org Artificial Intelligence | Dec-1-2025

Despite recent progress in multilingual speech processing, African languages remain under-represented in both research and deployed systems, particularly when it comes to strong, open-weight encoders that transfer well under low-resource supervision. Self-supervised learning has proven especially promising in such settings, yet most publicly released models targeting African speech remain at BASE scale, leaving unanswered whether larger encoders, trained exclusively on Africa-centric audio, offer tangible benefits and how model capacity interacts with data composition. This work addresses that gap by introducing SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a BASE size counterpart. We release these models as open weights: see https://huggingface.co/collections/Orange/african-speech-foundation-models. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, covering automatic speech recognition (ASR) and language identification (LID) tasks, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.

african language, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2511.2337

Country: Africa (0.26)

Genre:

Research Report > Strength High (1.00)
Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Think Before You Prune: Selective Self-Generated Calibration for Pruning Large Reasoning Models

Xiang, Yang, Ji, Yixin, Li, Juntao, Zhang, Min

arXiv.org Artificial Intelligence | Nov-25-2025

Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex reasoning benchmarks. However, their long chain-of-thought reasoning processes incur significant inference overhead. Pruning has emerged as a promising approach to reducing computational costs, yet existing efforts have primarily focused on large language models (LLMs), while pruning LRMs remains unexplored. In this work, we conduct the first empirical study on pruning LRMs and show that directly applying existing pruning techniques fails to yield satisfactory results. Our findings indicate that using self-generated reasoning data for calibration can substantially improve pruning performance. We further investigate how the difficulty and length of reasoning data affect pruning outcomes. Our analysis reveals that challenging and moderately long self-generated reasoning data serve as ideal calibration data. Based on these insights, we propose a Selective Self-Generated Reasoning (SSGR) data construction strategy to provide effective calibration data for pruning LRMs. Experimental results on the DeepSeek-R1-Distill model series validate that our strategy improves the reasoning ability of pruned LRMs by 10%-13% compared to general pruning methods.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.18864

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.34)

Industry: Construction & Engineering (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.38)

Tahakom LLM Guidelines and Recipes: From Pre-training Data to an Arabic LLM

AlOtaibi, Areej, Alyahya, Lina, Alshabanah, Raghad, Alfawzan, Shahad, Alarefei, Shuruq, Alsabti, Reem, Alsubaie, Nouf, Alhuzaymi, Abdulaziz, Alkhelb, Lujain, Alsayari, Majd, Alahmed, Waad, Talabay, Omar, Alowibdi, Jalal, Alelyani, Salem, Bibi, Adel

arXiv.org Artificial Intelligence | Oct-28-2025

Large Language Models (LLMs) have significantly advanced the field of natural language processing, enhancing capabilities in both language understanding and generation across diverse domains. However, developing LLMs for Arabic presents unique challenges. This paper explores these challenges by focusing on critical aspects such as data curation, tokenizer design, and evaluation. We detail our approach to the collection and filtration of Arabic pre-training datasets, assess the impact of various tokenizer designs on model performance, and examine the limitations of existing Arabic evaluation frameworks, for which we propose a systematic corrective methodology. To promote transparency and facilitate collaborative development, we share our data and methodologies, contributing to the advancement of language modeling, particularly for the Arabic language.

large language model, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2510.13481

Country: