data quantity
Granularity__final
We use the iWildCam version 2.0 released in 2021. Figure 14: random examples from the iWildCam train set and the out-of-distribution test set. Figure 15: random examples from the ImageNet ILSVRC 2012 challenge train set [37, 11]; the full training set is notably not class balanced, exhibiting a long-tailed distribution (see Figure 16). Figure 17: random examples from the iNaturalist 2017 challenge train set [46].
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
On the Connection between Pre-training Data Diversity and Fine-tuning Robustness
Pre-training has been widely adopted in deep learning to improve model performance, especially when the training data for a target task is limited. In our work, we seek to understand the implications of this training strategy on the generalization properties of downstream models. More specifically, we ask the following question: how do properties of the pre-training distribution affect the robustness of a fine-tuned model? The properties we explore include the label space, label semantics, image diversity, data domains, and data quantity of the pre-training distribution. We find that the primary factor influencing downstream effective robustness (Taori et al., 2020) is data quantity, while other factors have limited significance. For example, reducing the number of ImageNet pre-training classes by 4x while increasing the number of images per class by 4x (that is, keeping total data quantity fixed) does not impact the robustness of fine-tuned models. We demonstrate our findings on pre-training distributions drawn from various natural and synthetic data sources, primarily using the iWildCam-WILDS distribution shift as a test for robustness.
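To make the fixed-quantity manipulation described in this abstract concrete, below is a minimal Python sketch of how a pre-training subset with fewer classes but more images per class could be constructed; the function name, the class/image counts, and the commented ImageNet variable are hypothetical illustrations, not the authors' released code.

```python
import random
from collections import defaultdict

def subsample_fixed_quantity(samples, num_classes, images_per_class, seed=0):
    """Build a pre-training subset with `num_classes` classes and
    `images_per_class` images each, so the total data quantity stays fixed
    when classes are reduced 4x and images per class are raised 4x.

    `samples` is a list of (image_path, label) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)

    # Keep only classes with enough images, then pick a random subset of classes.
    eligible = [c for c, imgs in by_class.items() if len(imgs) >= images_per_class]
    chosen_classes = rng.sample(eligible, num_classes)

    subset = []
    for c in chosen_classes:
        subset += [(p, c) for p in rng.sample(by_class[c], images_per_class)]
    return subset

# Two pre-training sets with identical total quantity (hypothetical sizes):
# 1000 classes x 250 images vs. 250 classes x 1000 images.
# full_imagenet = [...]  # list of (path, label) pairs
# baseline = subsample_fixed_quantity(full_imagenet, 1000, 250)
# fewer_classes = subsample_fixed_quantity(full_imagenet, 250, 1000)
```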
Are All Marine Species Created Equal? Performance Disparities in Underwater Object Detection
Wille, Melanie, Fischer, Tobias, Raine, Scarlett
Underwater object detection is critical for monitoring marine ecosystems but poses unique challenges, including degraded image quality, imbalanced class distribution, and distinct visual characteristics. Not every species is detected equally well, yet underlying causes remain unclear. We address two key research questions: 1) What factors beyond data quantity drive class-specific performance disparities? 2) How can we systematically improve detection of under-performing marine species? We manipulate the DUO and RUOD datasets to separate the object detection task into localization and classification and investigate the under-performance of the scallop class. Localization analysis using YOLO11 and TIDE finds that foreground-background discrimination is the most problematic stage regardless of data quantity. Classification experiments reveal persistent precision gaps even with balanced data, indicating intrinsic feature-based challenges beyond data scarcity and inter-class dependencies. We recommend imbalanced distributions when prioritizing precision, and balanced distributions when prioritizing recall. Improving under-performing classes should focus on algorithmic advances, especially within localization modules. We publicly release our code and datasets.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study > Negative Result (0.34)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
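The localization-versus-classification split described in the abstract above can be approximated by evaluating a detector class-agnostically. The sketch below, which assumes COCO-format annotation files with hypothetical names, shows one way to collapse all categories into a single "object" class so that only foreground-background discrimination is measured; it is an illustration, not the authors' pipeline.

```python
import json

def make_class_agnostic(coco_json_in, coco_json_out):
    """Collapse all categories into a single 'object' class so that a
    detector trained or evaluated on the result measures pure localization
    (foreground-background discrimination), independent of classification.
    """
    with open(coco_json_in) as f:
        coco = json.load(f)

    coco["categories"] = [{"id": 1, "name": "object"}]
    for ann in coco["annotations"]:
        ann["category_id"] = 1

    with open(coco_json_out, "w") as f:
        json.dump(coco, f)

# Hypothetical usage on a COCO-format export of the DUO annotations:
# make_class_agnostic("duo_train.json", "duo_train_agnostic.json")
```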
Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
Sajith, Aryan, Kathala, Krishna Chaitanya Rao
This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs), utilizing the TinyStories dataset for empirical analysis. Analyses of dataset variations with respect to size (25% and 50% of the original size) and duplication (controlled rates of 25%, 50%, 75%, and 100%) were performed. Model performance was evaluated based on validation loss, accuracy, and perplexity metrics. Results indicate that training data quality plays a more significant role in the overall performance of SLMs, especially given the scale of this experiment. Minimal duplication positively impacted model accuracy (+0.87% increase in accuracy at 25% duplication) without significantly increasing perplexity (+0.52% increase going from 0% to 25% duplication), but excessive duplication led to pronounced performance degradation (-40% drop in accuracy at 100% duplication). The implications of this exploration extend beyond model performance alone; training large-scale models imposes significant financial and computational burdens, which can be prohibitive for organizations, individuals, and the public at large, especially in developing countries. Additionally, the energy consumption associated with large-scale training raises environmental concerns. Understanding the relative importance of data quality versus quantity could democratize AI technology, making advanced models more accessible and sustainable for all.
- North America > United States > Massachusetts > Hampshire County > Amherst (0.05)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
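A minimal sketch of how the size and duplication variants described in the abstract above might be constructed. The paper's exact definition of the duplication rate is not given here, so the construction below (duplicating a fraction of the retained stories) is an assumption, and the variable names are hypothetical.

```python
import random

def build_variant(stories, size_frac, dup_rate, seed=0):
    """Keep `size_frac` of the corpus, then duplicate a `dup_rate`
    fraction of the retained stories by appending a second copy.
    How the paper defines its duplication rate is assumed, not confirmed.
    """
    rng = random.Random(seed)
    base = rng.sample(stories, int(len(stories) * size_frac))
    n_dup = int(len(base) * dup_rate)
    return base + rng.sample(base, n_dup)

# Hypothetical sweep matching the described settings:
# for frac in (0.25, 0.50):
#     for dup in (0.0, 0.25, 0.50, 0.75, 1.0):
#         variant = build_variant(tiny_stories, frac, dup)
```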
On-Device Collaborative Language Modeling via a Mixture of Generalists and Specialists
Fan, Dongyang, Messmer, Bettina, Jaggi, Martin
On-device LLMs have gained increasing attention for their ability to enhance privacy and provide a personalized user experience. To facilitate learning with private and scarce local data, federated learning has become a standard approach, though it introduces challenges related to system and data heterogeneity among end users. As a solution, we propose a novel Collaborative learning approach with a Mixture of Generalists and Specialists (CoMiGS), the first to effectively address both. Our approach distinguishes generalists from specialists by aggregating certain experts across end users while keeping others localized to specialize in user-specific datasets. A key innovation of our method is the bi-level optimization formulation of the Mixture-of-Experts learning objective, where the router is updated using a separate validation set that represents the target distribution. CoMiGS effectively balances collaboration and personalization, as demonstrated by its superior performance in scenarios with high data heterogeneity across multiple datasets. By decoupling resource abundance from data quantity, CoMiGS remains robust against overfitting, owing to the generalists' regularizing effect, while adapting to local data through specialist expertise. Large Language Models (LLMs) have shown great success as foundation models, evidenced by the ability of systems such as ChatGPT (OpenAI, 2023), Claude (Anthropic, 2023), and Gemini (DeepMind, 2023) to handle a wide range of tasks. However, cloud-based inference introduces significant delays for end users, and it often fails to meet their personalized needs (Ding et al., 2024; Iyengar & Adusumilli, 2024). Recently, there has been growing interest in deploying LLMs on edge devices, which offer benefits like lower latency, data localization, and more personalized user experiences (Xu et al., 2024). For instance, Apple (2024) recently launched on-device foundation models as part of its personal intelligence system. On-device LLMs present challenges such as limited and variable computational resources, scarce and heterogeneous local data, and privacy concerns related to data sharing (Peng et al., 2024; Wagner et al., 2024).
- Asia > Middle East > Jordan (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > Switzerland > Vaud > Lausanne (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
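The generalist/specialist split described in the abstract above can be pictured as a federated-averaging round in which only designated expert parameters are shared across users while everything else stays local. The sketch below is an illustration under that assumption, not the CoMiGS implementation; `generalist_keys` and the state-dict layout are hypothetical.

```python
import copy
import torch

def aggregate_round(user_states, generalist_keys):
    """One communication round of a generalist/specialist split: parameters
    whose names appear in `generalist_keys` are averaged across users, while
    all other parameters (specialist experts, routers) remain local.
    """
    averaged = {
        k: torch.stack([s[k] for s in user_states]).mean(dim=0)
        for k in generalist_keys
    }
    new_states = []
    for state in user_states:
        merged = copy.deepcopy(state)
        merged.update(averaged)  # overwrite only the shared (generalist) parameters
        new_states.append(merged)
    return new_states

# Hypothetical usage: user_states are state_dicts of per-user MoE models,
# generalist_keys names the experts designated as shared generalists.
# user_states = aggregate_round(user_states, generalist_keys={"experts.0.w1"})
```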
AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs
Kang, Feiyang, Sun, Yifan, Wen, Bingbing, Chen, Si, Song, Dawn, Mahmood, Rafid, Jia, Ruoxi
To ensure performance on a diverse set of downstream tasks, LLMs are pretrained via data mixtures over different domains. In this work, we demonstrate that the optimal data composition for a fixed compute budget varies depending on the scale of the training data, suggesting that the common practice of empirically determining an optimal composition using small-scale experiments will not yield the optimal data mixtures when scaling up to the final model. To address this challenge, we propose AutoScale, an automated tool that finds a compute-optimal data composition for training at any desired target scale. AutoScale first determines the optimal composition at a small scale using a novel bi-level optimization framework, Direct Data Optimization (DDO), and then fits a predictor to estimate the optimal composition at larger scales. The predictor's design is inspired by our theoretical analysis of scaling laws related to data composition, which could be of independent interest. In empirical studies pre-training 774M-parameter decoder-only LMs (GPT-2 Large) on the RedPajama dataset, AutoScale decreases validation perplexity at least 25% faster than any baseline, with up to a 38% speed-up compared to training without reweighting, achieving the best overall performance across downstream tasks. When pre-training encoder-only LMs (BERT) with masked language modeling, DDO is shown to decrease loss on all domains compared with training without reweighting, while visibly improving average task performance on the GLUE benchmark by 8.7% and on a large-scale QA dataset (SQuAD) by 5.9%, and AutoScale further speeds up training by up to 28%.
- North America > United States > Virginia (0.04)
- Asia > Middle East > Jordan (0.04)
- North America > United States > Illinois (0.04)
- (2 more...)
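The AutoScale idea of finding optimal compositions at small scales and extrapolating them to a larger target scale can be illustrated with a toy predictor. The log-linear fit and the numbers below are assumptions for illustration only, not the paper's actual predictor or results.

```python
import numpy as np

def predict_composition(scales, weights_per_scale, target_scale):
    """Extrapolate per-domain mixture weights to a larger training scale.

    `weights_per_scale` has shape (num_scales, num_domains) and holds the
    optimal compositions found at small scales (e.g., via DDO). Each domain's
    weight is fit as linear in log(scale) -- a simplifying assumption, not
    the paper's exact predictor -- and renormalized at the target scale.
    """
    logs = np.log(np.asarray(scales, dtype=float))
    W = np.asarray(weights_per_scale, dtype=float)
    preds = []
    for d in range(W.shape[1]):
        slope, intercept = np.polyfit(logs, W[:, d], deg=1)
        preds.append(slope * np.log(target_scale) + intercept)
    preds = np.clip(np.asarray(preds), 1e-6, None)
    return preds / preds.sum()

# Hypothetical numbers: compositions found at 10M and 30M tokens,
# extrapolated to a 1B-token run.
mixture = predict_composition(
    scales=[10e6, 30e6],
    weights_per_scale=[[0.50, 0.30, 0.20], [0.45, 0.32, 0.23]],
    target_scale=1e9,
)
print(mixture)
```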