AITopics | data collection strategy

Collaborating Authors

data collection strategy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

LearningBeamSearchPoliciesviaImitation Learning

Neural Information Processing SystemsFeb-13-2026, 19:52:59 GMT

Beam search is widely used for approximate decoding in structured prediction problems. Models often use a beam at test time but ignore its existence at train time, and therefore do not explicitly learn how to use the beam.

algorithm, artificial intelligence, machine learning, (13 more...)

Neural Information Processing Systems

Country:

North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
North America > Canada > Quebec > Montreal (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.51)

Add feedback

28312c9491d60ed0c77f7fff4ad86dd1-Paper-Conference.pdf

Neural Information Processing SystemsFeb-9-2026, 05:45:56 GMT

estimator, optimization problem, variance, (15 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.71)

Add feedback

Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

Neural Information Processing SystemsDec-24-2025, 06:53:47 GMT

Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we consider as quality index the variance of an unbiased policy return estimator that uses trajectories of different lengths, i.e., . We first derive a closed-form expression of this variance that clearly shows the sub-optimality of the fixed-length trajectory schedule. Furthermore, it suggests that adaptive data collection strategies that spend the available budget sequentially might be able to allocate a larger portion of transitions in timesteps in which more accurate sampling is required to reduce the variance of the final estimate.

monte carlo policy evaluation, name change, truncating trajectory, (9 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.59)

Add feedback

Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset

Zhang, Lily Hong, Milli, Smitha, Jusko, Karen, Smith, Jonathan, Amos, Brandon, Bouaziz, Wassim, Revel, Manon, Kussman, Jack, Sheynin, Yasha, Titus, Lisa, Radharapu, Bhaktipriya, Yu, Jane, Sarma, Vidya, Rose, Kris, Nickel, Maximilian

arXiv.org Artificial IntelligenceOct-28-2025

How can large language models (LLMs) serve users with varying preferences that may conflict across cultural, political, or other dimensions? To advance this challenge, this paper establishes four key results. First, we demonstrate, through a large-scale multilingual human study with representative samples from five countries (N=15,000), that humans exhibit significantly more variation in preferences than the responses of 21 state-of-the-art LLMs. Second, we show that existing methods for preference dataset collection are insufficient for learning the diversity of human preferences even along two of the most salient dimensions of variability in global values, due to the underlying homogeneity of candidate responses. Third, we argue that this motivates the need for negatively-correlated sampling when generating candidate sets, and we show that simple prompt-based techniques for doing so significantly enhance the performance of alignment methods in learning heterogeneous preferences. Fourth, based on this novel candidate sampling approach, we collect and open-source Community Alignment, the largest and most representative multilingual and multi-turn preference dataset to date, featuring almost 200,000 comparisons from annotators spanning five countries. We hope that the Community Alignment dataset will be a valuable resource for improving the effectiveness of LLMs for a diverse global population.

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.0965

Country:

Europe (1.00)
Asia (1.00)
South America (0.92)
North America > United States > New York (0.27)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.92)
(4 more...)

Industry:

Water & Waste Management > Water Management > Water Supplies & Services (1.00)
Media > Music (1.00)
Materials > Chemicals > Agricultural Chemicals (1.00)
(17 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

Neural Information Processing SystemsOct-8-2025, 08:00:07 GMT

However, is this data collection strategy the best option?

estimator, optimization problem, variance, (15 more...)

Neural Information Processing Systems

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.71)

Add feedback

Guiding Data Collection via Factored Scaling Curves

Zha, Lihan, Badithela, Apurva, Zhang, Michael, Lidard, Justin, Bao, Jeremy, Zhou, Emily, Snyder, David, Ren, Allen Z., Shah, Dhruv, Majumdar, Anirudha

arXiv.org Artificial IntelligenceMay-13-2025

Generalist imitation learning policies trained on large datasets show great promise for solving diverse manipulation tasks. However, to ensure generalization to different conditions, policies need to be trained with data collected across a large set of environmental factor variations (e.g., camera pose, table height, distractors) $-$ a prohibitively expensive undertaking, if done exhaustively. We introduce a principled method for deciding what data to collect and how much to collect for each factor by constructing factored scaling curves (FSC), which quantify how policy performance varies as data scales along individual or paired factors. These curves enable targeted data acquisition for the most influential factor combinations within a given budget. We evaluate the proposed method through extensive simulated and real-world experiments, across both training-from-scratch and fine-tuning settings, and show that it boosts success rates in real-world tasks in new environments by up to 26% over existing data-collection strategies. We further demonstrate how factored scaling curves can effectively guide data collection using an offline metric, without requiring real-world evaluation at scale.

large language model, machine learning, reinforcement learning, (19 more...)

arXiv.org Artificial Intelligence

2505.07728

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Robots (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
(3 more...)

Add feedback

On the Benefits of Active Data Collection in Operator Learning

Subedi, Unique, Tewari, Ambuj

arXiv.org Machine LearningDec-5-2024

We investigate active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. Thus, with sufficiently rapid eigenvalue decay of the covariance kernels, arbitrarily fast error convergence rates can be achieved. This contrasts with the passive (i.i.d.) data collection strategies, where the convergence rate is never faster than $\sim n^{-1}$. In fact, for our setting, we establish a \emph{non-vanishing} lower bound for any passive data collection strategy, regardless of the eigenvalues decay rate of the covariance kernel. Overall, our results show the benefit of active over passive data collection strategies in operator learning.

collection strategy, data collection strategy, operator, (16 more...)

arXiv.org Machine Learning

2410.19725

Country:

North America > United States > Michigan (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.88)

Add feedback

Survey of Cultural Awareness in Language Models: Text and Beyond

Pawar, Siddhesh, Park, Junyeong, Jin, Jiho, Arora, Arnav, Myung, Junho, Yadav, Srishti, Haznitrama, Faiz Ghifari, Song, Inhwa, Oh, Alice, Augenstein, Isabelle

arXiv.org Artificial IntelligenceOct-30-2024

Large-scale deployment of large language models (LLMs) in various applications, such as chatbots and virtual assistants, requires LLMs to be culturally sensitive to the user to ensure inclusivity. Culture has been widely studied in psychology and anthropology, and there has been a recent surge in research on making LLMs more culturally inclusive in LLMs that goes beyond multilinguality and builds on findings from psychology and anthropology. In this paper, we survey efforts towards incorporating cultural awareness into text-based and multimodal LLMs. We start by defining cultural awareness in LLMs, taking the definitions of culture from anthropology and psychology as a point of departure. We then examine methodologies adopted for creating cross-cultural datasets, strategies for cultural inclusion in downstream tasks, and methodologies that have been used for benchmarking cultural awareness in LLMs. Further, we discuss the ethical implications of cultural alignment, the role of Human-Computer Interaction in driving cultural inclusion in LLMs, and the role of cultural alignment in driving social science research. We finally provide pointers to future research based on our findings about gaps in the literature.

dataset creation methodology, neural information processing system 2022, neural information processing system 35, (16 more...)

arXiv.org Artificial Intelligence

2411.0086

Country:

North America > United States > Washington > King County > Seattle (0.27)
Asia > South Korea (0.14)
North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
(67 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Leisure & Entertainment (1.00)
Health & Medicine > Therapeutic Area (1.00)
Education > Educational Setting > K-12 Education (1.00)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.46)

Add feedback

Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

Neural Information Processing SystemsOct-10-2024, 14:59:48 GMT

Policy evaluation via Monte Carlo (MC) simulation is at the core of many MC Reinforcement Learning (RL) algorithms (e.g., policy gradient methods). In this context, the designer of the learning system specifies an interaction budget that the agent usually spends by collecting trajectories of fixed length within a simulator. However, is this data collection strategy the best option? To answer this question, in this paper, we consider as quality index the variance of an unbiased policy return estimator that uses trajectories of different lengths, i.e., truncated. We first derive a closed-form expression of this variance that clearly shows the sub-optimality of the fixed-length trajectory schedule.

adaptive approach, monte carlo policy evaluation, truncating trajectory, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.61)

Add feedback

Filters

Collaborating Authors

data collection strategy

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

LearningBeamSearchPoliciesviaImitation Learning

28312c9491d60ed0c77f7fff4ad86dd1-Paper-Conference.pdf

Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

967c2ae04b169f07e7fa8fdfd110551e-Paper.pdf

Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset

Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach

Guiding Data Collection via Factored Scaling Curves

On the Benefits of Active Data Collection in Operator Learning

Survey of Cultural Awareness in Language Models: Text and Beyond

Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach