AITopics | data tuple

Collaborating Authors

data tuple

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Chunked Data Shapley: A Scalable Dataset Quality Assessment for Machine Learning

Loizou, Andreas, Tsoumakos, Dimitrios

arXiv.org Artificial IntelligenceAug-25-2025

As the volume and diversity of available datasets continue to increase, assessing data quality has become crucial for reliable and efficient Machine Learning analytics. A modern, game-theoretic approach for evaluating data quality is the notion of Data Shapley which quantifies the value of individual data points within a dataset. State-of-the-art methods to scale the NP-hard Shapley computation also face severe challenges when applied to large-scale datasets, limiting their practical use. In this work, we present a Data Shapley approach to identify a dataset's high-quality data tuples, Chunked Data Shapley (C-DaSh). C-DaSh scalably divides the dataset into manageable chunks and estimates the contribution of each chunk using optimized subset selection and single-iteration stochastic gradient descent. This approach drastically reduces computation time while preserving high quality results. We empirically benchmark our method on diverse real-world classification and regression tasks, demonstrating that C-DaSh outperforms existing Shapley approximations in both computational efficiency (achieving speedups between 80x - 2300x) and accuracy in detecting low-quality data regions. Our method enables practical measurement of dataset quality on large tabular datasets, supporting both classification and regression pipelines.

data quality, data tuple, machine learning, (19 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3746252.3761305

2508.16255

Country: Europe > Greece (0.14)

Genre: Research Report (0.84)

Industry: Health & Medicine (0.94)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.53)

Add feedback

OpenMLDB: A Real-Time Relational Data Feature Computation System for Online ML

Zhou, Xuanhe, Zhou, Wei, Qi, Liguo, Zhang, Hao, Chen, Dihao, He, Bingsheng, Lu, Mian, Li, Guoliang, Wu, Fan, Chen, Yuqiang

arXiv.org Artificial IntelligenceJan-15-2025

Efficient and consistent feature computation is crucial for a wide range of online ML applications. Typically, feature computation is divided into two distinct phases, i.e., offline stage for model training and online stage for model serving. These phases often rely on execution engines with different interface languages and function implementations, causing significant inconsistencies. Moreover, many online ML features involve complex time-series computations (e.g., functions over varied-length table windows) that differ from standard streaming and analytical queries. Existing data processing systems (e.g., Spark, Flink, DuckDB) often incur multi-second latencies for these computations, making them unsuitable for real-time online ML applications that demand timely feature updates. This paper presents OpenMLDB, a feature computation system deployed in 4Paradigm's SageOne platform and over 100 real scenarios. Technically, OpenMLDB first employs a unified query plan generator for consistent computation results across the offline and online stages, significantly reducing feature deployment overhead. Second, OpenMLDB provides an online execution engine that resolves performance bottlenecks caused by long window computations (via pre-aggregation) and multi-table window unions (via data self-adjusting). It also provides a high-performance offline execution engine with window parallel optimization and time-aware data skew resolving. Third, OpenMLDB features a compact data format and stream-focused indexing to maximize memory usage and accelerate data access. Evaluations in testing and real workloads reveal significant performance improvements and resource savings compared to the baseline systems. The open community of OpenMLDB now has over 150 contributors and gained 1.6k stars on GitHub.

computation, openmldb, optimization, (16 more...)

arXiv.org Artificial Intelligence

2501.08591

Country:

Asia > China > Shanghai > Shanghai (0.04)
Oceania > Australia > Victoria > Melbourne (0.04)
Oceania > Australia > Australian Capital Territory > Canberra (0.04)
(9 more...)

Genre: Research Report (0.82)

Industry:

Banking & Finance (0.67)
Information Technology (0.67)

Technology:

Information Technology > Data Science > Data Mining > Big Data (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval > Query Processing (0.67)

Add feedback

Criticality and Safety Margins for Reinforcement Learning

Grushin, Alexander, Woods, Walt, Velasquez, Alvaro, Khan, Simon

arXiv.org Artificial IntelligenceSep-26-2024

State of the art reinforcement learning methods sometimes encounter unsafe situations. Identifying when these situations occur is of interest both for post-hoc analysis and during deployment, where it might be advantageous to call out to a human overseer for help. Efforts to gauge the criticality of different points in time have been developed, but their accuracy is not well established due to a lack of ground truth, and they are not designed to be easily interpretable by end users. Therefore, we seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users. We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions. We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality. Safety margins make these interpretable, when defined as the number of random actions for which performance loss will not exceed some tolerance with high confidence. We demonstrate this approach in several environment-agent combinations; for an A3C agent in an Atari Beamrider environment, the lowest 5% of safety margins contain 47% of agent losses; i.e., supervising only 5% of decisions could potentially prevent roughly half of an agent's errors. This criticality framework measures the potential impacts of bad decisions, even before those decisions are made, allowing for more effective debugging and oversight of autonomous agents.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Artificial Intelligence

2409.18289

Country:

North America > United States > Oregon > Multnomah County > Portland (0.04)
North America > United States > Colorado > Boulder County > Boulder (0.04)
North America > United States > California > Monterey County > Monterey (0.04)
(2 more...)

Genre: Research Report (0.64)

Industry:

Government (0.46)
Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Absolute Shapley Value

Liu, Jinfei

arXiv.org Machine LearningMar-23-2020

Shapley value is a concept in cooperative game theory for measuring the contribution of each participant, which was named in honor of Lloyd Shapley. Shapley value has been recently applied in data marketplaces for compensation allocation based on their contribution to the models. Shapley value is the only value division scheme used for compensation allocation that meets three desirable criteria: group rationality, fairness, and additivity. In cooperative game theory, the marginal contribution of each contributor to each coalition is a nonnegative value. However, in machine learning model training, the marginal contribution of each contributor (data tuple) to each coalition (a set of data tuples) can be a negative value, i.e., the accuracy of the model trained by a dataset with an additional data tuple can be lower than the accuracy of the model trained by the dataset only. In this paper, we investigate the problem of how to handle the negative marginal contribution when computing Shapley value. We explore three philosophies: 1) taking the original value (Original Shapley Value); 2) taking the larger of the original value and zero (Zero Shapley Value); and 3) taking the absolute value of the original value (Absolute Shapley Value). Experiments on Iris dataset demonstrate that the definition of Absolute Shapley Value significantly outperforms the other two definitions in terms of evaluating data importance (the contribution of each data tuple to the trained model).

data tuple, shapley value, tuple, (13 more...)

arXiv.org Machine Learning

2003.10076

Country:

North America > United States > California > Los Angeles County > Long Beach (0.05)
Europe > Italy > Marche > Ancona Province > Ancona (0.04)
Europe > France (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Game Theory (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.32)

Add feedback