dataspace


What do row and column marginals reveal about your dataset?

Golshan, Behzad, Byers, John, Terzi, Evimaria

Neural Information Processing Systems

Numerous datasets, ranging from group memberships within social networks to purchase histories on e-commerce sites, are represented by binary matrices. While this data is often either proprietary or sensitive, aggregated data, notably row and column marginals, is often viewed as much less sensitive and may be furnished for analysis. Here, we investigate how these data can be exploited to make inferences about the underlying matrix H. Instead of assuming a generative model for H, we view the input marginals as constraints on the dataspace of possible realizations of H and compute the probability density function of particular entries H(i,j) of interest. We do this for all cells of H simultaneously, without generating realizations, but rather by implicitly sampling the datasets that satisfy the input marginals. The end result is an efficient algorithm whose running time equals the time required by standard sampling techniques to generate a single dataset from the same dataspace. Our experimental evaluation demonstrates the efficiency and the efficacy of our framework in multiple settings.
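The abstract's central object, the dataspace of all binary matrices consistent with given row and column marginals, can be made concrete with a brute-force sketch. This is emphatically not the paper's algorithm, which computes the cell probabilities without ever enumerating realizations; the code below simply enumerates every consistent matrix and tallies, per cell, how often it equals 1.

```python
from itertools import product

def entry_probabilities(row_sums, col_sums):
    """Brute-force the dataspace: enumerate every binary matrix whose
    row and column marginals match the input, and return, for each
    cell (i, j), the fraction of consistent matrices with H[i][j] = 1.
    Exponential in the matrix size -- an illustration only."""
    n, m = len(row_sums), len(col_sums)
    counts = [[0] * m for _ in range(n)]
    total = 0
    for bits in product([0, 1], repeat=n * m):
        H = [bits[i * m:(i + 1) * m] for i in range(n)]
        if [sum(row) for row in H] != list(row_sums):
            continue
        if [sum(H[i][j] for i in range(n)) for j in range(m)] != list(col_sums):
            continue
        total += 1
        for i in range(n):
            for j in range(m):
                counts[i][j] += H[i][j]
    return [[c / total for c in row] for row in counts], total

# Row sums (2, 1) and column sums (1, 1, 1) admit exactly three
# 2x3 matrices, so each first-row cell is 1 in two of the three.
probs, total = entry_probabilities([2, 1], [1, 1, 1])
```

Even this tiny example shows how unevenly the marginals constrain individual cells, which is exactly the per-entry information the paper's efficient algorithm recovers at scale.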


From Instructions to ODRL Usage Policies: An Ontology Guided Approach

Mustafa, Daham M., Nadgeri, Abhishek, Collarana, Diego, Arnold, Benedikt T., Quix, Christoph, Lange, Christoph, Decker, Stefan

arXiv.org Artificial Intelligence

This study presents an approach that uses large language models such as GPT-4 to generate usage policies in the W3C Open Digital Rights Language (ODRL) automatically from natural-language instructions. Our approach uses the ODRL ontology and its documentation as a central part of the prompt. Our research hypothesis is that a curated version of the existing ontology documentation will better guide policy generation. We present various heuristics for adapting the ODRL ontology and its documentation to guide an end-to-end knowledge graph (KG) construction process. We evaluate our approach in the context of dataspaces, i.e., distributed infrastructures for trustworthy data exchange between multiple participating organizations, for the cultural domain. We created a benchmark consisting of 12 use cases of varying complexity. Our evaluation shows excellent results, with up to 91.95% accuracy in the resulting knowledge graph.
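The "curated ontology documentation in the prompt" idea can be sketched as plain prompt assembly. The short ODRL term glossary below is abridged from the public ODRL vocabulary; the function name, prompt wording, and example instruction are illustrative assumptions, not the paper's actual heuristics or code.

```python
# Abridged glossary of real ODRL vocabulary terms, standing in for the
# paper's curated ontology documentation.
ODRL_CONTEXT = """\
odrl:Policy -- a group of one or more rules
odrl:permission -- an action a party may perform on an asset
odrl:prohibition -- an action a party must not perform
odrl:target -- the asset a rule applies to
odrl:assignee -- the party granted or denied the action
"""

def build_policy_prompt(instruction: str) -> str:
    """Combine the curated ontology documentation with a
    natural-language instruction into a single LLM prompt."""
    return (
        "You translate instructions into W3C ODRL policies (JSON-LD).\n"
        "Use only the following ODRL terms:\n"
        f"{ODRL_CONTEXT}\n"
        f"Instruction: {instruction}\n"
        "Output a single JSON-LD ODRL policy."
    )

prompt = build_policy_prompt(
    "Museum A permits Museum B to display the scanned painting."
)
```

The design point is that the model is steered toward a closed vocabulary: everything it may emit is named in the prompt, which is what makes the output checkable against the ontology.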


Non-Gaussianities in Collider Metric Binning

Larkoski, Andrew J.

arXiv.org Artificial Intelligence

Metrics for rigorously defining a distance between two events have been used to study the properties of the dataspace manifold of particle collider physics. The probability distribution of pairwise distances on this dataspace is unique with probability 1, which suggests a method to search for and identify new physics by the deviation of measurement from a null-hypothesis prediction. To quantify the deviation statistically, we directly calculate the probability distribution of the number of event pairs that land in a bin a fixed distance apart. This distribution is not generically Gaussian, and the ratio of the standard deviation to the mean number of entries in a bin scales inversely with the square root of the number of events in the data ensemble. If the dataspace manifold exhibits some enhanced symmetry, the number of entries is Gaussian, and further fluctuations about the mean scale away like the inverse of the number of events. We define a robust measure of the non-Gaussianity of the bin-by-bin statistics of the distance distribution and demonstrate, in simulated data of jets from quantum chromodynamics, sensitivity to the parton-to-hadron transition, showing that the manifold of events enjoys enhanced symmetries as their energy increases.
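The pair-counting statistic at the heart of the abstract can be sketched in a toy setting: scalar "events" with |a - b| as a stand-in for the collider metric (both assumptions of this sketch, not the paper's setup). Each of the N events participates in N - 1 pairs, so the entries of a bin are correlated rather than independent, which is the origin of the non-Gaussian bin statistics the paper quantifies.

```python
import random

def pairwise_bin_counts(events, edges, dist=lambda a, b: abs(a - b)):
    """Histogram of pairwise distances: for each bin
    [edges[k], edges[k+1]), count the event pairs whose distance
    lands in it. Toy 1-D stand-in for a collider event metric."""
    counts = [0] * (len(edges) - 1)
    n = len(events)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist(events[i], events[j])
            for k in range(len(edges) - 1):
                if edges[k] <= d < edges[k + 1]:
                    counts[k] += 1
                    break
    return counts

random.seed(0)
events = [random.random() for _ in range(200)]       # N = 200 toy events
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
counts = pairwise_bin_counts(events, edges)
# The bins share events: N events yield N * (N - 1) / 2 correlated pairs.
```

Because every bin entry reuses the same N events, the bin counts are not sums of independent draws, and their fluctuations need not follow the naive Poisson/Gaussian expectation.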


A Step Toward Interpretability: Smearing the Likelihood

Larkoski, Andrew J.

arXiv.org Machine Learning

The problem of interpretability of machine learning architectures in particle physics has no agreed-upon definition, much less any proposed solution. We present a first modest step toward these goals by proposing a definition and a corresponding practical method for isolating and identifying the relevant physical energy scales exploited by the machine. This is accomplished by smearing, or averaging, over all input events that lie within a prescribed metric energy distance of one another, which correspondingly renders any quantity measured on a finite, discrete dataset continuous over the dataspace. Within this approach, we are able to explicitly demonstrate that (approximate) scaling laws are a consequence of extreme value theory applied to the distribution of the irreducible minimal distance over which a machine must extrapolate given a finite dataset. As an example, we study quark versus gluon jet identification, construct the smeared likelihood, and show that discrimination power steadily increases as resolution decreases, indicating that the true likelihood for the problem is sensitive to emissions at all scales.
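The smearing step can be sketched in one dimension: average a per-event quantity over all events within a metric radius of a query point, which turns a quantity defined on a discrete dataset into one continuous over the dataspace. The 1-D events and |a - b| metric here are stand-ins for the paper's collider setting, purely for illustration.

```python
def smeared_observable(events, values, center, radius,
                       dist=lambda a, b: abs(a - b)):
    """Average a per-event quantity over all events within `radius`
    of `center` in the metric. Returns None when no event is within
    the resolution, i.e., where the machine must extrapolate."""
    selected = [v for e, v in zip(events, values)
                if dist(e, center) <= radius]
    if not selected:
        return None
    return sum(selected) / len(selected)

events = [0.0, 0.1, 0.2, 0.8, 0.9]
values = [1.0, 1.0, 1.0, 0.0, 0.0]   # two toy populations
# Coarse resolution mixes both populations; fine resolution isolates one.
coarse = smeared_observable(events, values, center=0.15, radius=1.0)
fine = smeared_observable(events, values, center=0.15, radius=0.1)
```

Sweeping the radius and watching a quantity like discrimination power change is the spirit of the paper's method: the radius at which the answer shifts flags a physical scale the machine is exploiting.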


Towards Enabling FAIR Dataspaces Using Large Language Models

Arnold, Benedikt T., Theissen-Lipp, Johannes, Collarana, Diego, Lange, Christoph, Geisler, Sandra, Curry, Edward, Decker, Stefan

arXiv.org Artificial Intelligence

Dataspaces have recently gained adoption across various sectors, including traditionally less digitized domains such as culture. Leveraging Semantic Web technologies helps to make dataspaces FAIR, but their complexity poses a significant challenge to the adoption of dataspaces and increases their cost. The advent of Large Language Models (LLMs) raises the question of how these models can support the adoption of FAIR dataspaces. In this work, we demonstrate the potential of LLMs in dataspaces with a concrete example. We also derive a research agenda for exploring this emerging field.


Pantypes: Diverse Representatives for Self-Explainable Models

Kjærsgaard, Rune, Boubekki, Ahcène, Clemmensen, Line

arXiv.org Machine Learning

Prototypical self-explainable classifiers have emerged to meet the growing demand for interpretable AI systems. These classifiers are designed to incorporate high transparency in their decisions by basing inference on similarity with learned prototypical objects. While these models are designed with diversity in mind, the learned prototypes often do not sufficiently represent all aspects of the input distribution, particularly those in low density regions. Such lack of sufficient data representation, known as representation bias, has been associated with various detrimental properties related to machine learning diversity and fairness. In light of this, we introduce pantypes, a new family of prototypical objects designed to capture the full diversity of the input distribution through a sparse set of objects. We show that pantypes can empower prototypical self-explainable models by occupying divergent regions of the latent space and thus fostering high diversity, interpretability and fairness.
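The paper learns pantypes through its own training objective; purely to illustrate what "occupying divergent regions of the latent space" buys, here is a greedy max-min (farthest-point) selection of a sparse set of representatives, a classic diversity heuristic and an assumption of this sketch, not the paper's method. By construction it reaches low-density regions that density-seeking prototypes tend to miss.

```python
def farthest_point_prototypes(points, k, dist=lambda a, b: abs(a - b)):
    """Greedy max-min selection: repeatedly pick the point farthest
    from all prototypes chosen so far. A simple stand-in showing a
    sparse, diverse set of objects in a (here 1-D) latent space."""
    prototypes = [points[0]]
    while len(prototypes) < k:
        best = max(
            (p for p in points if p not in prototypes),
            key=lambda p: min(dist(p, q) for q in prototypes),
        )
        prototypes.append(best)
    return prototypes

# Three clusters; the middle one is "low density" with two points.
latent = [0.0, 0.1, 0.2, 5.0, 5.1, 9.9, 10.0]
protos = farthest_point_prototypes(latent, 3)
```

Note that all three clusters, including the sparse middle one, receive a representative, which is the representation-bias failure mode the pantype construction is designed to avoid.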


feature engineering and over-fitting • /r/MachineLearning

@machinelearnbot

Over-fitting refers to almost "remembering" the exact data points rather than learning an intelligent representation of the data. With a neural network of 10,000 hidden units, I can definitely overfit a training set of 10,000 samples: simply put, every hidden neuron can correspond to one input sample. Feature engineering concerns the expansion of your input space. Say you have input vectors of 20 features.
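The "one hidden unit per sample" point can be made with any model whose free-parameter count matches the sample count. Below, a degree-(n-1) polynomial through n points stands in for the neural network (an analogy of this sketch, not the poster's example): it hits every training point exactly, yet swings wildly off the data's rough linear trend as soon as it must extrapolate.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through
    (xs, ys) at x. With n points and n coefficients, the model has
    exactly enough capacity to 'remember' every sample."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 1.3, 1.8, 3.2, 3.9]   # roughly linear data, noise baked in
# Zero training error: the model reproduces every sample exactly.
train_errs = [abs(lagrange_predict(xs, ys, x) - y) for x, y in zip(xs, ys)]
# But one step past the data, the memorizer leaves the linear trend.
wild = lagrange_predict(xs, ys, 5.0)
```

A two-parameter linear fit would miss some training points slightly but continue the trend sensibly at x = 5; the memorizer's perfect training score is exactly what hides its failure to generalize.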

