AITopics | relational data

Group-Level Data Selection for Efficient Pretraining

Neural Information Processing Systems Jun-23-2026, 00:02:29 GMT

The efficiency and quality of language model pretraining are largely determined by the way pretraining data are selected. In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a relational data influence model. To train this model, we sample training trajectories of the language model and collect oracle data influences alongside. The relational data influence model approximates the oracle data influence by weighting individual influence with relationships among training data. To enable efficient selection with our relational data influence model, we partition the dataset into small clusters using relationship weights and select data within each cluster independently. Experiments on DCLM 400M-4x, 1B-1x, and 3B-1x show that Group-MATES achieves 3.5%-9.4%

large language model, machine learning, selection, (20 more...)

Neural Information Processing Systems

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)
Overview (0.67)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Group-Level Data Selection for Efficient Pretraining

Neural Information Processing Systems Jun-14-2026, 06:12:57 GMT

The efficiency and quality of language model pretraining are largely determined by the way pretraining data are selected. In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a relational data influence model. To train this model, we sample training trajectories of the language model and collect oracle data influences alongside. The relational data influence model approximates the oracle data influence by weighting individual influence with relationships among training data. To enable efficient selection with our relational data influence model, we partition the dataset into small clusters using relationship weights and select data within each cluster independently.
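
The cluster-wise selection step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_within_clusters`, the `relation` callback, and the rule that discounts a candidate by its strongest relationship to already chosen items are all assumptions made for illustration. The abstract only says that individual influence is weighted by relationships among training data.

```python
from collections import defaultdict


def select_within_clusters(influence, cluster_of, relation, keep_per_cluster):
    """Greedily pick up to keep_per_cluster items from each cluster.

    influence:  dict mapping item id -> individual influence score
    cluster_of: dict mapping item id -> cluster label
    relation:   callable (a, b) -> weight in [0, 1]; a hypothetical
                stand-in for a learned relationship weight
    """
    clusters = defaultdict(list)
    for item, label in cluster_of.items():
        clusters[label].append(item)

    selected = []
    for label in sorted(clusters):
        remaining = set(clusters[label])
        chosen = []
        while remaining and len(chosen) < keep_per_cluster:

            def adjusted(item):
                # Assumed weighting rule: discount the individual influence
                # by the strongest relationship to an already chosen item.
                penalty = max((relation(item, c) for c in chosen), default=0.0)
                return influence[item] * (1.0 - penalty)

            # Iterate in sorted order so that ties break deterministically.
            best = max(sorted(remaining), key=adjusted)
            chosen.append(best)
            remaining.remove(best)
        selected.extend(chosen)
    return selected
```

Because each cluster is processed on its own, the relationship weights are only evaluated between items of the same cluster, which is what keeps the group-level selection cheap in this sketch.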

artificial intelligence, machine learning, proceedings, (8 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.39)

Differentially Private Relational Learning with Entity-level Privacy Guarantees

Neural Information Processing Systems Jun-12-2026, 22:45:45 GMT

Learning with relational and network-structured data is increasingly vital in sensitive domains where protecting the privacy of individual entities is paramount. Differential Privacy (DP) offers a principled approach for quantifying privacy risks, with DP-SGD emerging as a standard mechanism for private model training. However, directly applying DP-SGD to relational learning is challenging due to two key factors: (i) entities often participate in multiple relations, resulting in high and difficult-to-control sensitivity; and (ii) relational learning typically involves multi-stage, potentially coupled (interdependent) sampling procedures that make standard privacy amplification analyses inapplicable. This work presents a principled framework for relational learning with formal entity-level DP guarantees. We provide a rigorous sensitivity analysis and introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. We also extend the privacy amplification results to a tractable subclass of coupled sampling, where the dependence arises only through sample sizes. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees. Experiments on fine-tuning text encoders over text-attributed network-structured relational data demonstrate the strong utility-privacy trade-offs of our approach.
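
To make the idea of frequency-dependent clipping concrete, here is a minimal sketch of one noisy gradient aggregation step. It is an illustration under stated assumptions, not the paper's algorithm: the threshold rule `base_clip / max entity frequency`, the names `clip` and `noisy_clipped_sum`, and the noise scale are hypothetical, and the sketch makes no claim about the resulting privacy guarantee.

```python
import math
import random
from collections import Counter


def clip(grad, threshold):
    """Scale grad so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def noisy_clipped_sum(grads, entities, base_clip, noise_multiplier, rng):
    """Sum per-example gradients after entity-aware clipping, then add noise.

    grads:    list of per-example gradients (lists of floats)
    entities: list of tuples; entities[i] are the entities in example i
    """
    # How often each entity occurs in this batch.
    freq = Counter(e for ents in entities for e in ents)
    total = [0.0] * len(grads[0])
    for grad, ents in zip(grads, entities):
        # Assumed rule: examples touching frequent entities get a smaller
        # threshold, so one entity's summed contribution in this batch
        # has norm at most base_clip.
        threshold = base_clip / max(freq[e] for e in ents)
        for i, g in enumerate(clip(grad, threshold)):
            total[i] += g
    # Gaussian noise scaled to the per-entity bound (illustrative only).
    sigma = noise_multiplier * base_clip
    return [t + rng.gauss(0.0, sigma) for t in total]
```

A call such as `noisy_clipped_sum(grads, entities, 1.0, 1.1, random.Random(0))` returns one noisy gradient vector. Handling the coupled sampling that the abstract mentions would require the paper's amplification analysis and is outside this sketch.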

artificial intelligence, machine learning, proceedings, (5 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.89)
Information Technology > Databases (0.83)

029f82afd78288059dc946b105c451fd-Paper-Conference.pdf

Neural Information Processing Systems Apr-24-2026, 08:14:41 GMT

artificial intelligence, data mining, machine learning, (19 more...)

Neural Information Processing Systems

Country:

North America > United States (1.00)
Europe (1.00)

Genre: Research Report (0.69)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
Information Technology > Databases (0.71)
Information Technology > Data Science > Data Mining (0.67)

e607b1419e9ae7cd5cb5b5bb60c2ad5c-Supplemental-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems Feb-18-2026, 12:15:03 GMT

artificial intelligence, information management, natural language, (18 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County > Palo Alto (0.06)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.94)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Information Management (0.94)
Information Technology > Security & Privacy (0.69)

Scalable Deep Generative Relational Model with High-Order Node Dependence

Xuhui Fan, Bin Li, Caoyuan Li, Scott Sisson, Ling Chen

Neural Information Processing Systems Feb-13-2026, 08:21:02 GMT

Neural Information Processing Systems http://nips.cc/

deep network architecture, relational data, sdrem, (12 more...)

Neural Information Processing Systems

Country:

Asia > China > Shanghai > Shanghai (0.04)
Oceania > Australia > New South Wales (0.04)
North America > Canada (0.04)
Asia > Middle East > Jordan (0.04)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.70)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.47)

Coresets for Relational Data and The Applications

Neural Information Processing Systems Dec-23-2025, 16:43:28 GMT

A coreset is a small set that approximately preserves the structure of the original input data set. We can therefore run an algorithm on the coreset to reduce the total computational cost. Conventional coreset techniques assume that the input data set is explicitly available for processing. However, this assumption may not hold in real-world scenarios. In this paper, we consider the problem of coreset construction over relational data. Namely, the data is decoupled into several relational tables, and it could be very expensive to directly materialize the data matrix by joining the tables. We propose a novel approach called "aggregation tree with pseudo-cube" that builds a coreset from the bottom up. Moreover, our approach can neatly circumvent several troublesome issues of relational learning problems [Khamis et al., PODS 2019]. Under some mild assumptions, we show that our coreset approach can be applied to machine learning tasks such as clustering, logistic regression, and SVM.
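
For readers unfamiliar with coresets, the following minimal sketch shows the basic idea of a weighted summary on explicitly materialized points. It is a simple grid-based construction chosen for illustration and is not the "aggregation tree with pseudo-cube" approach from the abstract, which is designed precisely to avoid materializing the joined data matrix. The names `grid_coreset` and `weighted_cost` are hypothetical.

```python
from collections import Counter


def grid_coreset(points, cell):
    """Snap each point to the nearest grid node and weight nodes by count.

    points: iterable of equal-length tuples of floats
    cell:   grid spacing; smaller cells give a larger, more accurate summary
    """
    counts = Counter(tuple(round(x / cell) for x in p) for p in points)
    return [(tuple(i * cell for i in idx), w) for idx, w in counts.items()]


def weighted_cost(weighted_points, center):
    """Weighted sum of squared distances to a single center."""
    return sum(
        w * sum((a - b) ** 2 for a, b in zip(p, center))
        for p, w in weighted_points
    )
```

An algorithm such as clustering can then evaluate candidate centers with `weighted_cost` on the summary instead of the full data set; the total weight of the summary equals the number of original points.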

coreset, name change, relational data, (5 more...)

Neural Information Processing Systems

Genre: Research Report (0.84)

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Relational Causal Discovery with Latent Confounders

Matteo Negro, Andrea Piras, Ragib Ahsan, David Arbour, Elena Zheleva

arXiv.org Artificial Intelligence Nov-5-2025

Estimating causal effects from real-world relational data can be challenging when the underlying causal model and potential confounders are unknown. While several causal discovery algorithms exist for learning causal models with latent confounders from data, they assume that the data is independent and identically distributed (i.i.d.) and are not well-suited for learning from relational data. Similarly, existing relational causal discovery algorithms assume causal sufficiency, which is unrealistic for many real-world datasets. To address this gap, we propose RelFCI, a sound and complete causal discovery algorithm for relational data with latent confounders. Our work builds upon the Fast Causal Inference (FCI) and Relational Causal Discovery (RCD) algorithms and defines new graphical models necessary to support causal discovery in relational domains. We also establish soundness and completeness guarantees for relational d-separation with latent confounders. We present experimental results demonstrating the effectiveness of RelFCI in identifying the correct causal structure in relational causal models with latent confounders.

artificial intelligence, dependency, machine learning, (16 more...)

arXiv.org Artificial Intelligence

doi: 10.5555/3762387.3762521

2507.017

Country: North America > United States (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.48)

Supplementary Material of STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases Website/Platform and Hosting

Neural Information Processing Systems Oct-10-2025, 19:49:35 GMT

We provide a persistent dereferenceable identifier, DOI: https://doi.org/10.57967/hf/2530. The STaRK retrieval datasets are licensed under CC-BY-4.0, as stated on our website. We will maintain our GitHub repository with pull requests and open issues. Code: We have provided the complete codebase in our GitHub repository. Evaluation Procedures: All evaluation procedures are thoroughly documented.

dataset, please describe, please provide, (13 more...)

Neural Information Processing Systems

Country: North America > United States > California > Santa Clara County > Palo Alto (0.06)

Industry:

Law (1.00)
Information Technology > Security & Privacy (0.94)
Government (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.52)
Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.43)

Scalable Deep Generative Relational Model with High-Order Node Dependence

Xuhui Fan, Bin Li, Caoyuan Li, Scott Sisson, Ling Chen

Neural Information Processing Systems Oct-3-2025, 09:37:21 GMT

We propose a probabilistic framework for modelling and exploring the latent structure of relational data. Given feature information for the nodes in a network, the scalable deep generative relational model (SDREM) builds a deep network architecture that can approximate potential nonlinear mappings between nodes' feature information and the nodes' latent representations. Our contribution is two-fold: (1) We incorporate high-order neighbourhood structure information to generate the latent representations at each node, which vary smoothly over the network.

deep network architecture, relational data, sdrem, (12 more...)

Neural Information Processing Systems

Country: