
Fairness constraints can help exact inference in structured prediction

Neural Information Processing Systems

Many inference problems in structured prediction can be modeled as maximizing a score function on a space of labels, where graphs are a natural representation to decompose the total score into a sum of unary (nodes) and pairwise (edges) scores. Given a generative model with an undirected connected graph G and true vector of binary labels y, it has been previously shown that when G has good expansion properties, such as complete graphs or d-regular expanders, one can exactly recover y (with high probability and in polynomial time) from a single noisy observation of each edge and node. We analyze the previously studied generative model by Globerson et al. (2015) under a notion of statistical parity. That is, given a fair binary node labeling, we ask the question whether it is possible to recover the fair assignment, with high probability and in polynomial time, from single edge and node observations. We find that, in contrast to the known trade-offs between fairness and model performance, the addition of the fairness constraint improves the probability of exact recovery. We effectively explain this phenomenon and empirically show how graphs with poor expansion properties, such as grids, are now capable of achieving exact recovery. Finally, as a byproduct of our analysis, we provide a tighter minimum-eigenvalue bound than that which can be derived from Weyl's inequality.
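To make the setup concrete, here is a minimal sketch of the unary-plus-pairwise score decomposition and brute-force exact inference on a toy graph. All names are illustrative; this is not the recovery algorithm analyzed in the paper, and the fairness-constrained variant would simply restrict the search to labelings satisfying statistical parity.

```python
import itertools
import numpy as np

def total_score(y, unary, pairwise, edges):
    """Total score of a binary labeling y: sum of node (unary) scores
    plus edge (pairwise) scores, as in the decomposition above."""
    s = sum(unary[i][y[i]] for i in range(len(y)))
    s += sum(pairwise[e][y[i], y[j]] for e, (i, j) in enumerate(edges))
    return s

# Brute-force exact inference on a toy 4-node cycle (exponential in n;
# for illustration only -- the paper studies polynomial-time recovery).
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 2))
pairwise = [rng.normal(size=(2, 2)) for _ in edges]
best = max(itertools.product([0, 1], repeat=4),
           key=lambda y: total_score(y, unary, pairwise, edges))
print(best)
```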


Modeling Heterogeneous Hierarchies with Relation-specific Hyperbolic Cones

Neural Information Processing Systems

Hierarchical relations are prevalent and indispensable for organizing human knowledge captured by a knowledge graph (KG). The key property of hierarchical relations is that they induce a partial ordering over the entities, which needs to be modeled in order to allow for hierarchical reasoning. However, current KG embeddings can model only a single global hierarchy (single global partial ordering) and fail to model multiple heterogeneous hierarchies that exist in a single KG. Here we present ConE (Cone Embedding), a KG embedding model that is able to simultaneously model multiple hierarchical as well as non-hierarchical relations in a knowledge graph.


Continual Learning in Low-rank Orthogonal Subspaces

Neural Information Processing Systems

In continual learning (CL), a learner is faced with a sequence of tasks, arriving one after the other, and the goal is to remember all the tasks once the continual learning experience is finished. The prior art in CL uses episodic memory, parameter regularization, or extensible network structures to reduce interference among tasks, but in the end, all these approaches learn different tasks in a joint vector space. We believe this invariably leads to interference among different tasks. We propose instead to learn tasks in different (low-rank) vector subspaces that are kept orthogonal to each other in order to minimize interference. Further, to keep the gradients of different tasks coming from these subspaces orthogonal to each other, we learn isometric mappings by posing network training as an optimization problem over the Stiefel manifold. To the best of our understanding, we report, for the first time, strong results over the experience-replay baseline, with and without memory, on standard classification benchmarks in continual learning.
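The orthogonality argument can be checked numerically: gradients confined to mutually orthogonal low-rank subspaces have zero inner product and hence cannot interfere. A minimal NumPy sketch of this fact (illustrative only; not the paper's Stiefel-manifold training procedure):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                      # ambient dimension, subspace rank per task
Q, _ = np.linalg.qr(rng.normal(size=(d, 2 * r)))
B1, B2 = Q[:, :r], Q[:, r:]       # orthonormal bases of two orthogonal subspaces

# Project each task's raw gradient onto its own subspace.
g1 = B1 @ (B1.T @ rng.normal(size=d))
g2 = B2 @ (B2.T @ rng.normal(size=d))

# Interference (inner product of the task gradients) vanishes up to float error.
print(abs(g1 @ g2))               # ~1e-15
```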


RGMDT: Return-Gap-Minimizing Decision Tree Extraction in Non-Euclidean Metric Space

Neural Information Processing Systems

Deep Reinforcement Learning (DRL) algorithms have achieved great success on many challenging tasks, but their black-box nature hinders interpretability and real-world applicability, making it difficult for human experts to interpret and understand DRL policies. Existing work on interpretable reinforcement learning has shown promise in extracting decision tree (DT) based policies from DRL policies, mostly in single-agent settings; prior attempts to introduce DT policies in multi-agent scenarios mainly rely on heuristic designs that do not provide quantitative guarantees on the expected return. In this paper, we establish an upper bound on the return gap between the oracle expert policy and an optimal decision tree policy. This enables us to recast the DT extraction problem as a novel non-Euclidean clustering problem over the local observation and action-value space of each agent, with action values as cluster labels and the upper bound on the return gap as the clustering loss. Both the algorithm and the upper bound extend to multi-agent decentralized DT extraction via an iteratively-grow-DT procedure guided by an action-value function conditioned on the current DTs of other agents. Building on this, we propose the Return-Gap-Minimization Decision Tree (RGMDT) algorithm, a surprisingly simple design that is integrated with reinforcement learning through a novel Regularized Information Maximization loss. Evaluations on tasks like D4RL show that RGMDT significantly outperforms heuristic DT-based baselines and can achieve nearly optimal returns under given DT complexity constraints (e.g., maximum number of DT nodes).
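For context, a plain decision-tree distillation baseline looks as follows: fit a size-constrained tree to (observation, action) pairs collected from the expert. This is a generic behavioral-cloning sketch with placeholder environment hooks (env_reset, env_step), not the RGMDT algorithm; RGMDT differs precisely in guiding the tree via the return-gap bound and the Regularized Information Maximization loss.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def distill_dt(policy, env_reset, env_step, episodes=100, max_leaf_nodes=32):
    """Fit a size-constrained decision tree to (observation, action) pairs
    collected from an expert policy. A plain behavioral-cloning baseline;
    env_reset/env_step are placeholder environment hooks."""
    X, y = [], []
    for _ in range(episodes):
        obs, done = env_reset(), False
        while not done:
            a = policy(obs)                 # query the expert
            X.append(obs); y.append(a)
            obs, done = env_step(a)         # placeholder step signature
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    tree.fit(np.array(X), np.array(y))
    return tree
```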


Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs

arXiv.org Artificial Intelligence

We propose a new algorithm for fine-tuning large language models using reinforcement learning. Tapered Off-Policy REINFORCE (TOPR) uses an asymmetric, tapered variant of importance sampling to speed up learning while maintaining stable learning dynamics, even without KL regularization. TOPR can be applied in a fully offline fashion, handles positive and negative examples in a unified framework, and benefits from the implementational simplicity typical of Monte Carlo algorithms. We demonstrate the effectiveness of our approach with a series of experiments on the GSM8K and MATH reasoning benchmarks, finding performance gains both when training a model for solution generation and when training it as a generative verifier. We show that properly leveraging positive and negative examples alike in the off-policy regime simultaneously increases test-time accuracy and training data efficiency, all while avoiding the "wasted inference" that comes with discarding negative examples. We find that this advantage persists over multiple iterations of training and can be amplified by dataset curation techniques, enabling us to match 70B-parameter model performance with 8B language models. As a corollary to this work, we find that REINFORCE's baseline parameter plays an important and unexpected role in defining dataset composition in the presence of negative examples, and is consequently critical in driving off-policy performance.
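As background, the mechanism TOPR builds on is importance-sampled REINFORCE with a truncated ratio. The sketch below shows only that generic building block; TOPR's asymmetric taper, which treats positive and negative examples differently, is defined in the paper and not reproduced here.

```python
import torch

def truncated_is_reinforce_loss(logp_new, logp_old, rewards, baseline=0.0, cap=1.0):
    """Generic off-policy REINFORCE with a truncated importance ratio.
    A sketch of the mechanism TOPR builds on, not TOPR's asymmetric taper."""
    rho = torch.exp(logp_new - logp_old).detach()   # importance ratio, no gradient
    rho = torch.clamp(rho, max=cap)                 # truncate to control variance
    advantage = rewards - baseline                  # the baseline the paper analyzes
    # Gradient of this surrogate is -rho * advantage * grad log pi_new.
    return -(rho * advantage * logp_new).mean()
```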


A Simplicial Complexes, Cycles, Barcodes

Neural Information Processing Systems

A.1 Background. A simplicial complex is a combinatorial object that can be thought of as a higher-dimensional generalization of a graph. A simplicial complex S is a collection of k-simplices, which are (k + 1)-element subsets of a given set V, for k ≥ 0. The collection S must satisfy the closure condition that for each σ ∈ S, every nonempty subset of σ also belongs to S. A simplicial complex consisting only of 0- and 1-simplices is a graph. Let (Γ, m) be a weighted graph with distance-like weights, where m is the symmetric matrix of weights attached to the edges of the graph Γ. Even though such weighted graphs do not always come from a set of points in a metric space, barcodes of weighted graphs have been successfully applied in many situations (networks, fMRI, medical data, graph classification, etc.). A.2 Simplices describing discrepancies between the two manifolds. Here we gather more details on the construction of the sets of simplices that describe discrepancies between two point clouds P and Q sampled from the two distributions P, Q.
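The closure condition in the definition above is easy to check in code. A minimal sketch (not the authors' code):

```python
from itertools import combinations

def is_simplicial_complex(S):
    """Check the closure condition: every nonempty proper subset (face)
    of each simplex must itself belong to the collection S."""
    S = {frozenset(s) for s in S}
    return all(frozenset(f) in S
               for s in S
               for k in range(1, len(s))
               for f in combinations(s, k))

# A triangle with all its edges and vertices is a valid complex...
triangle = [{0}, {1}, {2}, {0, 1}, {1, 2}, {0, 2}, {0, 1, 2}]
print(is_simplicial_complex(triangle))                     # True
# ...but a 2-simplex missing its edges is not.
print(is_simplicial_complex([{0}, {1}, {2}, {0, 1, 2}]))   # False
```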


Supplementary Material: Structured Prediction for Conditional Meta-Learning

Neural Information Processing Systems

The Appendix is organized in two main parts: Appendix A proves the formal version of Theorem 1 and provides additional details on the connection between structured prediction and conditional meta-learning investigated in this work. Appendix B provides additional details on the model hyperparameters and additional experimental evaluation. We first recall the general formulation of the structured prediction approach of [13], and then show how the conditional meta-learning problem introduced in Section 3 can be cast within this setting. A.1 General Structured Prediction. In this section, we borrow the notation of [13]. Let X, Y and Z be three spaces: respectively the input, label, and output sets of our problem.


INDIGO: GNN-Based Inductive Knowledge Graph Completion Using Pair-Wise Encoding

Neural Information Processing Systems

The aim of knowledge graph (KG) completion is to extend an incomplete KG with missing triples. Popular approaches based on graph embeddings typically work by first representing the KG in a vector space, and then applying a predefined scoring function to the resulting vectors to complete the KG. These approaches work well in transductive settings, where predicted triples involve only constants seen during training; however, they are not applicable in inductive settings, where the KG on which the model was trained is extended with new constants or merged with other KGs. The use of Graph Neural Networks (GNNs) has recently been proposed as a way to overcome these limitations; however, existing approaches do not fully exploit the capabilities of GNNs and still rely on heuristics and ad hoc scoring functions. In this paper, we propose a novel approach, where the KG is fully encoded into a GNN in a transparent way, and where the predicted triples can be read out directly from the last layer of the GNN without the need for additional components or scoring functions. Our experiments show that our model outperforms state-of-the-art approaches on inductive KG completion benchmarks.


How much do LLMs learn from negative examples?

arXiv.org Artificial Intelligence

Large language models (LLMs) undergo a three-phase training process: unsupervised pre-training, supervised fine-tuning (SFT), and learning from human feedback (RLHF/DPO). Notably, it is during the final phase that these models are exposed to negative examples -- incorrect, rejected, or suboptimal responses to queries. This paper delves into the role of negative examples in the training of LLMs, using a likelihood-ratio (Likra) model on multiple-choice question answering benchmarks to precisely manage the influence and the volume of negative examples. Our findings reveal three key insights: (1) During a critical phase in training, Likra with negative examples demonstrates a significantly larger improvement per training example compared to SFT using only positive examples. This leads to a sharp jump in the learning curve for Likra unlike the smooth and gradual improvement of SFT; (2) negative examples that are plausible but incorrect (near-misses) exert a greater influence; and (3) while training with positive examples fails to significantly decrease the likelihood of plausible but incorrect answers, training with negative examples more accurately identifies them. These results indicate a potentially significant role for negative examples in improving accuracy and reducing hallucinations for LLMs.
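The likelihood-ratio idea can be sketched as follows: score each candidate answer by its log-likelihood under a model tuned on positive examples minus its log-likelihood under a model tuned on negative examples. The models below are placeholder HuggingFace-style causal LMs, and the exact Likra parameterization in the paper may differ.

```python
import torch

def likelihood_ratio_scores(pos_model, neg_model, question_ids, choice_ids_list):
    """Score each multiple-choice answer by
    log p_pos(answer | question) - log p_neg(answer | question).
    pos_model/neg_model are placeholder causal LMs (HF-style API) fine-tuned
    on positive and negative examples; a sketch of the Likra idea only."""
    scores = []
    for choice_ids in choice_ids_list:
        ids = torch.cat([question_ids, choice_ids]).unsqueeze(0)

        def answer_logprob(model):
            logits = model(ids).logits[0, :-1]            # next-token logits
            logp = logits.log_softmax(-1)
            targets = ids[0, 1:]
            token_lp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
            return token_lp[-len(choice_ids):].sum()      # answer tokens only

        scores.append(answer_logprob(pos_model) - answer_logprob(neg_model))
    return torch.stack(scores)
```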


Metric Space Magnitude for Evaluating the Diversity of Latent Representations

Neural Information Processing Systems

The magnitude of a metric space is a novel invariant that provides a measure of the 'effective size' of a space across multiple scales, while also capturing numerous geometrical properties, such as curvature, density, or entropy. We develop a family of magnitude-based measures of the intrinsic diversity of latent representations, formalising a novel notion of dissimilarity between magnitude functions of finite metric spaces. Our measures are provably stable under perturbations of the data, can be efficiently calculated, and enable a rigorous multi-scale characterisation and comparison of latent representations. We show their utility and superior performance across different domains and tasks, including the automated estimation of diversity, the detection of mode collapse, and the evaluation of generative models for text, image, and graph data.
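For reference, the magnitude of a finite metric space has a standard closed form (due to Leinster): with similarity matrix Z_ij = exp(-t · d(x_i, x_j)), the magnitude at scale t is the sum of the entries of Z^{-1}, and the magnitude function is the map t ↦ |tX|. A minimal sketch of that standard computation (not the authors' dissimilarity measure between magnitude functions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def magnitude_function(X, ts):
    """Magnitude |tX| at each scale t: the sum of all entries of Z^{-1},
    where Z_ij = exp(-t * d(x_i, x_j)). Computed via a linear solve,
    since 1^T Z^{-1} 1 = sum(solve(Z, ones))."""
    D = cdist(X, X)
    ones = np.ones(len(X))
    return np.array([np.linalg.solve(np.exp(-t * D), ones).sum() for t in ts])

points = np.random.default_rng(0).normal(size=(50, 2))
print(magnitude_function(points, ts=[0.5, 1.0, 10.0]))
# The magnitude grows toward n = 50 as the scale t increases.
```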