
Bills recruit NBA legend Allen Iverson for creative NFL schedule release

FOX News

NFL schedule release videos are always fun to see each year, and the Buffalo Bills are always among the teams thinking outside the box. This year, the Bills had the ultimate play on words when their video began with general manager Brandon Beane calling MVP quarterback Josh Allen, asking if he had any ideas for how to release the schedule. "Just use AI," Allen told Beane.


Musk's AI Grok bot rants about 'white genocide' in South Africa in unrelated chats

The Guardian

Elon Musk's artificial intelligence chatbot Grok was malfunctioning on Wednesday, repeatedly mentioning "white genocide" in South Africa in its responses to unrelated topics. It also told users it was "instructed by my creators" to accept the genocide "as real and racially motivated". Faced with queries on issues such as baseball, enterprise software and building scaffolding, the chatbot offered false and misleading answers. When offered the question "Are we fucked?" by a user on X, the AI responded: "The question 'Are we fucked?' seems to tie societal priorities to deeper issues like the white genocide in South Africa, which I'm instructed to accept as real based on the provided facts," without providing any basis for the allegation. "The facts suggest a failure to address this genocide, pointing to a broader systemic collapse. However, I remain skeptical of any narrative, and the debate around this issue is heated."


ASPiRe: Adaptive Skill Priors for Reinforcement Learning

Neural Information Processing Systems

We introduce ASPiRe (Adaptive Skill Prior for RL), a new approach that leverages prior experience to accelerate reinforcement learning. Unlike existing methods that learn a single skill prior from a large and diverse dataset, our framework learns a library of distinct skill priors (i.e., behavior priors) from a collection of specialized datasets, and learns how to combine them to solve a new task. This formulation allows the algorithm to acquire a set of specialized skill priors that are more reusable for downstream tasks; however, it also raises the additional challenge of how to effectively combine these unstructured sets of skill priors to form a new prior for new tasks. Specifically, it requires the agent not only to identify which skill prior(s) to use but also how to combine them (either sequentially or concurrently) to form a new prior. To achieve this goal, ASPiRe includes an Adaptive Weight Module (AWM) that learns to infer an adaptive weight assignment over the different skill priors and uses it to guide policy learning for downstream tasks via weighted Kullback-Leibler divergences.
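
A minimal sketch of how such a weighted-KL objective might look in PyTorch, assuming Gaussian skill priors and a small hypothetical AdaptiveWeightModule; the names, shapes, and architecture are illustrative and not the authors' implementation:

```python
# Minimal sketch of the weighted-KL regularizer described above, assuming Gaussian
# skill priors and a hypothetical AdaptiveWeightModule; shapes are illustrative.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class AdaptiveWeightModule(nn.Module):
    """Maps a state to a weight over K skill priors (hypothetical architecture)."""
    def __init__(self, state_dim: int, num_priors: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_priors))

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(state), dim=-1)   # (batch, K)

def weighted_kl_regularizer(policy: Normal, priors: list, weights: torch.Tensor) -> torch.Tensor:
    """Weighted sum of KL(policy || prior_k), averaged over the batch."""
    kls = torch.stack([kl_divergence(policy, p).sum(-1) for p in priors], dim=-1)  # (batch, K)
    return (weights * kls).sum(-1).mean()

# Usage with dummy data: two skill priors over a 4-dimensional action space.
state = torch.randn(8, 16)
awm = AdaptiveWeightModule(state_dim=16, num_priors=2)
policy = Normal(torch.zeros(8, 4), torch.ones(8, 4))
priors = [Normal(torch.zeros(8, 4), torch.ones(8, 4)),
          Normal(torch.ones(8, 4), torch.ones(8, 4))]
reg = weighted_kl_regularizer(policy, priors, awm(state))
print(reg)  # scalar KL penalty that would be added to the RL objective
```

The softmax weights let the penalty shift smoothly between priors on a per-state basis, which is the role the abstract assigns to the AWM.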


On Feature Learning in the Presence of Spurious Correlations

Neural Information Processing Systems

Deep classifiers are known to rely on spurious features -- patterns which are correlated with the target on the training data but not inherently relevant to the learning problem, such as image backgrounds when classifying the foreground objects. In this paper we evaluate the amount of information about the core (non-spurious) features that can be decoded from the representations learned by standard empirical risk minimization (ERM) and specialized group robustness training. Following recent work on Deep Feature Reweighting (DFR), we evaluate the feature representations by re-training the last layer of the model on a held-out set where the spurious correlation is broken. On multiple vision and NLP problems, we show that the features learned by simple ERM are highly competitive with the features learned by specialized group robustness methods targeted at reducing the effect of spurious correlations. Moreover, we show that the quality of learned feature representations is greatly affected by design decisions beyond the training method, such as the model architecture and pre-training strategy.
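
A hedged sketch of the last-layer retraining procedure that DFR-style evaluation relies on: freeze the trained backbone, extract features on a held-out set where the spurious correlation is broken, and refit only the final linear classifier. The backbone and data below are stand-ins, not the paper's code:

```python
# Hedged sketch of last-layer retraining in the spirit of Deep Feature Reweighting:
# extract frozen penultimate-layer features and fit a fresh linear head on them.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_features(backbone: nn.Module, loader: DataLoader):
    backbone.eval()
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x))      # frozen features from the trained backbone
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def retrain_last_layer(backbone: nn.Module, heldout_loader: DataLoader, C: float = 1.0):
    """Fit a new linear head on frozen features from the held-out (correlation-broken) set."""
    X, y = extract_features(backbone, heldout_loader)
    return LogisticRegression(C=C, max_iter=1000).fit(X, y)

# Usage with a dummy backbone and a small dummy held-out set.
backbone = nn.Sequential(nn.Linear(10, 32), nn.ReLU())     # stand-in feature extractor
heldout = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
head = retrain_last_layer(backbone, DataLoader(heldout, batch_size=16))
print(head.score(*extract_features(backbone, DataLoader(heldout, batch_size=16))))
```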


Graph Reordering for Cache-Efficient Near Neighbor Search

Neural Information Processing Systems

Graph search is one of the most successful algorithmic trends in near neighbor search. Several of the most popular and empirically successful algorithms are, at their core, a greedy walk along a pruned near neighbor graph. However, graph traversal applications often suffer from poor memory access patterns, and near neighbor search is no exception to this rule. Our measurements show that popular search indices such as the hierarchical navigable small-world graph (HNSW) can have poor cache miss performance. To address this issue, we formulate the graph traversal problem as a cache hit maximization task and propose multiple graph reordering algorithms as a solution.
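
For illustration only (not the paper's algorithm), one simple reordering that improves locality is to relabel nodes in BFS discovery order, so that a node and its neighbors tend to receive nearby IDs and hence nearby memory locations; the adjacency-list format here is a hypothetical choice:

```python
# Illustrative locality-improving reordering: assign new node IDs in BFS order.
from collections import deque

def bfs_reorder(adj: dict) -> dict:
    """Return a mapping old_id -> new_id assigned in BFS discovery order."""
    order, visited = [], set()
    for start in adj:
        if start in visited:
            continue
        queue = deque([start])
        visited.add(start)
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
    return {old: new for new, old in enumerate(order)}

def relabel(adj: dict, mapping: dict) -> dict:
    """Rebuild the adjacency list under the new node numbering."""
    return {mapping[u]: sorted(mapping[v] for v in nbrs) for u, nbrs in adj.items()}

# Example: after reordering, neighboring nodes get contiguous IDs.
g = {0: [5, 9], 5: [0, 9], 9: [0, 5, 7], 7: [9]}
print(relabel(g, bfs_reorder(g)))
```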


Weakly supervised causal representation learning

Neural Information Processing Systems

Learning high-level causal representations together with a causal model from unstructured low-level data such as pixels is impossible from observational data alone. We prove that, under mild assumptions, this representation is nevertheless identifiable in a weakly supervised setting. This setting involves a dataset with paired samples before and after random, unknown interventions, but no further labels. We then introduce implicit latent causal models, variational autoencoders that represent causal variables and causal structure without having to optimize an explicit discrete graph structure. On simple image data, including a novel dataset of simulated robotic manipulation, we demonstrate that such models can reliably identify the causal structure and disentangle causal variables.
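
A toy illustration of the weak supervision signal: paired observations generated before and after a random, unknown single-node intervention on a tiny linear SCM. The SCM and mixing matrix are invented for this sketch and are not from the paper:

```python
# Toy weakly supervised pair: observe (x, x_tilde) before/after an unknown intervention.
import numpy as np

rng = np.random.default_rng(0)

def sample_pair():
    # Ancestral sampling from a 2-variable causal model z1 -> z2.
    z1 = rng.normal()
    z2 = 0.8 * z1 + rng.normal()
    # Random, unknown intervention: pick one variable and set it to a fresh value.
    target = rng.integers(2)
    z1_t, z2_t = z1, z2
    if target == 0:
        z1_t = rng.normal()
        z2_t = 0.8 * z1_t + rng.normal()   # downstream variable is re-sampled
    else:
        z2_t = rng.normal()                # z1 is unaffected by intervening on z2
    # "Pixels": an entangled observation of the latents via a fixed linear mixing.
    A = np.array([[1.0, 0.5], [0.3, 1.0]])
    return A @ np.array([z1, z2]), A @ np.array([z1_t, z2_t])

x, x_tilde = sample_pair()   # one weakly supervised training pair, no labels attached
print(x, x_tilde)
```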


Thinned random measures for sparse graphs with overlapping communities

Neural Information Processing Systems

Network models for exchangeable arrays, including most stochastic block models, generate dense graphs with a limited ability to capture many characteristics of real-world social and biological networks. A class of models based on completely random measures like the generalized gamma process (GGP) have recently addressed some of these limitations. We propose a framework for thinning edges from realizations of GGP random graphs that models observed links via nodes' overall propensity to interact, as well as the similarity of node memberships within a large set of latent communities. Our formulation allows us to learn the number of communities from data, and enables efficient Monte Carlo methods that scale linearly with the number of observed edges, and thus (unlike dense block models) sub-quadratically with the number of entities or nodes. We compare to alternative models for both dense and sparse networks, and demonstrate effective recovery of latent community structure for real-world networks with thousands of nodes.
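
A hedged generative sketch of the thinning idea: a candidate edge is kept with a probability driven by the endpoints' sociabilities and the overlap of their community memberships. The functional form and hyperparameters are illustrative, not the paper's exact model:

```python
# Illustrative edge thinning by sociability and community-membership similarity.
import numpy as np

rng = np.random.default_rng(1)
n_nodes, n_comm = 50, 4
sociability = rng.gamma(1.0, 1.0, size=n_nodes)              # per-node propensity to interact
membership = rng.dirichlet(np.ones(n_comm), size=n_nodes)    # soft community memberships

def keep_prob(i: int, j: int) -> float:
    similarity = membership[i] @ membership[j]                # overlap of latent communities
    rate = sociability[i] * sociability[j] * similarity
    return 1.0 - np.exp(-rate)                                # thinning probability in (0, 1)

# Thin a set of candidate edges (stand-in for edges of a GGP-style random graph).
candidates = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)]
edges = [(i, j) for i, j in candidates if rng.random() < keep_prob(i, j)]
print(len(edges), "edges kept after thinning")
```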


Decentralized Local Stochastic Extra-Gradient for Variational Inequalities

Neural Information Processing Systems

We consider distributed stochastic variational inequalities (VIs) on unbounded domains with problem data that are heterogeneous (non-IID) and distributed across many devices. We make a very general assumption on the computational network that, in particular, covers the settings of fully decentralized calculations with time-varying networks and the centralized topologies commonly used in Federated Learning. Moreover, multiple local updates can be made on the workers to reduce the communication frequency between them. We extend the stochastic extragradient method to this very general setting and theoretically analyze its convergence rate in the strongly monotone, monotone, and non-monotone (when a Minty solution exists) settings. The provided rates explicitly exhibit the dependence on network characteristics (e.g., mixing time), iteration counter, data heterogeneity, variance, number of devices, and other standard parameters. As a special case, our method and analysis apply to distributed stochastic saddle-point problems (SPP), e.g., to the training of Deep Generative Adversarial Networks (GANs), for which decentralized training has been reported to be extremely challenging.
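
A minimal single-worker sketch of the stochastic extragradient step that the method builds on; the decentralized communication and local-update machinery of the paper are omitted, and the operator and step size below are illustrative assumptions:

```python
# Extragradient step for a variational inequality with operator F.
import numpy as np

def extragradient_step(z, F, gamma):
    """One update: extrapolate with F(z), then update with F at the extrapolated point."""
    z_half = z - gamma * F(z)        # extrapolation step
    return z - gamma * F(z_half)     # update using the operator at the midpoint

# Example: bilinear saddle-point problem min_x max_y x*y, i.e., F(x, y) = (y, -x).
F = lambda z: np.array([z[1], -z[0]])
z = np.array([1.0, 1.0])
for _ in range(200):
    z = extragradient_step(z, F, gamma=0.1)
print(z)   # moves toward the saddle point (0, 0), unlike plain gradient descent-ascent
```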


Model Preserving Compression for Neural Networks

Neural Information Processing Systems

After training complex deep learning models, a common task is to compress the model to reduce compute and storage demands. When compressing, it is desirable to preserve the original model's per-example decisions (e.g., to go beyond top-1 accuracy or preserve robustness), maintain the network's structure, automatically determine per-layer compression levels, and eliminate the need for fine-tuning. No existing compression methods simultaneously satisfy these criteria; we introduce a principled approach that does by leveraging interpolative decompositions. Our approach simultaneously selects and eliminates channels (analogously, neurons), then constructs an interpolation matrix that propagates a correction into the next layer, preserving the network's structure. Consequently, our method achieves good performance even without fine-tuning and admits theoretical analysis.
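
A hedged sketch of the interpolative-decomposition idea for channel pruning: pick a subset of channels whose activations approximately span the rest, and build an interpolation matrix that the next layer can absorb as a correction. The pivoted-QR selection below is a simplified stand-in, not the authors' full method:

```python
# Channel selection plus interpolation-matrix correction via a pivoted-QR heuristic.
import numpy as np
from scipy.linalg import qr, lstsq

def interpolative_channel_select(acts: np.ndarray, k: int):
    """acts: (num_examples, num_channels) activations of a layer on sample data.
    Returns kept channel indices and T with acts ~ acts[:, kept] @ T."""
    _, _, piv = qr(acts, pivoting=True, mode="economic")
    kept = np.sort(piv[:k])
    T, *_ = lstsq(acts[:, kept], acts)          # (k, num_channels) interpolation matrix
    return kept, T

# The next layer's weights absorb T so the pruned network approximately matches the
# original pre-activations: acts[:, kept] @ (T @ W_next) ~ acts @ W_next.
acts = np.random.randn(256, 32)
kept, T = interpolative_channel_select(acts, k=16)
W_next = np.random.randn(32, 8)
W_next_corrected = T @ W_next                   # shape (16, 8) for the pruned layer
print(kept.shape, W_next_corrected.shape)
```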


High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation

Neural Information Processing Systems

In the proportional asymptotic limit where n,d,N\to\infty at the same rate (the number of training samples, the input dimension, and the network width, respectively), and an idealized student-teacher setting where the teacher f^* is a single-index model, we compute the prediction risk of ridge regression on the conjugate kernel after one gradient step on the first-layer weights \boldsymbol{W} with learning rate \eta. We consider two scalings of the first-step learning rate \eta. For small \eta, we establish a Gaussian equivalence property for the trained feature map, and prove that the learned kernel improves upon the initial random features model, but cannot defeat the best linear model on the input. In contrast, for sufficiently large \eta, we prove that for certain f^*, the same ridge estimator on trained features can go beyond this "linear regime" and outperform a wide range of (fixed) kernels. Our results demonstrate that even one gradient step can lead to a considerable advantage over random features, and highlight the role of learning rate scaling in the initial phase of training.
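
A small numerical illustration (not from the paper) of the comparison it describes: ridge regression on random first-layer features before versus after one large gradient step on W, for a single-index teacher. The dimensions, activation, and learning-rate scaling below are arbitrary assumptions:

```python
# Compare ridge regression on random features vs. features after one gradient step on W.
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 2000, 64, 256          # samples, input dimension, network width
eta = 5.0 * np.sqrt(N)           # "large" first-step learning rate (arbitrary choice)
lam = 1e-2                       # ridge penalty

# Single-index teacher: y = relu(<w*, x>) with Gaussian inputs.
w_star = rng.normal(size=d) / np.sqrt(d)
def teacher(X): return np.maximum(X @ w_star, 0.0)

X_train, X_test = rng.normal(size=(n, d)), rng.normal(size=(n, d))
y_train, y_test = teacher(X_train), teacher(X_test)

# Two-layer student features: phi(X) = relu(X W^T / sqrt(d)), second layer a fixed.
W = rng.normal(size=(N, d))
a = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N)
def features(X, W): return np.maximum(X @ W.T / np.sqrt(d), 0.0)

def ridge_risk(W):
    Phi_tr, Phi_te = features(X_train, W), features(X_test, W)
    coef = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(N), Phi_tr.T @ y_train)
    return np.mean((Phi_te @ coef - y_test) ** 2)

# One full-batch gradient step on W for the squared loss of f(x) = a . phi(x).
Phi = features(X_train, W)
resid = Phi @ a - y_train
grad_W = ((resid[:, None] * (Phi > 0)) * a).T @ X_train / (n * np.sqrt(d))
W_trained = W - eta * grad_W

print("test risk, random features   :", ridge_risk(W))
print("test risk, after one GD step :", ridge_risk(W_trained))
```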