Goto

Should you still learn a second language if AI can translate for you?

New Scientist

AI translation apps can help you connect with other people – but at what cost? I have long remembered a conversation I had 20 years ago with one of my professors, an expert in what we then called artificial intelligence, which, in many ways, is wildly different to what we now call AI. In this exchange, he confidently told me there was no point learning a second language. Computers would soon erase language barriers, he said.


Why we forget our childhoods

Popular Science

My earliest memories are more like nostalgic flickers. The candle I burned my finger on. The plastic toy set that occupied my playtime. These disparate and vague recollections are all most of us can remember of our first years of life.


Qualitative Mechanism Independence

Neural Information Processing Systems

We define what it means for a joint probability distribution to be (QIM-)compatible with a set of independent causal mechanisms, at a qualitative level--or, more precisely, with a directed hypergraph A, which is the qualitative structure of a probabilistic dependency graph (PDG). When A represents a qualitative Bayesian network, QIM-compatibility with A reduces to satisfying the appropriate conditional independencies. But giving semantics to hypergraphs using QIM-compatibility lets us do much more. For one thing, we can capture functional dependencies. For another, QIM-compatibility captures important aspects of causality: we can use compatibility to understand cyclic causal graphs, and demonstrating compatibility is essentially producing a causal model. Finally, compatibility has deep connections to information theory. Applying compatibility to cyclic structures helps to clarify a longstanding conceptual issue in information theory.
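As a rough, hedged illustration of the special case mentioned in the abstract (when A is a qualitative Bayesian network, QIM-compatibility reduces to the usual conditional independencies), the Python sketch below checks the single local Markov condition Z independent of X given Y for the chain X -> Y -> Z on a small discrete joint distribution. The distribution, variable sizes, and tolerance are invented for the example and are not taken from the paper.

import numpy as np

# Joint distribution P(X, Y, Z) over binary variables, shape (2, 2, 2).
# Built as P(X) * P(Y|X) * P(Z|Y), so it is compatible with the chain X -> Y -> Z.
pX = np.array([0.6, 0.4])
pY_given_X = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows indexed by x, columns by y
pZ_given_Y = np.array([[0.9, 0.1], [0.5, 0.5]])   # rows indexed by y, columns by z
P = pX[:, None, None] * pY_given_X[:, :, None] * pZ_given_Y[None, :, :]

def z_indep_x_given_y(P, tol=1e-9):
    """Check P(Z | X, Y) == P(Z | Y) wherever P(X, Y) > 0."""
    pXY = P.sum(axis=2)          # P(X, Y)
    pYZ = P.sum(axis=0)          # P(Y, Z)
    pY = P.sum(axis=(0, 2))      # P(Y)
    for x in range(P.shape[0]):
        for y in range(P.shape[1]):
            if pXY[x, y] <= tol:
                continue
            if not np.allclose(P[x, y, :] / pXY[x, y], pYZ[y, :] / pY[y], atol=1e-6):
                return False
    return True

print(z_indep_x_given_y(P))      # True: the distribution satisfies the chain's independencies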


Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation

Neural Information Processing Systems

Increased training parameters have enabled large pre-trained models to excel in various downstream tasks. Nevertheless, the extensive computational requirements associated with these models hinder their widespread adoption within the community. We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model, facilitating the transfer of knowledge from large models. In contrast to much of the previous work, we scale up the parameters of the student model during training, to benefit from over-parameterization without increasing the inference latency. In particular, we propose a tensor decomposition strategy that effectively over-parameterizes the relatively small student model through an efficient and nearly lossless decomposition of its parameter matrices into higher-dimensional tensors. To ensure efficiency, we further introduce a tensor constraint loss to align the high-dimensional tensors between the student and teacher models.
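The abstract does not spell out the exact decomposition, so the sketch below only illustrates the general idea under assumed shapes: during training, a student weight matrix is represented as the contraction of two higher-dimensional factor tensors (giving extra trainable parameters), and the factors are contracted back into a single matrix for inference, leaving latency unchanged. The tensor-constraint term here is a placeholder, not the paper's actual loss.

import torch

d_out, d_in, r, k = 64, 32, 16, 4    # assumed layer sizes and factor dimensions

# Training-time over-parameterization: the student weight matrix is never stored
# directly; it is the contraction of two higher-dimensional factor tensors.
A = torch.nn.Parameter(0.02 * torch.randn(d_out, r, k))
B = torch.nn.Parameter(0.02 * torch.randn(r, k, d_in))

def collapse(A, B):
    # Contract the shared (r, k) modes to recover an ordinary d_out x d_in matrix.
    return torch.einsum('ork,rki->oi', A, B)

def student_layer(x):
    return x @ collapse(A, B).t()    # same shape and cost as a plain linear layer

# Placeholder "tensor constraint" loss: align the student's factor tensor with a
# (here random) teacher-derived tensor of the same shape.
teacher_T = torch.randn(d_out, r, k)
x = torch.randn(8, d_in)
task_loss = student_layer(x).pow(2).mean()          # stand-in for the KD objective
constraint = ((A - teacher_T) ** 2).mean()
(task_loss + 0.1 * constraint).backward()
print(A.grad.shape, B.grad.shape)                   # both factor tensors receive gradients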


Appendix: On Infinite-Width Hypernetworks

Neural Information Processing Systems

The variance was computed empirically over k = 100 normally distributed samples w. As can be seen, the variance of the kernel tends to zero only when both widths increase. In addition, convergence to the width-limit kernel is guaranteed only when the widths of both networks increase, highlighting the importance of wide architectures for both the hyper and implicit networks for stable training. The hyperkernel used corresponds to the infinite-width limit of the same architecture. For the input of g, we used random Fourier features [8] of the pixel coordinates, for both the hyperkernel and the hypernetwork.
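As a point of reference only, here is a minimal sketch of random Fourier features of 2-D pixel coordinates in the spirit of [8]; the frequency scale and feature count are assumptions, not values used in the paper.

import numpy as np

def random_fourier_features(coords, num_features=64, scale=10.0, seed=0):
    """Map 2-D pixel coordinates to [cos(2*pi*B x), sin(2*pi*B x)] features.

    coords: array of shape (n, 2), e.g. normalized pixel locations in [0, 1].
    """
    rng = np.random.default_rng(seed)
    B = rng.normal(0.0, scale, size=(2, num_features))   # random frequency matrix
    proj = 2.0 * np.pi * coords @ B                       # (n, num_features)
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1)

# Example: features for a 4x4 grid of normalized pixel coordinates.
ys, xs = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4), indexing='ij')
coords = np.stack([ys.ravel(), xs.ravel()], axis=1)       # (16, 2)
print(random_fourier_features(coords).shape)              # (16, 128)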


On Infinite-Width Hypernetworks

Neural Information Processing Systems

Hypernetworks are architectures that produce the weights of a task-specific primary network. A notable application of hypernetworks in the recent literature involves learning to output functional representations. In these scenarios, the hypernetwork learns a representation corresponding to the weights of a shallow MLP, which typically encodes shape or image information. While such representations have seen considerable success in practice, they still lack the theoretical guarantees that standard architectures enjoy in the wide regime. In this work, we study wide over-parameterized hypernetworks. We show that, unlike typical architectures, infinitely wide hypernetworks do not guarantee convergence to a global minimum under gradient descent. We further show that convexity can be achieved by increasing the dimensionality of the hypernetwork's output, to represent wide MLPs. In the dually infinite-width regime, we identify the functional priors of these architectures by deriving their corresponding GP and NTK kernels, the latter of which we refer to as the hyperkernel. As part of this study, we make a mathematical contribution by deriving tight bounds on high-order Taylor expansion terms of standard fully connected ReLU networks.
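To make the setup concrete (this is a generic sketch, not the paper's architecture, and every layer size is an assumption), the code below has a small hypernetwork g emit the flat weight vector of a shallow coordinate MLP f, which maps pixel coordinates to RGB values as a functional representation.

import torch
import torch.nn as nn

z_dim, hidden, coord_dim, f_hidden, out_dim = 16, 128, 2, 32, 3

# Hypernetwork g: maps a task/shape latent z to the flat weight vector of f.
n_weights = (coord_dim * f_hidden + f_hidden) + (f_hidden * out_dim + out_dim)
g = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_weights))

def f(coords, w):
    """Shallow primary MLP whose weights are the hypernetwork output w."""
    i = 0
    W1 = w[i:i + coord_dim * f_hidden].view(f_hidden, coord_dim); i += coord_dim * f_hidden
    b1 = w[i:i + f_hidden]; i += f_hidden
    W2 = w[i:i + f_hidden * out_dim].view(out_dim, f_hidden); i += f_hidden * out_dim
    b2 = w[i:i + out_dim]
    h = torch.relu(coords @ W1.t() + b1)
    return h @ W2.t() + b2                       # e.g. an RGB value per coordinate

z = torch.randn(z_dim)
coords = torch.rand(100, coord_dim)              # pixel coordinates in [0, 1]^2
print(f(coords, g(z)).shape)                     # torch.Size([100, 3])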


CALANet: Cheap All-Layer Aggregation for Human Activity Recognition, Jaegyun Park, Dae-Won Kim

Neural Information Processing Systems

With the steady growth of sensing technology and wearable devices, sensor-based human activity recognition (HAR) has become essential in widespread applications, such as healthcare monitoring and fitness tracking, where accurate and real-time systems are required. To achieve real-time response, recent studies have focused on lightweight neural network models. Specifically, they design network architectures by keeping the number of layers shallow or restricting the connections of each layer. However, these approaches suffer from limited accuracy because the classifier only uses the features at the last layer. In this study, we propose a cheap all-layer aggregation network, CALANet, for accuracy improvement while maintaining the efficiency of existing real-time HAR models. Specifically, CALANet allows the classifier to aggregate the features from all layers, resulting in a performance gain. In addition, this work proves that the theoretical computation cost of CALANet is equivalent to that of conventional networks. Evaluated on seven publicly available datasets, CALANet outperformed existing methods, achieving state-of-the-art performance.
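The following is only a generic sketch of the all-layer-aggregation idea (pooled features from every layer feed the classifier, instead of the last layer alone), not CALANet's actual architecture; the channel counts, kernel size, and pooling choice are assumptions.

import torch
import torch.nn as nn

class AllLayerAggregationNet(nn.Module):
    def __init__(self, in_channels=6, num_classes=10, widths=(32, 64, 128)):
        super().__init__()
        layers, c = [], in_channels
        for w in widths:
            layers.append(nn.Sequential(nn.Conv1d(c, w, kernel_size=5, padding=2),
                                        nn.BatchNorm1d(w), nn.ReLU()))
            c = w
        self.layers = nn.ModuleList(layers)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.classifier = nn.Linear(sum(widths), num_classes)   # sees every layer

    def forward(self, x):            # x: (batch, sensor_channels, time)
        pooled = []
        for layer in self.layers:
            x = layer(x)
            pooled.append(self.pool(x).squeeze(-1))             # (batch, width)
        return self.classifier(torch.cat(pooled, dim=1))        # aggregate all layers

net = AllLayerAggregationNet()
print(net(torch.randn(4, 6, 128)).shape)   # torch.Size([4, 10])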


Training Compute-Optimal Protein Language Models, Pan Li

Neural Information Processing Systems

We explore optimally training protein language models, an area of significant interest in biological research where guidance on best practices is limited. Most models are trained with extensive compute resources until performance gains plateau, focusing primarily on increasing model sizes rather than optimizing the efficient compute frontier that balances performance and compute budgets. Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We trained over 300 models, ranging from 3.5 million to 10.7 billion parameters, on 5 to 200 billion unique tokens to investigate the relations between model sizes, training token numbers, and objectives. First, we observed the effect of diminishing returns for the Causal Language Model (CLM) and that of overfitting for the Masked Language Model (MLM) when repeating the commonly used Uniref database. To address this, we included metagenomic protein sequences in the training set to increase diversity and avoid the plateau or overfitting effects. Second, we obtained scaling laws for CLM and MLM on the Transformer, tailored to the specific characteristics of protein sequence data. Third, we observed a transfer scaling phenomenon from CLM to MLM, further demonstrating the effectiveness of transfer through scaling behaviors based on estimated Effectively Transferred Tokens. Finally, to validate our scaling laws, we compare large-scale versions of ESM-2 and PROGEN2 on downstream tasks, encompassing evaluations of protein generation as well as structure- and function-related tasks, all within smaller or equivalent pre-training compute budgets.
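The abstract does not give the fitted functional form, so purely as an illustration of how such a scaling law can be extracted from runs like these, the sketch below fits a Chinchilla-style form L(N, D) = E + a N^(-alpha) + b D^(-beta) to synthetic (parameters, tokens, loss) points; every number here is made up and not from the paper.

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(ND, E, a, alpha, b, beta):
    N, D = ND                                # model parameters N, training tokens D
    return E + a * N ** (-alpha) + b * D ** (-beta)

# Synthetic (model size, token count, loss) points standing in for real runs.
rng = np.random.default_rng(0)
N = np.array([3.5e6, 3e7, 1e8, 1e9, 1e10] * 4)
D = np.repeat([5e9, 2e10, 1e11, 2e11], 5)
loss = scaling_law((N, D), 1.7, 25.0, 0.33, 400.0, 0.28) + rng.normal(0, 0.01, size=N.shape)

popt, _ = curve_fit(scaling_law, (N, D), loss,
                    p0=[1.5, 10.0, 0.3, 100.0, 0.3], maxfev=20000)
E, a, alpha, b, beta = popt
print(f"alpha={alpha:.2f}, beta={beta:.2f}")

# A compute-optimal allocation then follows by minimizing the fitted law under a
# compute constraint (roughly C ~ 6*N*D for Transformers); not shown here.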


We address here the most crucial comments in groups (R1, R2, R3 and R4 denote concerns raised by the corresponding reviewers)

Neural Information Processing Systems

We warmly thank the four reviewers for their work and constructive feedback. In the revised paper, we will of course do our best to address all reviewers' comments. This will be clarified in the introduction and in the Isolet interpretation. The reasons for that choice will be incorporated in the section concerning the NE family. See Ex. 1 (random labels, digits). Some of the many supervised indicators from the literature will be added. Other issues (R1, R3, R4): the suggested references will be incorporated.