Kabeli, Ori
Aviary: training language agents on challenging scientific tasks
Narayanan, Siddharth, Braza, James D., Griffiths, Ryan-Rhys, Ponnapati, Manu, Bou, Albert, Laurent, Jon, Kabeli, Ori, Wellawatte, Geemi, Cox, Sam, Rodriques, Samuel G., White, Andrew D.
Language agents [1-4] are AI agents [5] that integrate LLMs [6-8] as core components. LLMs excel at zero-shot generalization [9, 10], providing a notable advantage over traditional AI agents, such as those based on handcrafted rules or reinforcement learning, which often struggle to generalize to new environments [11]. While LLMs can exhibit flawed reasoning and logic when used in isolation [12-14], constructing a language agent by grounding LLMs in an environment with observational feedback can mitigate these issues. Early work on language agents used LLMs to directly output actions in the external environment [15-17], while more recently, language agents have been augmented with internal reasoning [18, 19] and planning [20, 21] procedures, as well as long-term memory storage [22, 23]. An emerging research challenge is to formulate a theoretical description of the learning problem solved by language agents [4, 24] and to develop efficient methods to optimize the components of a language agent [24-26]. Here, we define common language agent tasks as language decision processes (LDPs) and frame language agents as stochastic computation graphs [27] that may be trained to solve LDPs. We show that pre-existing agents [18, 19, 21] can be implemented within our stochastic computation graph framework and introduce a simple and extensible software package named LDP that enables modular interchange of environments, agents, and optimizers, simplifying experimentation across a variety of settings.
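To make the environment/agent/optimizer decomposition described above concrete, here is a minimal Python sketch of an agent-environment rollout loop over a toy language task. All names (`Environment`, `Agent`, `rollout`, `CountdownEnv`, `EchoAgent`) are hypothetical illustrations and do not reflect the actual API of the LDP package.

```python
# Illustrative sketch only: a toy language environment and a stand-in agent,
# with a rollout loop whose transitions an optimizer could later consume.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Transition:
    observation: str
    action: str
    reward: float
    done: bool


class Environment(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, float, bool]: ...


class Agent(Protocol):
    def act(self, observation: str) -> str: ...


class CountdownEnv:
    """Toy language environment: the agent must say 'stop' within 3 turns."""

    def __init__(self) -> None:
        self.turns = 0

    def reset(self) -> str:
        self.turns = 0
        return "Say 'stop' within 3 turns."

    def step(self, action: str) -> tuple[str, float, bool]:
        self.turns += 1
        if action.strip().lower() == "stop":
            return "Done.", 1.0, True
        if self.turns >= 3:
            return "Out of turns.", 0.0, True
        return f"Turn {self.turns}: still waiting.", 0.0, False


class EchoAgent:
    """Stand-in for an LLM-backed policy: always answers 'stop'."""

    def act(self, observation: str) -> str:
        return "stop"


def rollout(agent: Agent, env: Environment) -> list[Transition]:
    """Collect one trajectory; an optimizer would train on these transitions."""
    obs, trajectory, done = env.reset(), [], False
    while not done:
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)
        trajectory.append(Transition(obs, action, reward, done))
        obs = next_obs
    return trajectory


if __name__ == "__main__":
    traj = rollout(EchoAgent(), CountdownEnv())
    print(f"return = {sum(t.reward for t in traj):.1f}")
```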
Decoding speech perception from non-invasive brain recordings
Défossez, Alexandre, Caucheteux, Charlotte, Rapin, Jérémy, Kabeli, Ori, King, Jean-Rémi
Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in that regard: deep learning algorithms trained on intracranial recordings now start to decode elementary linguistic features (e.g. letters, words, spectrograms). However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here, we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto- or electro-encephalography (M/EEG) while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and more than 80% in the very best participants, a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model to a variety of baselines highlights the importance of (i) a contrastive objective, (ii) pretrained representations of speech, and (iii) a common convolutional architecture simultaneously trained across multiple participants. Finally, the analysis of the decoder's predictions suggests that they primarily depend on lexical and contextual semantic representations. Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decoding language from brain activity without putting patients at risk of brain surgery.
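As a rough illustration of the contrastive objective described above, the PyTorch sketch below aligns embeddings of M/EEG windows with pretrained speech embeddings through a symmetric InfoNCE loss. The convolutional encoder, channel count, embedding size, and temperature are illustrative assumptions, not the paper's architecture or hyperparameters.

```python
# Sketch of a CLIP-style contrastive objective between brain and speech
# embeddings; matching (brain, speech) pairs share the same batch index.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BrainEncoder(nn.Module):
    """Toy convolutional encoder mapping M/EEG windows to a shared space."""

    def __init__(self, n_channels: int = 208, dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        h = self.conv(x).mean(dim=-1)            # pool over time -> (B, dim)
        return F.normalize(self.proj(h), dim=-1)


def contrastive_loss(brain: torch.Tensor, speech: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE over a (B, B) similarity matrix."""
    speech = F.normalize(speech, dim=-1)
    logits = brain @ speech.t() / temperature
    targets = torch.arange(brain.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    meg = torch.randn(8, 208, 360)      # batch of 3 s MEG windows (assumed shape)
    speech_emb = torch.randn(8, 256)    # e.g. pooled self-supervised speech features
    loss = contrastive_loss(BrainEncoder()(meg), speech_emb)
    print(loss.item())
```

At inference time, the same similarity matrix can be used for retrieval: the decoded segment is the speech candidate whose embedding is closest to the brain embedding, which is how a "1-in-1,000 possibilities" identification setup can be scored.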
Online Self-Attentive Gated RNNs for Real-Time Speaker Separation
Kabeli, Ori, Adi, Yossi, Tang, Zhenyu, Xu, Buye, Kumar, Anurag
Deep neural networks have recently shown great success in the task of blind source separation, under both monaural and binaural settings. Although these methods were shown to produce high-quality separations, they were mainly applied under offline settings, in which the model has access to the full input signal while performing separation. In this study, we convert a non-causal state-of-the-art separation model into a causal, real-time model and evaluate its performance under both online and offline settings. We compare the performance of the proposed model to several baseline methods under anechoic, noisy, and noisy-reverberant recording conditions while exploring both monaural and binaural inputs and outputs. Our findings shed light on the relative difference between causal and non-causal models when performing separation. Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model: 0.8 dB for monaural inputs and 0.3 dB for binaural inputs, while reaching a real-time factor of 0.65. Samples can be found at the following link: https://kwanum.github.io/sagrnnc-stream-results/.
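The sketch below illustrates what stateful, chunk-by-chunk (causal) inference of this kind can look like: a unidirectional recurrent model whose hidden state is carried from one audio chunk to the next, so the model only ever conditions on past samples. `StreamingSeparator`, its layer sizes, and the two-speaker mask head are hypothetical stand-ins, not the paper's self-attentive gated RNN.

```python
# Sketch of stateful streaming separation: the GRU hidden state persists
# across chunks, which is what makes the processing causal and real-time.
import torch
import torch.nn as nn


class StreamingSeparator(nn.Module):
    def __init__(self, n_features: int = 64, hidden: int = 128, n_spk: int = 2):
        super().__init__()
        self.encoder = nn.Conv1d(1, n_features, kernel_size=16, stride=8)
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)  # unidirectional => causal
        self.mask = nn.Linear(hidden, n_features * n_spk)
        self.decoder = nn.ConvTranspose1d(n_features, 1, kernel_size=16, stride=8)
        self.n_spk, self.n_features = n_spk, n_features

    def forward(self, chunk: torch.Tensor, state=None):
        """chunk: (B, 1, T) waveform; state: GRU hidden carried between calls."""
        feats = self.encoder(chunk)                       # (B, F, T')
        h, state = self.rnn(feats.transpose(1, 2), state) # (B, T', H)
        masks = torch.sigmoid(self.mask(h))               # (B, T', F * n_spk)
        masks = masks.view(h.size(0), h.size(1), self.n_spk, self.n_features)
        outs = [self.decoder((feats.transpose(1, 2) * masks[:, :, s]).transpose(1, 2))
                for s in range(self.n_spk)]
        return torch.stack(outs, dim=1), state            # (B, n_spk, 1, T), state


if __name__ == "__main__":
    model = StreamingSeparator().eval()
    mixture = torch.randn(1, 1, 16000)                    # 1 s of 16 kHz audio
    state = None
    with torch.no_grad():
        for start in range(0, mixture.size(-1), 2000):    # 125 ms chunks
            chunk = mixture[..., start:start + 2000]
            separated, state = model(chunk, state)        # hidden state is reused
            print(separated.shape)
```

The real-time factor quoted in the abstract is simply the per-chunk compute time divided by the chunk duration; a value below 1 means each chunk is processed faster than it arrives.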