Natural Language
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Mengyu Yang, Leonid Sigal; University of British Columbia
The recent success of transformer models in language, such as BERT, has motivated the use of such architectures for multi-modal feature learning and tasks. However, most multi-modal variants (e.g., ViLBERT) have limited themselves to visual-linguistic data. Relatively few have explored their use with audio-visual modalities, and none, to our knowledge, have demonstrated them in the context of granular audio-visual detection or segmentation tasks such as sound source separation and localization. In this work, we introduce TriBERT, a transformer-based architecture inspired by ViLBERT that enables contextual feature learning across three modalities (vision, pose, and audio) through flexible co-attention. The use of pose keypoints is inspired by recent works showing that such representations can significantly boost performance in many audio-visual scenarios where one or more persons are responsible for the sound, either explicitly (e.g., talking) or implicitly (e.g., sound produced by a human manipulating an object). From a technical perspective, as part of the TriBERT architecture, we introduce a learned visual tokenization scheme based on spatial attention and leverage weak supervision to allow granular cross-modal interactions for the visual and pose modalities. Further, we supplement learning with a sound-source separation loss formulated across all three streams. We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as on other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval, by as much as 66.7% in top-1 accuracy.
Supplementary Material for DeWave: Discrete Encoding of EEG Waves for EEG to Text Translation
In this material, we give more technical details as well as additional experiments to support the main paper. The overview of the proposed framework, DeWave, is illustrated in Figure 6.
Ground Truth: Bush attended the University of Texas at Austin, where he graduated Phi Beta Kappa with a Bachelor's degree in Latin American Studies in 1973, taking only two and a half years to complete his work, and obtaining generally excellent grades.
Predict: was the University of California at Austin in where he studied in Beta Kappa in a degree of degree in history American Studies in 1975.
ZuCo, the Zurich Cognitive Language Processing Corpus, is a dataset that includes both raw and preprocessed eye-tracking and electroencephalography (EEG) data. The data is collected by having human subjects read given text corpora while their eye-tracking signals and EEG waves are recorded simultaneously.
DeWave: Discrete EEG Waves Encoding for Brain Dynamics to Text Translation
The translation of brain dynamics into natural language is pivotal for brain-computer interfaces (BCIs). With the swift advancement of large language models, such as ChatGPT, the need to bridge the gap between the brain and language becomes increasingly pressing. Current methods, however, require eye-tracking fixations or event markers to segment brain dynamics into word-level features, which can restrict the practical application of these systems. To tackle these issues, we introduce a novel framework, DeWave, that integrates discrete encoding sequences into open-vocabulary EEG-to-text translation tasks. DeWave uses a quantized variational encoder to derive a discrete codex encoding and align it with pre-trained language models. This discrete codex representation brings two advantages: 1) it realizes translation on raw waves without markers by introducing text-EEG contrastive alignment training, and 2) it alleviates the interference caused by individual differences in EEG waves through an invariant discrete codex, with or without markers.
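To make the idea of a discrete encoding concrete, the following is a minimal sketch of a generic vector-quantization lookup, in which each continuous latent vector is replaced by the index of its nearest codebook entry. It is not DeWave's actual encoder: the function names, the toy codebook, and the use of squared Euclidean distance are all assumptions made for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda j: squared_distance(z, codebook[j]))
        for z in latents
    ]


# Hypothetical toy data: three codebook entries and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]]
codes = quantize(latents, codebook)
print(codes)  # [0, 1, 2]
```

The resulting index sequence plays the role of discrete tokens that a downstream language model could consume; in a trained system the codebook would be learned rather than fixed by hand.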
Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers is invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade (e.g., not modelling the errors of downstream models), such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
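The confidence-based deferral rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy classifiers, the threshold value, and all function names are hypothetical, and each classifier is assumed to return a list of logits.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A classifier is assumed to map an input vector to a list of logits.
Classifier = Callable[[Sequence[float]], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cascade_predict(x: Sequence[float],
                    classifiers: Sequence[Classifier],
                    thresholds: Sequence[float]) -> Tuple[int, int]:
    """Return (predicted_class, index_of_classifier_used).

    Classifier k stops the cascade when its maximum softmax probability
    reaches thresholds[k]; otherwise the input is deferred to classifier
    k + 1. The last classifier always predicts.
    """
    if not classifiers or len(thresholds) != len(classifiers) - 1:
        raise ValueError("need one threshold per non-final classifier")
    last = len(classifiers) - 1
    for k, clf in enumerate(classifiers):
        probs = softmax(clf(x))
        confidence = max(probs)
        if k == last or confidence >= thresholds[k]:
            return probs.index(confidence), k
    raise AssertionError("unreachable: the last classifier always predicts")


# Hypothetical toy models: the small one is confident only for large |x[0]|.
def small_model(x: Sequence[float]) -> List[float]:
    return [x[0], -x[0]]


def large_model(x: Sequence[float]) -> List[float]:
    return [-1.0, 1.0]


print(cascade_predict([3.0], [small_model, large_model], [0.9]))  # (0, 0)
print(cascade_predict([0.1], [small_model, large_model], [0.9]))  # (1, 1)
```

Note that the rule inspects only the current classifier's confidence; it never considers whether the downstream model would actually do better on the deferred input, which is the obliviousness the abstract refers to.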
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method requires neither bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets.
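As an illustration of aligning image and text representations with a contrastive loss, here is a minimal sketch of a generic symmetric InfoNCE-style objective over a batch of paired embeddings. It is not ALBEF's exact loss: the function names and the temperature value of 0.07 are assumptions, and embeddings are assumed to be non-zero vectors.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax of one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(image_embs: Sequence[Sequence[float]],
                     text_embs: Sequence[Sequence[float]],
                     temperature: float = 0.07) -> float:
    """Symmetric image-to-text and text-to-image contrastive loss.

    Row k of each input is assumed to be a matching image-text pair;
    every other row in the batch serves as a negative.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarities scaled by the (assumed) temperature.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    t2i = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)  # True
```

The loss is small when each image embedding is closest to its own caption and large when the pairs are mismatched, which is the sense in which the unimodal representations are aligned before any cross-modal fusion.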
Hierarchical Decision Making by Generating and Following Natural Language Instructions
Hengyuan Hu, Denis Yarats, Qucheng Gong, Yuandong Tian, Mike Lewis
We explore using natural language instructions as an expressive and compositional representation of complex actions for hierarchical decision making. Rather than directly selecting micro-actions, our agent first generates a plan in natural language, which is then executed by a separate model. We introduce a challenging real-time strategy game environment in which the actions of a large number of units must be coordinated across long time scales. We gather a dataset of 76 thousand pairs of instructions and executions from human play, and train instructor and executor models. Experiments show that models that generate intermediate plans in natural language significantly outperform models that directly imitate human actions. The compositional structure of language is conducive to learning generalizable action representations.
We would like to thank all the reviewers for their insightful and constructive feedback. This model can get a comparable win rate to the RNN-Discriminative in Table 3. Finally, we thank the reviewers for suggesting additional citations and interesting future directions. Natural language has several advantages over latent programs. Secondly, gathering supervision for natural language actions is possible with the framework we introduce.
A Proof that the objective in Equation 3 is convex when α is sufficiently small
To validate this statement, we first prove that the two factors in the objective are convex (Lemma A.1 and Lemma A.2) and that their combination preserves convexity (Lemma A.3). Lemma A.1. Similar to the proof of Lemma A.1, we have: P and Q being positive semidefinite indicates that ∀ i ∈ [1..N], 0 ≤ λ_i. Thus, P − αQ is positive semidefinite. Combining Lemma A.1, Lemma A.2 and Lemma A.3, the objective of Equation 3 is convex when α is small. We assume the observations of the triggered watermark words are independent of each other, as those words are sparsely distributed in our corpus (4 per 1000 words).
CATER: Intellectual Property Protection on Text Generation APIs via Conditional Watermarks
Previous work has shown that text generation APIs can be stolen through imitation attacks, causing IP violations. In order to protect the IP of text generation APIs, recent work has introduced a watermarking algorithm and utilized a null-hypothesis test as a post-hoc ownership verification on the imitation models. However, we find that it is possible to detect those watermarks via sufficient statistics of the frequencies of candidate watermarking words. To address this drawback, in this paper, we propose a novel Conditional wATERmarking framework (CATER) for protecting the IP of text generation APIs. An optimization method is proposed to decide the watermarking rules that can minimize the distortion of overall word distributions while maximizing the change of conditional word selections. Theoretically, we prove that it is infeasible for even the savviest attacker (one who knows how CATER works) to reveal the used watermarks from a large pool of potential word pairs based on statistical inspection. Empirically, we observe that high-order conditions lead to an exponential growth of suspicious (unused) watermarks, making our crafted watermarks more stealthy. In addition, CATER can effectively identify IP infringement under architectural mismatch and cross-domain imitation attacks, with negligible impairments on the generation quality of victim APIs. We envision our work as a milestone for stealthily protecting the IP of text generation APIs.
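The post-hoc ownership verification mentioned above relies on a null-hypothesis test. The abstract does not specify the test statistic, so the following is only a sketch under an assumed setup: each opportunity to use a watermark word is treated as an independent trial, and a one-sided binomial test asks whether a suspect model picks the watermarked choice more often than a base rate would explain. The function name, the base rate, and the counts are hypothetical.

```python
import math


def binomial_tail_p_value(hits: int, trials: int, base_rate: float) -> float:
    """One-sided p-value P(X >= hits) for X ~ Binomial(trials, base_rate).

    A small value suggests the suspect model selects the watermarked
    choice more often than chance, i.e. evidence against the null
    hypothesis that the model was not trained on watermarked outputs.
    """
    return sum(
        math.comb(trials, k) * base_rate ** k * (1.0 - base_rate) ** (trials - k)
        for k in range(hits, trials + 1)
    )


# Hypothetical counts: 8 watermarked choices in 10 opportunities, with an
# assumed 50% base rate for an unwatermarked model.
p_value = binomial_tail_p_value(hits=8, trials=10, base_rate=0.5)
print(p_value)  # 0.0546875
```

In practice the number of trials would be far larger than in this toy case, and the base rate would have to be estimated from a model known not to have been trained on watermarked outputs.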
Zero-Resource Knowledge-Grounded Dialogue Generation
Wei Wu, Yufan Zhao, Xueliang Zhao, Chongyang Tao; Peking University, Microsoft STCA, Meituan
While neural conversation models have shown great potential for generating informative and engaging responses by introducing external knowledge, learning such a model often requires knowledge-grounded dialogues that are difficult to obtain. To overcome the data challenge and reduce the cost of building a knowledge-grounded dialogue system, we explore the problem under a zero-resource setting by assuming no context-knowledge-response triples are needed for training. To this end, we propose representing the knowledge that bridges a context and a response, and the way that the knowledge is expressed, as latent variables, and devise a variational approach that can effectively estimate a generation model from a dialogue corpus and a knowledge corpus that are independent of each other. Evaluation results on three benchmarks of knowledge-grounded dialogue generation indicate that our model can achieve performance comparable to state-of-the-art methods that rely on knowledge-grounded dialogues for training, and exhibits good generalization ability over different topics and different datasets.