Lee, Moontae
When to Read Documents or QA History: On Unified and Selective Open-domain QA
Lee, Kyungjae, Han, Sang-eun, Hwang, Seung-won, Lee, Moontae
Open-domain question answering is a well-known task in natural language processing, aiming to answer factoid questions from an open set of domains. One commonly used approach for this task is the retrieve-then-read pipeline (also known as Open-book QA), which retrieves relevant knowledge and then reasons over that knowledge to answer. Given the wide range of topics that open-domain questions can cover, a key to a successful answering model is to access and utilize diverse knowledge sources effectively. Toward this goal, existing work can be categorized by the knowledge source used: Document Corpus-based QA (Doc-QA) utilizes a general-domain document corpus (e.g., Wikipedia) (Karpukhin et al., 2020). Figure 1 illustrates the distinction of our approach, which provides both kinds of knowledge to a unified reader as context: we retrieve a list of relevant QA-pairs (called the QA-history), then treat the few retrieved QA examples as if they were relevant document passages. Meanwhile, the closest approach to using multiple knowledge sources concatenates the sources uniformly into a single decoder (Oguz et al., 2020), but we argue that knowledge selection is critically missing. As motivation, Figure 1 shows a QA-history from which the answer 'Eric Liddell' is explicitly identified, while it is more implicit in the document, such that another name such as 'Hugh Hudson' is known to often confuse QA models. It is therefore critical for the QA model to calibrate prediction quality as an indicator to decide when to use a given knowledge source.
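As a toy illustration of the two ideas above, the snippet below verbalizes a retrieved QA-history into a pseudo-passage and routes between sources by calibrated confidence. This is a hedged sketch, not the paper's code: the function names, example pairs, and confidence values are all made up for illustration.

def qa_history_as_passage(qa_pairs):
    """Verbalize retrieved (question, answer) pairs into one pseudo-passage."""
    return " ".join(f"Q: {q} A: {a}" for q, a in qa_pairs)

def select_answer(doc_pred, doc_conf, hist_pred, hist_conf):
    """Pick the prediction whose calibrated confidence is higher."""
    return doc_pred if doc_conf >= hist_conf else hist_pred

qa_history = [("Who directed Chariots of Fire?", "Hugh Hudson"),
              ("Who won the 400m at the 1924 Olympics?", "Eric Liddell")]
print(qa_history_as_passage(qa_history))      # fed to the reader like a passage
print(select_answer("Hugh Hudson", 0.4, "Eric Liddell", 0.9))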
Unsupervised Task Graph Generation from Instructional Video Transcripts
Logeswaran, Lajanugen, Sohn, Sungryull, Jang, Yunseok, Lee, Moontae, Lee, Honglak
This work explores the problem of generating task graphs of real-world activities. Different from prior formulations, we consider a setting where text transcripts of instructional videos performing a real-world activity (e.g., making coffee) are provided and the goal is to identify the key steps relevant to the task as well as the dependency relationship between these key steps. We propose a novel task graph generation approach that combines the reasoning capabilities of instruction-tuned language models along with clustering and ranking components to generate accurate task graphs in a completely unsupervised manner. We show that the proposed approach generates more accurate task graphs compared to a supervised learning approach on tasks from the ProceL and CrossTask datasets.
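To make the clustering-and-ranking stage concrete, here is a minimal sketch under stated assumptions: the instruction-tuned LM is assumed to have already extracted candidate step phrases per transcript, TF-IDF vectors stand in for learned representations, and the induced graph is simplified to a linear chain of ranked step clusters.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
import numpy as np

transcripts = [
    ["grind the beans", "boil water", "pour water over grounds"],
    ["boil some water", "grind coffee beans", "pour the water slowly"],
]
phrases = [p for t in transcripts for p in t]
positions = [i / (len(t) - 1) for t in transcripts for i, _ in enumerate(t)]

# Cluster near-duplicate step phrases into key steps.
X = TfidfVectorizer().fit_transform(phrases).toarray()
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Rank key steps by mean normalized position, then link consecutive steps.
order = sorted(set(labels),
               key=lambda c: np.mean([positions[i] for i in range(len(phrases))
                                      if labels[i] == c]))
edges = list(zip(order, order[1:]))
print("step order:", order, "dependency edges:", edges)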
Rebalancing Batch Normalization for Exemplar-based Class-Incremental Learning
Cha, Sungmin, Cho, Sungjun, Hwang, Dasol, Hong, Sunwon, Lee, Moontae, Moon, Taesup
Batch Normalization (BN) and its variants have been extensively studied for neural nets in various computer vision tasks, but relatively little work has been dedicated to studying the effect of BN in continual learning. To that end, we develop a new update patch for BN, particularly tailored for exemplar-based class-incremental learning (CIL). The main issue of BN in CIL is the imbalance of training data between current and past tasks in a mini-batch, which makes the empirical mean and variance as well as the learnable affine transformation parameters of BN heavily biased toward the current task, contributing to the forgetting of past tasks. While one of the recent BN variants has been developed for "online" CIL, in which training is done with a single epoch, we show that their method does not necessarily bring gains for "offline" CIL, in which a model is trained with multiple epochs on the imbalanced training data. The main reason for the ineffectiveness of their method lies in not fully addressing the data imbalance issue, especially in computing the gradients for learning the affine transformation parameters of BN. Accordingly, we propose a new hyperparameter-free variant, dubbed Task-Balanced BN (TBBN), which more correctly resolves the imbalance issue by making a horizontally-concatenated task-balanced batch using both reshape and repeat operations during training. Based on our experiments on class-incremental learning of CIFAR-100, ImageNet-100, and five dissimilar task datasets, we demonstrate that TBBN, which works exactly the same as vanilla BN at inference time, is easily applicable to most existing exemplar-based offline CIL algorithms and consistently outperforms other BN variants.
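The reshape-and-repeat construction suggests the following minimal PyTorch sketch, which is our reading of the abstract rather than the authors' implementation: exemplars of each past task are repeated until every task contributes equally to the batch over which BN statistics are computed.

import torch

def task_balanced_batch(current, exemplars_per_task):
    """current: (B_c, C, H, W); exemplars_per_task: list of (B_i, C, H, W)."""
    per_task = current.shape[0]
    balanced = [current]
    for ex in exemplars_per_task:
        reps = -(-per_task // ex.shape[0])        # ceiling division
        balanced.append(ex.repeat(reps, 1, 1, 1)[:per_task])
    return torch.cat(balanced, dim=0)             # task-balanced batch

current = torch.randn(16, 3, 8, 8)
exemplars = [torch.randn(4, 3, 8, 8), torch.randn(4, 3, 8, 8)]
batch = task_balanced_batch(current, exemplars)
mean = batch.mean(dim=(0, 2, 3))                  # balanced BN statistics
var = batch.var(dim=(0, 2, 3), unbiased=False)
print(batch.shape, mean.shape, var.shape)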
Towards More Objective Evaluation of Class Incremental Learning: Representation Learning Perspective
Cha, Sungmin, Kwak, Jihwan, Shim, Dongsub, Kim, Hyunwoo, Lee, Moontae, Lee, Honglak, Moon, Taesup
Class incremental learning (CIL) is the process of continually learning new object classes from incremental data while not forgetting past learned classes. While the common method for evaluating CIL algorithms is based on average test accuracy for all learned classes, we argue that maximizing accuracy alone does not necessarily lead to effective CIL algorithms. In this paper, we experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning and propose a new analysis method. Our experiments show that most state-of-the-art algorithms prioritize high stability and do not significantly change the learned representation, and sometimes even learn a representation of lower quality than a naive baseline. However, we observe that these algorithms can still achieve high test accuracy because they learn a classifier that is closer to the optimal classifier. We also find that the representation quality of the base model learned in the first task varies across algorithms, and that final performance changes when each algorithm is trained from base models of similar representation quality. Thus, we suggest that representation-level evaluation is an additional recipe for more objective evaluation and effective development of CIL algorithms.
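As one concrete instance of representation-level evaluation, the sketch below runs a standard linear probe: the backbone is frozen and only a linear classifier is trained on its features, so probe accuracy serves as a proxy for representation quality. The model, data, and hyperparameters are illustrative placeholders, not the paper's setup.

import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
for p in backbone.parameters():
    p.requires_grad = False                     # representation stays fixed
probe = nn.Linear(128, 10)                      # only the probe is trained

x, y = torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(probe(backbone(x)), y)
    opt.zero_grad(); loss.backward(); opt.step()

acc = (probe(backbone(x)).argmax(1) == y).float().mean()
print(f"linear-probe accuracy (proxy for representation quality): {acc:.2f}")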
Learning to Unlearn: Instance-wise Unlearning for Pre-trained Classifiers
Cha, Sungmin, Cho, Sungjun, Hwang, Dasol, Lee, Honglak, Moon, Taesup, Lee, Moontae
Since the recent advent of regulations for data protection (e.g., the General Data Protection Regulation), there has been increasing demand for deleting information learned from sensitive data in pre-trained models without retraining from scratch. The inherent vulnerability of neural networks towards adversarial attacks and unfairness also calls for a robust method to remove or correct information in an instance-wise fashion, while retaining predictive performance across the remaining data. To this end, we define instance-wise unlearning, whose goal is to delete information on a set of instances from a pre-trained model, by either misclassifying each instance away from its original prediction or relabeling the instance to a different label. We also propose two methods that reduce forgetting on the remaining data: 1) utilizing adversarial examples to overcome forgetting at the representation level and 2) leveraging weight importance metrics to pinpoint network parameters guilty of propagating unwanted information. Both methods only require the pre-trained model and the data instances to forget, allowing painless application to real-life settings where the entire training set is unavailable. Through extensive experimentation on various image classification benchmarks, we show that our approach effectively preserves knowledge of the remaining data while unlearning given instances in both single-task and continual unlearning scenarios.
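A minimal sketch of the relabeling flavor of instance-wise unlearning, assuming (as in the paper's setting) that only the pre-trained model and the forget instances are available; the adversarial-example and weight-importance components are omitted, and the toy model and data are placeholders.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x_forget = torch.randn(8, 3, 32, 32)            # instances to unlearn

with torch.no_grad():
    original = model(x_forget).argmax(1)        # the model's current predictions
# Relabel each instance to a different (here: random) class.
new_labels = (original + torch.randint(1, 10, original.shape)) % 10

opt = torch.optim.SGD(model.parameters(), lr=1e-2)
for _ in range(50):
    loss = nn.functional.cross_entropy(model(x_forget), new_labels)
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    flipped = (model(x_forget).argmax(1) != original).float().mean()
print(f"fraction moved off their original prediction: {flipped:.2f}")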
Multimodal Subtask Graph Generation from Instructional Videos
Jang, Yunseok, Sohn, Sungryull, Logeswaran, Lajanugen, Luo, Tiange, Lee, Moontae, Lee, Honglak
Real-world tasks consist of multiple inter-dependent subtasks (e.g., a dirty pan needs to be washed before it can be used for cooking). In this work, we aim to model the causal dependencies between such subtasks from instructional videos describing the task. This is a challenging problem since complete information about the world is often inaccessible from videos, which demands robust learning mechanisms to understand the causal structure of events. We present Multimodal Subtask Graph Generation (MSG2), an approach that constructs a Subtask Graph defining the dependencies between a task's subtasks from noisy web videos. Graphs generated by our multimodal approach are closer to human-annotated graphs compared to prior approaches. MSG2 further performs the downstream task of next subtask prediction 85% and 30% more accurately than recent video transformer models on the ProceL and CrossTask datasets, respectively.
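As a toy illustration of the target output structure only, the sketch below infers subtask dependency edges by precedence counting over symbolic subtask sequences; the actual MSG2 pipeline fuses video and text features, which this stand-in glosses over entirely, and the majority threshold is an arbitrary choice.

from itertools import combinations
from collections import Counter

sequences = [                        # subtask orders observed across videos
    ["wash pan", "heat pan", "add oil", "cook"],
    ["wash pan", "add oil", "heat pan", "cook"],
    ["wash pan", "heat pan", "cook"],
]
prec = Counter()
for seq in sequences:
    for a, b in combinations(seq, 2):    # a occurs before b in this video
        prec[(a, b)] += 1

# Keep an edge a -> b only when a precedes b consistently across videos.
edges = [(a, b) for (a, b), n in prec.items() if n >= 2 and prec[(b, a)] == 0]
print(edges)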
Exploring the Benefits of Training Expert Language Models over Instruction Tuning
Jang, Joel, Kim, Seungone, Ye, Seonghyeon, Kim, Doyoung, Logeswaran, Lajanugen, Lee, Moontae, Lee, Kyungjae, Seo, Minjoon
Recently, Language Models (LMs) instruction-tuned on multiple tasks, a setup also known as multitask-prompted fine-tuning (MT), have shown the capability to generalize to unseen tasks. Previous work has shown that scaling the number of training tasks is the key component in making stronger MT LMs. In this work, we report an unexpected finding that an expert LM fine-tuned on just a single task can outperform an MT LM trained with 300+ different tasks on 11 different unseen datasets and on 13 datasets of the BIG-bench benchmark by a mean accuracy of 3.20% and 1.29%, respectively. This finding casts doubt on the previously held belief that simply scaling the number of tasks makes stronger MT LMs. Leveraging this finding, we further show that this distributed approach of training a separate expert LM per training task instead of a single MT LM for zero-shot inference possesses many benefits, including (1) avoiding negative task transfer that often occurs during instruction tuning, (2) being able to continually learn new tasks without having to re-train on previous tasks to avoid catastrophic forgetting, and (3) showing compositional capabilities when merging individual experts together. The code is available at https://github.com/joeljang/ELM.
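Point (3) can be illustrated with uniform parameter averaging, one simple way to merge experts that share an architecture; whether this matches the paper's exact merging operator is an assumption on our part, and tiny linear models stand in for expert LMs here.

import copy
import torch
import torch.nn as nn

experts = [nn.Linear(16, 16) for _ in range(3)]   # one expert per task

merged = copy.deepcopy(experts[0])
with torch.no_grad():
    for name, param in merged.named_parameters():
        stacked = torch.stack([dict(e.named_parameters())[name]
                               for e in experts])
        param.copy_(stacked.mean(dim=0))          # uniform parameter average

x = torch.randn(2, 16)
print(merged(x).shape)                            # merged model is ready to use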
Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching
Kim, Byoungjip, Choi, Sungik, Hwang, Dasol, Lee, Moontae, Lee, Honglak
Despite surprising performance on zero-shot transfer, pre-training a large-scale multimodal model is often prohibitive as it requires a huge amount of data and computing resources. In this paper, we propose a method (BeamCLIP) that can effectively transfer the representations of a large pre-trained multimodal model (CLIP-ViT) into a small target model (e.g., ResNet-18). For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model by matching the relative similarity distribution across text prompt embeddings. To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts. Our experiments show that unsupervised representation transfer of a pre-trained vision-language model enables a small ResNet-18 to achieve a better ImageNet-1K top-1 linear probe accuracy (66.2%) than vision-only self-supervised learning (SSL) methods (e.g., SimCLR: 51.8%, SwAV: 63.7%), while closing the gap with supervised learning (69.8%).
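A hedged sketch of cross-modal similarity matching (CSM) as described: the student is trained to match the teacher's relative similarity distribution over text-prompt embeddings via a KL objective. Random tensors replace CLIP-ViT features and prompt embeddings, and the temperature and projection below are illustrative assumptions.

import torch
import torch.nn.functional as F

B, D_t, D_s, K, tau = 4, 512, 128, 10, 0.1
teacher_img = torch.randn(B, D_t)            # CLIP-ViT image features (stand-in)
student_img = torch.randn(B, D_s, requires_grad=True)
prompts_t = torch.randn(K, D_t)              # teacher-space prompt embeddings
prompts_s = prompts_t @ torch.randn(D_t, D_s)    # projected to student space

def sim_dist(img, prompts):
    """Log relative-similarity distribution over the K prompt embeddings."""
    sims = F.normalize(img, dim=-1) @ F.normalize(prompts, dim=-1).T
    return F.log_softmax(sims / tau, dim=-1)

loss = F.kl_div(sim_dist(student_img, prompts_s),
                sim_dist(teacher_img, prompts_t).exp(), reduction="batchmean")
loss.backward()                               # student learns to mimic teacher
print(float(loss))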
Knowledge Unlearning for Mitigating Privacy Risks in Language Models
Jang, Joel, Yoon, Dongkeun, Yang, Sohee, Cha, Sungmin, Lee, Moontae, Logeswaran, Lajanugen, Seo, Minjoon
Pretrained Language Models (LMs) memorize a vast amount of knowledge during initial pretraining, including information that may violate the privacy of personal lives and identities. Previous work addressing privacy issues for language models has mostly focused on data preprocessing and differential privacy methods, both requiring re-training the underlying LM. We propose knowledge unlearning as an alternative method to reduce privacy risks for LMs post hoc. We show that simply performing gradient ascent on target token sequences is effective at forgetting them with little to no degradation of general language modeling performance for larger LMs; it sometimes even substantially improves the underlying LM with just a few iterations. We also find that sequential unlearning is better than trying to unlearn all the data at once and that unlearning is highly dependent on which kind of data (domain) is forgotten. By comparing with a previous data preprocessing method and a decoding method known to mitigate privacy risks for LMs, we show that unlearning can give a stronger empirical privacy guarantee in scenarios where the data vulnerable to extraction attacks are known a priori, while being much more efficient and robust. We release the code and dataset needed to replicate our results at https://github.com/joeljang/knowledge-unlearning.
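The core operation, gradient ascent on target token sequences, is simple enough to sketch directly. A tiny random "LM" replaces a pretrained model below; with a real Hugging Face checkpoint the same loop would flip the sign of model(input_ids=..., labels=...).loss instead.

import torch
import torch.nn as nn

vocab, dim = 100, 32
lm = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))
tokens = torch.randint(0, vocab, (1, 16))      # target sequence to forget

opt = torch.optim.SGD(lm.parameters(), lr=1e-2)
for _ in range(10):
    logits = lm(tokens[:, :-1])                # next-token prediction
    nll = nn.functional.cross_entropy(logits.reshape(-1, vocab),
                                      tokens[:, 1:].reshape(-1))
    (-nll).backward()                          # ascend: *increase* the NLL
    opt.step(); opt.zero_grad()
print(float(nll))                              # grows as the sequence is forgotten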
Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost
Cho, Sungjun, Min, Seonwoo, Kim, Jinwoo, Lee, Moontae, Lee, Honglak, Hong, Seunghoon
To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention under hand-crafted patterns and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data while the second still requires quadratic cost in training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Then, each attention head data-adaptively samples a bipartite graph, the adjacency of which is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to flow gradients beyond the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations on the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation can be found at https://github.com/sc782/SBM-Transformer.
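The sampling mechanism lends itself to a short sketch of one attention head, with illustrative shapes and block counts assumed throughout: token-level mixed memberships and block affinities give per-edge probabilities, a discrete mask is sampled, and a straight-through estimator keeps the sampling differentiable.

import torch

n, d, k = 8, 16, 4                        # tokens, head dim, latent blocks
q, key, v = (torch.randn(n, d) for _ in range(3))
to_block = torch.randn(d, k, requires_grad=True)
B = torch.rand(k, k, requires_grad=True)          # block-block affinities

zq = torch.softmax(q @ to_block, dim=-1)          # mixed memberships (queries)
zk = torch.softmax(key @ to_block, dim=-1)        # mixed memberships (keys)
probs = (zq @ B @ zk.T).clamp(0, 1)               # per-edge probabilities

sample = torch.bernoulli(probs)                   # discrete bipartite mask
mask = sample + probs - probs.detach()            # straight-through estimator

weights = torch.softmax((q @ key.T) / d ** 0.5, dim=-1) * mask
weights = weights / (weights.sum(-1, keepdim=True) + 1e-9)
out = weights @ v                                 # cost is linear in kept edges
print(out.shape, f"kept edges: {int(sample.sum())} / {n * n}")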