Large Language Model
ConReader: Exploring Implicit Relations in Contracts for Contract Clause Extraction
Xu, Weiwen, Deng, Yang, Lei, Wenqiang, Zhao, Wenlong, Chua, Tat-Seng, Lam, Wai
We study automatic Contract Clause Extraction (CCE) by modeling implicit relations in legal contracts. Existing CCE methods mostly treat contracts as plain text, creating a substantial barrier to understanding contracts of high complexity. In this work, we first comprehensively analyze the complexity issues of contracts and distill out three implicit relations commonly found in contracts, namely, 1) Long-range Context Relation that captures the correlations of distant clauses; 2) Term-Definition Relation that captures the relation between important terms with their corresponding definitions; and 3) Similar Clause Relation that captures the similarities between clauses of the same type. Then we propose a novel framework ConReader to exploit the above three relations for better contract understanding and improving CCE. Experimental results show that ConReader makes the prediction more interpretable and achieves new state-of-the-art on two CCE tasks in both conventional and zero-shot settings.
NormSAGE: Multi-Lingual Multi-Cultural Norm Discovery from Conversations On-the-Fly
Fung, Yi R., Chakraborty, Tuhin, Guo, Hao, Rambow, Owen, Muresan, Smaranda, Ji, Heng
Norm discovery is important for understanding and reasoning about the acceptable behaviors and potential violations in human communication and interactions. We introduce NormSAGE, a framework for addressing the novel task of conversation-grounded multi-lingual, multi-cultural norm discovery, based on language model prompting and self-verification. NormSAGE leverages the expressiveness and implicit knowledge of the pretrained GPT-3 language model backbone to elicit knowledge about norms through directed questions representing the norm discovery task and conversation context. It further addresses the risk of language model hallucination with a self-verification mechanism ensuring that the norms discovered are correct and substantially grounded in their source conversations. Evaluation results show that our approach discovers significantly more relevant and insightful norms for conversations on-the-fly compared to baselines (more than 10% higher in Likert scale rating). The norms discovered from Chinese conversation are also comparable to the norms discovered from English conversation in terms of insightfulness and correctness (<3% difference). In addition, the culture-specific norms are of promising quality, allowing for 80% accuracy in culture pair human identification. Finally, our grounding process in norm discovery self-verification can be extended for instantiating the adherence and violation of any norm for a given conversation on-the-fly, with explainability and transparency. NormSAGE achieves an AUC of 95.4% in grounding, with natural language explanation matching human-written quality.
The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models
Li, Conglong, Zhang, Minjia, He, Yuxiong
Recent works have demonstrated great success in pre-training large-scale autoregressive language models on massive GPUs. To reduce the wall-clock training time, a common practice is to increase the batch size and learning rate. However, such practice is often brittle and leads to a so-called stability-efficiency dilemma: increasing the batch size and learning rate leads to better training efficiency but can also result in training instability, leading to poor generalization accuracy or failed runs. To better understand this phenomenon, we conduct an in-depth analysis on large-scale pre-training experiments replicating the GPT-2 model. We find that there is a strong correlation between training instability and extreme values of gradient variance, and that samples with long sequence lengths contribute to these extreme gradient variance values, especially at the beginning of the training, indicating that long sequence length can be a main source of training instability. Based on the analysis, we present a Sequence Length Warmup method that aims to solve the training stability-efficiency dilemma. Experiments replicating GPT-2 models show that our approach enables stable training with 8x larger batch size and 4x larger learning rate, whereas the baseline approach struggles with training instability. To achieve the same or better zero-shot evaluation results, our method reduces the required number of training tokens and wall-clock time by up to 2.2x and 3.7x, respectively. Experiments replicating the GPT-3 model (125M) show that our approach enables stable training with 8x larger batch size and 40x larger learning rate, and retains 99% of the zero-shot accuracy on 11 tasks using 10x less data and 17x less time compared to the original GPT-3 training recipe, while the baseline diverges under the same settings and only retains 95% of the accuracy under a lower learning rate.
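As a rough illustration of the idea behind sequence length warmup, the sketch below grows the training sequence length from a short starting value to the full length over a fixed number of steps and truncates each batch accordingly. The linear schedule, the function names `warmup_seq_len` and `truncate_batch`, and all default values are assumptions made for this example; the abstract does not specify the schedule the authors use.

```python
def warmup_seq_len(step, warmup_steps, start_len=8, full_len=1024, multiple=8):
    """Return the sequence length to use at a given training step.

    Assumed schedule (not taken from the paper): linear growth from
    start_len to full_len over warmup_steps, rounded down to a multiple
    of `multiple`, then constant at full_len.
    """
    if step >= warmup_steps:
        return full_len
    frac = step / warmup_steps
    raw = start_len + frac * (full_len - start_len)
    rounded = int(raw) // multiple * multiple
    # Clamp so rounding never drops below the starting length.
    return max(start_len, min(full_len, rounded))


def truncate_batch(batch, seq_len):
    """Truncate every token sequence in the batch to seq_len tokens."""
    return [tokens[:seq_len] for tokens in batch]


if __name__ == "__main__":
    for step in (0, 25, 50, 100):
        print(step, warmup_seq_len(step, warmup_steps=100))
```

Early steps see only short sequences, which is the regime the abstract associates with lower gradient variance; later steps train on full-length samples.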
Testing OpenAI's Whisper with a Scottish accent
OpenAI's recent release of Whisper boasts human-level robustness and accuracy in speech recognition. I'm not Scottish (although I was born pretty close), but I immediately wanted to test it with a Scottish accent and compare it to "human-level". Having bought an unexciting new iPhone, at least I could put its A16 Bionic chip with 16-core Neural Engine through its paces for my experiment. Once the boring tech stuff was out of the way, I shared the test app on TestFlight with a few colleagues, yielding much amusement with its borderline magical results. Here's a little clip from the start of Trainspotting, which is particularly challenging for machines to understand; a Scottish accent over the top of Iggy Pop isn't something you'd train for.
Ultra-Large AI Models Are Over
I don't mean 'over' as in "you won't see a new large AI model ever again" but as in "AI companies have reasons to not pursue them as a core research goal--indefinitely." This article isn't a critique of the past years--even if I don't buy the "scale is all you need" argument, I acknowledge just how far scaling has advanced the field. A parallel can be drawn between the 2020-2022 scaling race and--keeping the distance--the 50s-70s space race. Both advanced science significantly as a byproduct of other intentions. While space exploration was innovative in nature, the quest for novelty isn't present in the "bigger is better" AI trend: To conquer space, the US and USSR had to design novel paths toward a clear goal. In contrast, AI companies have blindly followed a predefined path without knowing why or whether it'd lead us anywhere. You can't put the cart before the horse.
InsNet: An Efficient, Flexible, and Performant Insertion-based Text Generation Model
Lu, Sidi, Meng, Tao, Peng, Nanyun
Insertion-based text generation that formulates the generation process as a sequence of token insertion operations has received increasing attention in recent years. There are two major advantages of insertion-based generation over the prevalent left-to-right auto-regressive paradigm: 1) It reduces the decoding cost to sub-linear w.r.t. the sequence length with parallel decoding (Stern et al., 2019; Gu et al., 2019b), and 2) the flexible insertion orders may better recover/utilize the underlying linguistic structures of languages (Welleck et al., 2019; Gu et al., 2019a). However, this new paradigm of text generation brings unique challenges, mostly in the training efficiency. Unlike left-to-right auto-regressive decoders which monotonically expand the context, the insertion operations complicate the position information of each token as the context expands. Concretely, as is shown in Figure 1, the absolute position of a token in a sequence constantly changes along with the insertion operations. As a result, a naive implementation of insertion-based models (e.g., Stern et al. (2019); Gu et al. (2019b)) needs to re-encode the context with updated positional information for each token as the insertions proceed, yielding inefficient training with O(n) times of context re-encoding (with n indicating the sequence length). To tackle this problem, previous insertion-based generation models such as Insertion Transformer (InsT) (Stern et al., 2019) and Levenshtein Transformer (LevT) (Gu et al., 2019b) propose parallel token insertion to reduce the insertion/re-encoding steps from O(n) to Θ(log n) for both training and inference. However, while it works well for machine translation, such parallel insertion falls short on high-entropy generation tasks such as open-domain dialogue systems (Li et al., 2017a), creative
Open-vocabulary Queryable Scene Representations for Real World Planning
Chen, Boyuan, Xia, Fei, Ichter, Brian, Rao, Kanishka, Gopalakrishnan, Keerthana, Ryoo, Michael S., Stone, Austin, Kappler, Daniel
Abstract-- Large language models (LLMs) have unlocked new capabilities of task planning from human instructions. NLMap first establishes a natural language queryable scene representation with Visual Language models (VLMs). An LLM-based object proposal module parses instructions and proposes involved objects to query the scene representation for object availability and location. An LLM planner then plans with such information about the scene. We propose an open-vocabulary and queryable scene representation for real-world planning. The returned object presence and location are used for LLM-based planning. It has to first identify relevant objects and locations within the scene (e.g., the watering can, the sink, and each potential plant) and then plan over these objects in sequential order (get the watering can, then go to the sink, and then fill it up), conditioning on its affordances (e.g., can it carry a full watering can), and conditioning on the scene (e.g., how many plants there are, and where are they). Recent progress in large language models (LLMs) has shown impressive few-shot performance in language comprehension, semantic understanding, and reasoning, as well as application to robotics problems like planning [5]-[7] and instruction following [8]. Using such models in embodied settings can provide significant challenges, most critically because
Multi-Game Decision Transformers
Lee, Kuang-Huei, Nachum, Ofir, Yang, Mengjiao, Lee, Lisa, Freeman, Daniel, Xu, Winnie, Guadarrama, Sergio, Fischer, Ian, Jang, Eric, Michalewski, Henryk, Mordatch, Igor
A longstanding goal of the field of AI is a method for learning a highly capable, generalist agent from diverse experience. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model - with a single set of weights - trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction.
Can Language Representation Models Think in Bets?
Tang, Zhisheng, Kejriwal, Mayank
In recent years, transformer-based language representation models (LRMs) have achieved state-of-the-art results on difficult natural language understanding problems, such as question answering and text summarization. As these models are integrated into real-world applications, evaluating their ability to make rational decisions is an important research agenda, with practical ramifications. This article investigates LRMs' rational decision-making ability through a carefully designed set of decision-making benchmarks and experiments. Inspired by classic work in cognitive science, we model the decision-making problem as a bet. We then investigate an LRM's ability to choose outcomes that have optimal, or at minimum, positive expected gain. Through a robust body of experiments on four established LRMs, we show that a model is only able to 'think in bets' if it is first fine-tuned on bet questions with an identical structure. Modifying the bet question's structure, while still retaining its fundamental characteristics, decreases an LRM's performance by more than 25%, on average, although absolute performance remains well above random. LRMs are also found to be more rational when selecting outcomes with non-negative expected gain, rather than optimal or strictly positive expected gain. Our results suggest that LRMs could potentially be applied to tasks that rely on cognitive decision-making skills, but that more research is necessary before they can robustly make rational decisions.
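To make the notion of expected gain concrete, here is a minimal sketch that scores a bet as the probability-weighted sum of its payoffs and then separates the optimal choice from the merely non-negative ones. The function names, the bet labels, and the payoff numbers are all hypothetical illustrations; they are not taken from the paper's benchmark.

```python
def expected_gain(outcomes):
    """Expected gain of a bet given (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)


def optimal_bet(bets):
    """Name of the bet with the highest expected gain."""
    return max(bets, key=lambda name: expected_gain(bets[name]))


def non_negative_bets(bets):
    """Names of all bets whose expected gain is at least zero."""
    return [name for name in bets if expected_gain(bets[name]) >= 0]


# Hypothetical bets: each is a list of (probability, payoff) pairs.
bets = {
    "coin_flip": [(0.5, 10.0), (0.5, -5.0)],    # 0.5*10 - 0.5*5
    "long_shot": [(0.01, 100.0), (0.99, -2.0)],  # 0.01*100 - 0.99*2
    "abstain": [(1.0, 0.0)],                     # no stake, no gain
}

if __name__ == "__main__":
    for name, outcomes in bets.items():
        print(name, expected_gain(outcomes))
    print("optimal:", optimal_bet(bets))
    print("non-negative:", non_negative_bets(bets))
```

In this toy setup, choosing any bet from the non-negative set is an easier criterion to satisfy than choosing the single optimal bet, which mirrors the distinction the abstract draws between the two kinds of rationality.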