Large Language Model
Google's Bard AI is getting better at programming
It seems like there's a new update announced every day in the ongoing race to have the most advanced AI. The latest comes courtesy of Google, which is launching further improvements to Bard, including better advanced reasoning and math abilities. Bard will no longer rely solely on LLMs, which are best for predictions versus solving complex problems. Instead, it should now identify when further processing could help and then generate background code to improve correctness. According to Google, this update boosted computation-based word and math problems' accuracy across their internal challenge datasets by 30 percent.
Deepmind's AI Is Learning About the Art of Coding
In the field of computer science, there is perhaps no more fundamental task than to sort. Bubble, heap, merge--take your pick. The methods for reordering data inside a computer have been theorized to death, served as practice exercises for millions of novices, and been optimized for decades by expert developers. Type a sort() function in any programming language, and it's code you can rely on. But last year, an AI system developed by engineers at Google's Deepmind improved on great by just enough to matter. The system, which Deepmind calls AlphaDev, was tasked with coming up with a new way to sort short sequences of numbers in C++, the popular coding language.
Companies want to use AI tracking to make you better at your job
Ramirez, the vice president at Glue, says the tech uses large language models including ChatGPT to help determine workers' individual signals and what they mean. Then Glue can generate scores based on connectivity to a team, across teams, with leadership and an overall sense of belonging. Glue, which also specializes in AI-powered virtual events, automated employee introductions and off-site planning, also offers personalized suggestions for disconnected workers, including a coffee meeting between two people based on openings on both parties' calendar. Unhappy "people start not showing up … and their connection changes from talking to manager to [talking to] lateral groups," Ramirez said. "It could mean trouble is brewing or a concern to look into."
'What should the limits be?' The father of ChatGPT on whether AI will save humanity – or destroy it
When I meet Sam Altman, the chief executive of AI research laboratory OpenAI, he is in the middle of a world tour. He is preaching that the very AI systems he and his competitors are building could pose an existential risk to the future of humanity – unless governments work together now to establish guide rails, ensuring responsible development over the coming decade. In the subsequent days, he and hundreds of tech leaders, including scientists and "godfathers of AI", Geoffrey Hinton and Yoshua Bengio, as well as Google's DeepMind CEO, Demis Hassabis, put out a statement saying that "mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war". It is an all-out effort to convince world leaders that they are serious when they say that "AI risk" needs concerted international effort. It must be an interesting position to be in – Altman, 38, is the daddy of AI chatbot ChatGPT, after all, and is leading the charge to create "artificial general intelligence", or AGI, an AI system capable of tackling any task a human can achieve.
Prompter: Zero-shot Adaptive Prefixes for Dialogue State Tracking Domain Adaptation
Aksu, Taha, Kan, Min-Yen, Chen, Nancy F.
A challenge in the Dialogue State Tracking (DST) field is adapting models to new domains without using any supervised data -- zero-shot domain adaptation. Parameter-Efficient Transfer Learning (PETL) has the potential to address this problem due to its robustness. However, it has yet to be applied to zero-shot scenarios, as it is not clear how to apply it without supervision. Our method, Prompter, uses descriptions of target domain slots to generate dynamic prefixes that are concatenated to the keys and values at each layer's self-attention mechanism. This allows for the use of prefix-tuning in zero-shot settings. Prompter outperforms previous methods on both the MultiWOZ and SGD benchmarks. In generating prefixes, our analyses find that Prompter not only utilizes the semantics of slot descriptions but also how often the slots appear together in conversation. Moreover, Prompter's gains are due to its improved ability to distinguish "none"-valued dialogue slots, compared against baselines.
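The prefix mechanism described in the abstract above can be illustrated with a minimal single-head attention sketch. This is not the authors' implementation: the function names are hypothetical, and the prefix vectors are passed in directly, whereas Prompter generates them from slot descriptions. It only shows that prefixes extend the keys and values while the queries, and therefore the output length, stay unchanged.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values, prefix_keys=(), prefix_values=()):
    """Single-head scaled dot-product attention with optional prefix vectors.

    Hypothetical helper: prefix vectors are prepended to the keys and values
    only, so every query can attend to them, but no extra output rows appear.
    """
    all_keys = list(prefix_keys) + list(keys)
    all_values = list(prefix_values) + list(values)
    scale = math.sqrt(len(queries[0]))
    dim = len(all_values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in all_keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, all_values)) for d in range(dim)]
        )
    return outputs
```

Calling `attend` with and without `prefix_keys`/`prefix_values` returns the same number of output vectors, but the prefix shifts where the attention weight goes.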
Can current NLI systems handle German word order? Investigating language model performance on a new German challenge set of minimal pairs
Compared to English, German word order is freer and therefore poses additional challenges for natural language inference (NLI). We create WOGLI (Word Order in German Language Inference), the first adversarial NLI dataset for German word order that has the following properties: (i) each premise has an entailed and a non-entailed hypothesis; (ii) premise and hypotheses differ only in word order and necessary morphological changes to mark case and number. In particular, each premise and its two hypotheses contain exactly the same lemmata. Our adversarial examples require the model to use morphological markers in order to recognise or reject entailment. We show that current German autoencoding models fine-tuned on translated NLI data can struggle on this challenge set, reflecting the fact that translated NLI datasets will not mirror all necessary language phenomena in the target language. We also examine performance after data augmentation as well as on related word order phenomena derived from WOGLI. Our datasets are publicly available at https://github.com/ireinig/wogli.
HowkGPT: Investigating the Detection of ChatGPT-generated University Student Homework through Context-Aware Perplexity Analysis
Vasilatos, Christoforos, Alam, Manaar, Rahwan, Talal, Zaki, Yasir, Maniatakos, Michail
As the use of Large Language Models (LLMs) in text generation tasks proliferates, concerns arise over their potential to compromise academic integrity. The education sector currently tussles with distinguishing student-authored homework assignments from AI-generated ones. This paper addresses the challenge by introducing HowkGPT, designed to identify homework assignments generated by AI. HowkGPT is built upon a dataset of academic assignments and accompanying metadata [17] and employs a pretrained LLM to compute perplexity scores for student-authored and ChatGPT-generated responses. These scores then assist in establishing a threshold for discerning the origin of a submitted assignment. Given the specificity and contextual nature of academic work, HowkGPT further refines its analysis by defining category-specific thresholds derived from the metadata, enhancing the precision of the detection. This study emphasizes the critical need for effective strategies to uphold academic integrity amidst the growing influence of LLMs and provides an approach to ensuring fair and accurate grading in educational institutions.
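A rough sketch of the perplexity-threshold idea in the abstract above, under stated assumptions: the per-token log-probabilities would come from some pretrained LLM (not shown here), the midpoint rule for choosing a threshold is a hypothetical stand-in for whatever the paper actually uses, and the sketch assumes that AI-generated text tends to score lower perplexity than student writing. All function names are illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def midpoint_threshold(human_ppls, ai_ppls):
    """Hypothetical rule: midpoint between the mean perplexity of each class."""
    mean_human = sum(human_ppls) / len(human_ppls)
    mean_ai = sum(ai_ppls) / len(ai_ppls)
    return (mean_human + mean_ai) / 2


def category_thresholds(samples):
    """Build one threshold per category.

    `samples` maps a category name to a (human_ppls, ai_ppls) pair, mirroring
    the category-specific thresholds the abstract derives from metadata.
    """
    return {cat: midpoint_threshold(h, a) for cat, (h, a) in samples.items()}


def looks_ai_generated(ppl, threshold):
    """Assumption: AI text is more predictable, so it falls below the threshold."""
    return ppl < threshold
```

For instance, a response whose four tokens each have probability 0.5 has perplexity 2, which would be compared against the threshold for its assignment category.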
Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions
Chung, John Joon Young, Kamar, Ece, Amershi, Saleema
Large language models (LLMs) can be used to generate text data for training and evaluating other models. However, creating high-quality datasets with LLMs can be challenging. In this work, we explore human-AI partnerships to facilitate high diversity and accuracy in LLM-based text data generation. We first examine two approaches to diversify text generation: 1) logit suppression, which minimizes the generation of languages that have already been frequently generated, and 2) temperature sampling, which flattens the token sampling probability. We found that diversification approaches can increase data diversity but often at the cost of data accuracy (i.e., text and labels being appropriate for the target domain). To address this issue, we examined two human interventions, 1) label replacement (LR), correcting misaligned labels, and 2) out-of-scope filtering (OOSF), removing instances that are out of the user's domain of interest or to which no considered label applies. With oracle studies, we found that LR increases the absolute accuracy of models trained with diversified datasets by 14.4%. Moreover, we found that some models trained with data generated with LR interventions outperformed LLM-based few-shot classification. In contrast, OOSF was not effective in increasing model accuracy, implying the need for future work in human-in-the-loop text data generation.
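The two diversification approaches named in the abstract above can be sketched as follows. The temperature-scaled softmax is the standard formulation; the suppression rule (subtracting a penalty proportional to how often a token has already been generated) is an assumed, simplified form for illustration, not necessarily the paper's exact method. Function and parameter names are hypothetical.

```python
import math
from collections import Counter


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; temperature > 1 flattens the distribution."""
    scaled = [logit / temperature for logit in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_frequent(logits, vocab, generated_counts, penalty=1.0):
    """Assumed suppression rule: lower each token's logit by penalty * count,
    where count is how many times that token was already generated."""
    return [
        logit - penalty * generated_counts.get(token, 0)
        for logit, token in zip(logits, vocab)
    ]


# Usage: "good" has already been generated three times, so its logit is lowered
# before the flattened sampling distribution is computed.
vocab = ["good", "fine", "bad"]
logits = [2.0, 1.0, 0.0]
counts = Counter({"good": 3})
adjusted = suppress_frequent(logits, vocab, counts, penalty=1.0)
probs = softmax_with_temperature(adjusted, temperature=2.0)
```

Both knobs push probability mass away from the most likely tokens, which is consistent with the abstract's observation that diversity can come at the cost of accuracy.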
Check Me If You Can: Detecting ChatGPT-Generated Academic Writing using CheckGPT
Liu, Zeyan, Yao, Zijun, Li, Fengjun, Luo, Bo
With ChatGPT under the spotlight, utilizing large language models (LLMs) for academic writing has drawn significant discussion and concern in the community. While substantial research efforts have been stimulated for detecting LLM-Generated Content (LLM-content), most of the attempts are still in the early stage of exploration. In this paper, we present a holistic investigation of detecting LLM-generated academic writing, by providing a dataset, evidence, and algorithms, in order to inspire more community effort to address the concern of LLM academic misuse. We first present GPABenchmark, a benchmarking dataset of 600,000 samples of human-written, GPT-written, GPT-completed, and GPT-polished abstracts of research papers in CS, physics, and humanities and social sciences (HSS). We show that existing open-source and commercial GPT detectors provide unsatisfactory performance on GPABenchmark, especially for GPT-polished text. Moreover, through a user study of 150+ participants, we show that it is highly challenging for human users, including experienced faculty members and researchers, to identify GPT-generated abstracts. We then present CheckGPT, a novel LLM-content detector consisting of a general representation module and an attentive-BiLSTM classification module, which is accurate, transferable, and interpretable. Experimental results show that CheckGPT achieves an average classification accuracy of 98% to 99% for the task-specific, discipline-specific detectors and the unified detectors. CheckGPT is also highly transferable: without tuning, it achieves ~90% accuracy in new domains, such as news articles, while a model tuned with approximately 2,000 samples in the target domain achieves ~98% accuracy. Finally, we demonstrate the explainability insights obtained from CheckGPT to reveal the key behaviors of how LLMs generate texts.
In-Context Learning through the Bayesian Prism
Ahuja, Kabir, Panwar, Madhur, Goyal, Navin
In-context learning is one of the surprising and useful features of large language models. How it works is an active area of research. Recently, stylized meta-learning-like setups have been devised that train these models on a sequence of input-output pairs $(x, f(x))$ from a function class using the language modeling loss and observe generalization to unseen functions from the same class. One of the main discoveries in this line of research has been that for several problems such as linear regression, trained transformers learn algorithms for learning functions in context. However, the inductive biases of these models resulting in this behavior are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. It has been shown that high-capacity transformers mimic the Bayesian predictor for linear regression. In this paper, we show empirical evidence of transformers exhibiting the behavior of this ideal learner across different linear and non-linear function classes. We also extend the previous setups to work in the multitask setting and verify that transformers can do in-context learning in this setup as well and the Bayesian perspective sheds light on this setting also. Finally, via the example of learning Fourier series, we study the inductive bias for in-context learning. We find that in-context learning may or may not have simplicity bias depending on the pretraining data distribution.
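For the linear regression case mentioned in the abstract above, the Bayesian predictor has a closed form in the standard textbook setting. The prior and noise model below are assumptions made for illustration (a standard Gaussian prior on the weights and Gaussian observation noise); the abstract does not specify them.

```latex
% Assumed setup: f(x) = w^\top x, prior w \sim \mathcal{N}(0, I_d),
% in-context examples y_i = w^\top x_i + \varepsilon_i with \varepsilon_i \sim \mathcal{N}(0, \sigma^2).
% Stack the k examples into X \in \mathbb{R}^{k \times d} and y \in \mathbb{R}^{k}.
\begin{align}
  \hat{w} &= \mathbb{E}[\, w \mid X, y \,]
           = \left( X^\top X + \sigma^2 I_d \right)^{-1} X^\top y, \\
  \hat{y}_{\text{query}} &= \hat{w}^\top x_{\text{query}}.
\end{align}
```

Under squared loss, the posterior mean prediction is the optimal one, so a transformer that mimics the Bayesian predictor in this setting would behave like ridge regression with regularization strength $\sigma^2$ on the in-context examples.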