 curie


Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents

Kon, Patrick Tser Jern, Liu, Jiachen, Ding, Qiuyi, Qiu, Yiming, Yang, Zhenning, Huang, Yibo, Srinivasa, Jayanth, Lee, Myungjin, Chowdhury, Mosharaf, Chen, Ang

arXiv.org Artificial Intelligence

Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4× improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.


Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP

Du, Wei, Advani, Laksh, Gambhir, Yashmeet, Perry, Daniel J, Shiralkar, Prashant, Xing, Zhengzheng, Colak, Aaron

arXiv.org Artificial Intelligence

... natural language processing (NLP) tasks using the latest generative pretrained models such as GPT (OpenAI, 2023; Ouyang et al., 2022), PaLM (Chowdhery et al., 2022), and many others (Touvron et al., 2023; Bai et al., 2022; Penedo et al., 2023; Taori et al., 2023). This new generation of models opens up many new possibilities including competitive performance in zero-shot and few-shot settings for tasks that have typically been modeled using a supervised setting (OpenAI, 2023). More established language models (BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLM-Roberta (Conneau et al., 2020b), etc.) provide a strong balance of inference cost and task performance for such systems. This broad class of large language models (LLMs) used for complex supervised NLP ...

More recently, (Fu et al., 2023) creates a meta-model responsible for predicting the accuracy of the LLM model using the model's confidence scores as features. Methods from the computer vision (CV) domain to assess unlabeled data more generally have, for example, proposed the average threshold confidence method that learns a threshold over the model's confidence, predicting accuracy as the fraction of unlabeled examples exceeding that threshold (Garg et al., 2022), or iteratively learn an ensemble of models to identify misclassified data points and perform self-training to improve the ensemble with the identified points (Chen et al., 2021). However, the metrics and hyperparameters in previous works are specifically for classification tasks and cannot be easily extended to more complex tasks.
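The average threshold confidence idea described in this excerpt can be illustrated with a minimal Python sketch. It rests on an assumption the excerpt does not spell out: the threshold is chosen on a small labeled validation set so that the share of confidences above it matches the validation accuracy. The function names `learn_threshold` and `predict_accuracy` are hypothetical and the data is a toy example, not taken from the paper.

```python
def learn_threshold(val_confidences, val_correct):
    """Pick the threshold whose share of validation confidences above it
    is closest to the observed validation accuracy (assumed fitting rule)."""
    accuracy = sum(val_correct) / len(val_correct)
    best_t, best_gap = 0.0, float("inf")
    # Candidate thresholds: zero plus every distinct confidence value.
    for t in [0.0] + sorted(set(val_confidences)):
        share_above = sum(c > t for c in val_confidences) / len(val_confidences)
        gap = abs(share_above - accuracy)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t


def predict_accuracy(unlabeled_confidences, threshold):
    """Estimate accuracy as the fraction of unlabeled examples whose
    confidence exceeds the learned threshold."""
    above = sum(c > threshold for c in unlabeled_confidences)
    return above / len(unlabeled_confidences)


# Toy data: model confidences and whether each prediction was correct.
val_conf = [0.9, 0.8, 0.6, 0.4]
val_ok = [1, 1, 1, 0]
threshold = learn_threshold(val_conf, val_ok)
estimate = predict_accuracy([0.95, 0.5, 0.3, 0.2], threshold)
print(threshold, estimate)
```

As the excerpt notes, this kind of estimate is tied to classification, where a single confidence score per example is available.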


Learning to Predict Concept Ordering for Common Sense Generation

Zhang, Tianhui, Bollegala, Danushka, Peng, Bei

arXiv.org Artificial Intelligence

Prior work has shown that the ordering in which concepts are shown to a commonsense generator plays an important role, affecting the quality of the generated sentence. However, it remains a challenge to determine the optimal ordering of a given set of concepts such that a natural sentence covering all the concepts could be generated from a pretrained generator. To understand the relationship between the ordering of the input concepts and the quality of the generated sentences, we conduct a systematic study considering multiple language models (LMs) and concept ordering strategies. We find that the BART-large model consistently outperforms all other LMs considered in this study when fine-tuned using the ordering of concepts as they appear in CommonGen training data, as measured using multiple evaluation metrics. Moreover, the larger GPT3-based large language model (LLM) variants do not necessarily outperform much smaller LMs on this task, even when fine-tuned on task-specific training data. Interestingly, human annotators significantly reorder input concept sets when manually writing sentences covering those concepts, and this ordering provides the best sentence generations independently of the LM used for the generation, outperforming a probabilistic concept ordering baseline.


Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models

Prystawski, Ben, Thibodeau, Paul, Potts, Christopher, Goodman, Noah D.

arXiv.org Artificial Intelligence

Probabilistic models of language understanding are valuable tools for investigating human language use. However, they need to be hand-designed for a particular domain. In contrast, large language models (LLMs) are trained on text that spans a wide array of domains, but they lack the structure and interpretability of probabilistic models. In this paper, we use chain-of-thought prompts to introduce structures from probabilistic models into LLMs. We explore this approach in the case of metaphor understanding. Our chain-of-thought prompts lead language models to infer latent variables and reason about their relationships in order to choose appropriate paraphrases for metaphors. The latent variables and relationships chosen are informed by theories of metaphor understanding from cognitive psychology. We apply these prompts to the two largest versions of GPT-3 and show that they can improve performance in a paraphrase selection task.


GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models

Prasad, Archiki, Hase, Peter, Zhou, Xiang, Bansal, Mohit

arXiv.org Artificial Intelligence

Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large language models in a zero-shot setting. Recent work has aimed to improve such prompts via manual rewriting or gradient-based tuning. However, manual rewriting is time-consuming and requires subjective interpretation, while gradient-based tuning can be extremely computationally demanding for large models and may not be feasible for API-based models. In this work, we introduce Gradient-free Instructional Prompt Search (GrIPS), a gradient-free, edit-based search approach for improving task instructions for large language models. GrIPS takes in instructions designed for humans and automatically returns an improved, edited prompt, while allowing for API-based tuning. With InstructGPT models, GrIPS improves the average task performance by up to 4.30 percentage points on eight classification tasks from the Natural Instructions dataset (with similar improvements for OPT, BLOOM, and FLAN-T5). We see improvements for both instruction-only prompts and instruction + k-shot examples prompts. Notably, GrIPS outperforms manual rewriting and purely example-based prompts while controlling for the available compute and data budget. Further, performance of GrIPS is comparable to select gradient-based tuning approaches. Qualitatively, we show our edits can simplify instructions and at times make them incoherent but nonetheless improve accuracy. Our code is available at: https://github.com/archiki/GrIPS
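To make the idea of a gradient-free, edit-based search concrete, here is a minimal Python sketch of a greedy search over edited instructions. It is not the GrIPS implementation (see the linked repository for that). The word-level delete and swap edits, the function names `propose_edits` and `edit_search`, and the scoring function `toy_score` are all assumptions for illustration. In practice the score would come from querying the model on a small labeled set, which only needs API access.

```python
import random


def propose_edits(words, rng, num_candidates):
    """Create candidate instructions by deleting or swapping words."""
    candidates = []
    for _ in range(num_candidates):
        edited = list(words)
        if len(edited) > 1:
            if rng.random() < 0.5:
                # Delete one randomly chosen word.
                del edited[rng.randrange(len(edited))]
            else:
                # Swap two randomly chosen words.
                i, j = rng.sample(range(len(edited)), 2)
                edited[i], edited[j] = edited[j], edited[i]
        candidates.append(edited)
    return candidates


def edit_search(instruction, score_fn, iterations=10, num_candidates=5, seed=0):
    """Greedy search: keep an edited instruction only if it scores higher."""
    rng = random.Random(seed)
    best = instruction.split()
    best_score = score_fn(" ".join(best))
    for _ in range(iterations):
        scored = [(score_fn(" ".join(c)), c)
                  for c in propose_edits(best, rng, num_candidates)]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score > best_score:
            best, best_score = top, top_score
    return " ".join(best), best_score


def toy_score(text):
    """Hypothetical stand-in for task accuracy: reward a key word,
    penalize length."""
    words = text.split()
    return (1.0 if "sentiment" in words else 0.0) - 0.1 * len(words)


original = "please kindly classify the sentiment of the following review"
improved, score = edit_search(original, toy_score)
print(improved, score)
```

Because candidates are accepted only when the score strictly improves, the search never returns an instruction that scores worse than the original, though it may shorten the instruction into something less fluent.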


Azure OpenAI Service models - Azure OpenAI

#artificialintelligence

Azure OpenAI provides access to many different models, grouped by family and capability. A model family typically associates models by their intended task. The following table describes model families currently available in Azure OpenAI. Not all models are available in all regions currently. Each model family has a series of models that are further distinguished by capability.


LLaMA: Open and Efficient Foundation Language Models

Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, Rozière, Baptiste, Goyal, Naman, Hambro, Eric, Azhar, Faisal, Rodriguez, Aurelien, Joulin, Armand, Grave, Edouard, Lample, Guillaume

arXiv.org Artificial Intelligence

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.


The Twitch 'Seinfeld' Show Proves AI Shouldn't Write Comedy

WIRED

David Foster Wallace's 1996 novel Infinite Jest is about marijuana addiction and a spate of deaths caused by a looped video so mesmerizing viewers do not unglue themselves to eat or drink. The author never says what's in the video, but it could've easily been an AI-generated parody of Seinfeld. On December 14, Skyler Hartle, a senior project manager at Microsoft, and Brian Habersberger, a photovoltaic encapsulant materials scientist at Dow Chemical, launched an art project on Twitch. They had a company draw a Minecraft-y version of the Seinfeld sets, created characters with automaton-edged voices, and gave the AI text-generator GPT-3 a broad prompt: characters in a room together having a humorous conversation. Because Seinfeld claimed to be about nothing, and because the AI could generate new material 24 hours a day, they called it Nothing, Forever.


AI Seinfeld was surreal fun until it called being trans an illness

Engadget

Twitch has banned "Nothing, Forever," the AI-generated Seinfeld stream, for at least 14 days following a transphobic and homophobic outburst. It's the latest example of "hate in, hate out" when AI chatbots are trained on offensive content without adequate moderation. Like Seinfeld, "Nothing, Forever" rotates between standup bits and scenes in the comedian's apartment (he's called "Larry Feinberg" in the AI version). As first reported by Vice, during one of the recent AI-scripted standup acts, the Seinfeld counterpart suggested that being transgender is a mental illness. In what almost seemed like an awareness of the material's offensiveness, the AI comedian quickly added, "But no one is laughing, so I'm going to stop." Although Twitch hasn't confirmed that the "joke" was the reason for the ban, the stream was removed soon after the problematic segment aired. The program's creators blame the hurtful rant on a model change that inadvertently left the stream without moderation tools. "Earlier tonight, we started having an outage using OpenAI's GPT-3 Davinci model, which caused the show to exhibit errant behaviors (you may have seen empty rooms cycling through)," a staff member wrote on Discord. "OpenAI has a less sophisticated model, Curie, that was the predecessor to Davinci.


Fine-Tuning GPT3 for free. Using GPT3 on your data for free

#artificialintelligence

What do you call French bread? This is one of the jokes generated by GPT3 after it was fine-tuned on some jokes from Reddit. For more AI-generated jokes scroll to the end of the article where I write some of my favourite jokes generated by GPT3. GPT3 is the new state-of-the-art language model. When it was released back in 2020, it was hyped a lot.