Curie: Toward Rigorous and Automated Scientific Experimentation with AI Agents
Kon, Patrick Tser Jern, Liu, Jiachen, Ding, Qiuyi, Qiu, Yiming, Yang, Zhenning, Huang, Yibo, Srinivasa, Jayanth, Lee, Myungjin, Chowdhury, Mosharaf, Chen, Ang
Scientific experimentation, a cornerstone of human progress, demands rigor in reliability, methodical control, and interpretability to yield meaningful results. Despite the growing capabilities of large language models (LLMs) in automating different aspects of the scientific process, automating rigorous experimentation remains a significant challenge. To address this gap, we propose Curie, an AI agent framework designed to embed rigor into the experimentation process through three key components: an intra-agent rigor module to enhance reliability, an inter-agent rigor module to maintain methodical control, and an experiment knowledge module to enhance interpretability. To evaluate Curie, we design a novel experimental benchmark composed of 46 questions across four computer science domains, derived from influential research papers and widely adopted open-source projects. Compared to the strongest baseline tested, we achieve a 3.4$\times$ improvement in correctly answering experimental questions. Curie is open-sourced at https://github.com/Just-Curieous/Curie.
- North America > United States > Michigan (0.04)
- Europe > Spain > Aragón (0.04)
- Workflow (1.00)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP
Du, Wei, Advani, Laksh, Gambhir, Yashmeet, Perry, Daniel J, Shiralkar, Prashant, Xing, Zhengzheng, Colak, Aaron
…natural language processing (NLP) tasks using the latest generative pretrained models such as GPT (OpenAI, 2023; Ouyang et al., 2022), PaLM (Chowdhery et al., 2022), and many others (Touvron et al., 2023; Bai et al., 2022; Penedo et al., 2023; Taori et al., 2023). This new generation of models opens up many new possibilities including competitive performance in zero-shot and few-shot settings for tasks that have typically been modeled using a supervised setting (OpenAI, 2023). More established language models (BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLM-Roberta (Conneau et al., 2020b), etc.) provide a strong balance of inference cost and task performance for such systems. This broad class of large language models (LLMs) used for complex supervised NLP…

More recently, (Fu et al., 2023) creates a meta-model responsible for predicting the accuracy of the LLM model using the model's confidence scores as features. Methods from the computer vision (CV) domain to assess unlabeled data more generally have, for example, proposed the average threshold confidence method that learns a threshold over the model's confidence, predicting accuracy as the fraction of unlabeled examples exceeding that threshold (Garg et al., 2022), or iteratively learn an ensemble of models to identify misclassified data points and perform self-training to improve the ensemble with the identified points (Chen et al., 2021). However, the metrics and hyperparameters in previous works are specifically for classification tasks and cannot be easily extended to more complex tasks.
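The average threshold confidence (ATC) idea mentioned above (Garg et al., 2022) is simple enough to sketch concretely: fit a confidence threshold on held-out labeled data so that the fraction of examples above it matches the observed accuracy, then estimate accuracy on unlabeled data as the fraction of confidences exceeding that threshold. The function name and threshold-fitting rule below are an illustrative assumption, not the paper's exact implementation.

```python
import numpy as np

def average_threshold_confidence(val_confidences, val_correct, unlabeled_confidences):
    """Sketch of average threshold confidence (ATC).

    Fit a threshold t on validation confidences so that the fraction of
    validation examples with confidence >= t equals validation accuracy,
    then predict accuracy on unlabeled data as the fraction exceeding t.
    """
    val_acc = np.mean(val_correct)
    # Choose t so that P(confidence >= t) on validation matches accuracy.
    t = np.quantile(val_confidences, 1.0 - val_acc)
    # Estimated accuracy on the unlabeled set.
    return float(np.mean(unlabeled_confidences >= t))
```

For a well-calibrated model the estimate tracks true accuracy; as the excerpt notes, the scheme presumes per-example confidence scores and so does not extend easily beyond classification.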
Learning to Predict Concept Ordering for Common Sense Generation
Zhang, Tianhui, Bollegala, Danushka, Peng, Bei
Prior work has shown that the ordering in which concepts are shown to a commonsense generator plays an important role, affecting the quality of the generated sentence. However, it remains a challenge to determine the optimal ordering of a given set of concepts such that a natural sentence covering all the concepts could be generated from a pretrained generator. To understand the relationship between the ordering of the input concepts and the quality of the generated sentences, we conduct a systematic study considering multiple language models (LMs) and concept ordering strategies. We find that the BART-large model consistently outperforms all other LMs considered in this study, as measured using multiple evaluation metrics, when fine-tuned using the ordering of concepts as they appear in the CommonGen training data. Moreover, the larger GPT3-based large language model (LLM) variants do not necessarily outperform much smaller LMs on this task, even when fine-tuned on task-specific training data. Interestingly, human annotators significantly reorder input concept sets when manually writing sentences covering those concepts, and this ordering provides the best sentence generations independently of the LM used for the generation, outperforming a probabilistic concept ordering baseline.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Oceania > Australia > New South Wales > Sydney (0.04)
- (7 more...)
Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models
Prystawski, Ben, Thibodeau, Paul, Potts, Christopher, Goodman, Noah D.
Probabilistic models of language understanding are valuable tools for investigating human language use. However, they need to be hand-designed for a particular domain. In contrast, large language models (LLMs) are trained on text that spans a wide array of domains, but they lack the structure and interpretability of probabilistic models. In this paper, we use chain-of-thought prompts to introduce structures from probabilistic models into LLMs. We explore this approach in the case of metaphor understanding. Our chain-of-thought prompts lead language models to infer latent variables and reason about their relationships in order to choose appropriate paraphrases for metaphors. The latent variables and relationships chosen are informed by theories of metaphor understanding from cognitive psychology. We apply these prompts to the two largest versions of GPT-3 and show that they can improve performance in a paraphrase selection task.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- Europe > United Kingdom > Scotland (0.04)
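The abstract above describes prompts that walk a model through latent variables from psychological theories of metaphor before it selects a paraphrase. As a rough illustration of what such a chain-of-thought prompt might look like, here is a hypothetical prompt builder; the step wording and latent variables are assumptions in the spirit of the paper, not the authors' actual prompts.

```python
def build_metaphor_prompt(metaphor, options):
    """Assemble a hypothetical chain-of-thought prompt for metaphor paraphrase
    selection, asking the model to reason about latent variables (the literal
    subject, the intended topic, the transferred feature) before answering."""
    steps = (
        "Let's reason step by step.\n"
        "1. What is the metaphor literally about?\n"
        "2. What topic is the speaker actually discussing?\n"
        "3. What feature is being transferred from the literal subject to the topic?\n"
    )
    # Label options (a), (b), (c), ... for a forced-choice paraphrase task.
    choices = "\n".join(f"({chr(97 + i)}) {opt}" for i, opt in enumerate(options))
    return (
        f'Metaphor: "{metaphor}"\n{steps}'
        f"Given this reasoning, which paraphrase is most appropriate?\n{choices}\nAnswer:"
    )
```

The prompt string would then be sent to a completion model such as GPT-3; only the prompt construction is shown here.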
GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models
Prasad, Archiki, Hase, Peter, Zhou, Xiang, Bansal, Mohit
Providing natural language instructions in prompts is a useful new paradigm for improving task performance of large language models in a zero-shot setting. Recent work has aimed to improve such prompts via manual rewriting or gradient-based tuning. However, manual rewriting is time-consuming and requires subjective interpretation, while gradient-based tuning can be extremely computationally demanding for large models and may not be feasible for API-based models. In this work, we introduce Gradient-free Instructional Prompt Search (GrIPS), a gradient-free, edit-based search approach for improving task instructions for large language models. GrIPS takes in instructions designed for humans and automatically returns an improved, edited prompt, while allowing for API-based tuning. With InstructGPT models, GrIPS improves the average task performance by up to 4.30 percentage points on eight classification tasks from the Natural Instructions dataset (with similar improvements for OPT, BLOOM, and FLAN-T5). We see improvements for both instruction-only prompts and instruction + k-shot examples prompts. Notably, GrIPS outperforms manual rewriting and purely example-based prompts while controlling for the available compute and data budget. Further, performance of GrIPS is comparable to select gradient-based tuning approaches. Qualitatively, we show our edits can simplify instructions and at times make them incoherent but nonetheless improve accuracy. Our code is available at: https://github.com/archiki/GrIPS
- North America > Dominican Republic (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Asia > China > Hong Kong (0.04)
- Research Report > New Finding (0.93)
- Research Report > Experimental Study (0.67)
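The gradient-free search that GrIPS performs can be sketched in a few lines: generate candidate instructions by applying phrase-level edits (the paper uses delete, swap, paraphrase, and add; only delete and swap are shown here) and greedily keep whichever candidate scores best on held-out task examples. The `score_fn` below stands in for an API call that measures task accuracy with a given prompt; it, and the function names, are illustrative assumptions rather than the released implementation.

```python
def edit_candidates(phrases):
    """Generate candidate instructions via delete and swap edits on phrase chunks."""
    candidates = []
    for i in range(len(phrases)):
        candidates.append(phrases[:i] + phrases[i + 1:])        # delete phrase i
        for j in range(i + 1, len(phrases)):
            swapped = phrases[:]
            swapped[i], swapped[j] = swapped[j], swapped[i]     # swap phrases i, j
            candidates.append(swapped)
    return [c for c in candidates if c]                         # drop empty prompts

def grips_search(phrases, score_fn, iterations=3):
    """Greedy edit-based search: each round, keep the best-scoring edited prompt."""
    best, best_score = phrases, score_fn(phrases)
    for _ in range(iterations):
        improved = False
        for cand in edit_candidates(best):
            s = score_fn(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
        if not improved:
            break
    return " ".join(best), best_score
```

Because only forward evaluations are needed, this kind of search works with API-only models where gradient-based prompt tuning is not feasible, which is the setting the abstract emphasizes.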
Azure OpenAI Service models - Azure OpenAI
Azure OpenAI provides access to many different models, grouped by family and capability. A model family typically associates models by their intended task. The following table describes model families currently available in Azure OpenAI. Not all models are available in all regions currently. Each model family has a series of models that are further distinguished by capability.
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
LLaMA: Open and Efficient Foundation Language Models
Touvron, Hugo, Lavril, Thibaut, Izacard, Gautier, Martinet, Xavier, Lachaux, Marie-Anne, Lacroix, Timothée, Rozière, Baptiste, Goyal, Naman, Hambro, Eric, Azhar, Faisal, Rodriguez, Aurelien, Joulin, Armand, Grave, Edouard, Lample, Guillaume
We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community.
- North America > United States (0.28)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Asia > Middle East > Jordan (0.04)
- (6 more...)
- Research Report (1.00)
- Personal > Interview (0.67)
- Education > Curriculum > Subject-Specific Education (1.00)
- Energy (0.68)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.69)
The Twitch 'Seinfeld' Show Proves AI Shouldn't Write Comedy
David Foster Wallace's 1996 novel Infinite Jest is about marijuana addiction and a spate of deaths caused by a looped video so mesmerizing viewers do not unglue themselves to eat or drink. The author never says what's in the video, but it could've easily been an AI-generated parody of Seinfeld. On December 14, Skyler Hartle, a senior project manager at Microsoft, and Brian Habersberger, a photovoltaic encapsulant materials scientist at Dow Chemical, launched an art project on Twitch. They had a company draw a Minecraft-y version of the Seinfeld sets, created characters with automaton-edged voices, and gave the AI text-generator GPT-3 a broad prompt: characters in a room together having a humorous conversation. Because Seinfeld claimed to be about nothing, and because the AI could generate new material 24 hours a day, they called it Nothing, Forever.
AI Seinfeld was surreal fun until it called being trans an illness
Twitch has banned "Nothing, Forever," the AI-generated Seinfeld stream, for at least 14 days following a transphobic and homophobic outburst. It's the latest example of "hate in, hate out" when AI chatbots are trained on offensive content without adequate moderation. Like Seinfeld, "Nothing, Forever" rotates between standup bits and scenes in the comedian's apartment (he's called "Larry Feinberg" in the AI version). As first reported by Vice, during one of the recent AI-scripted standup acts, the Seinfeld counterpart suggested that being transgender is a mental illness. In what almost seemed like an awareness of the material's offensiveness, the AI comedian quickly added, "But no one is laughing, so I'm going to stop." Although Twitch hasn't confirmed that the "joke" was the reason for the ban, the stream was removed soon after the problematic segment aired. The program's creators blame the hurtful rant on a model change that inadvertently left the stream without moderation tools. "Earlier tonight, we started having an outage using OpenAI's GPT-3 Davinci model, which caused the show to exhibit errant behaviors (you may have seen empty rooms cycling through)," a staff member wrote on Discord. "OpenAI has a less sophisticated model, Curie, that was the predecessor to Davinci."
Fine-Tuning GPT3 for free. Using GPT3 on your data for free
What do you call French bread? This is one of the jokes generated by GPT3 after it was fine-tuned on some jokes from Reddit. For more AI-generated jokes scroll to the end of the article where I write some of my favourite jokes generated by GPT3. GPT3 is the new state-of-the-art language model. When it was released back in 2020, it was hyped a lot.