 Large Language Model


ChatGPT Outperforms Crowd-Workers for Text-Annotation Tasks

arXiv.org Artificial Intelligence

Published in the Proceedings of the National Academy of Sciences (https://www.pnas.org/doi/10.1073/pnas.2305016120). Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd-workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd-workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd-workers by about 25 percentage points on average, while ChatGPT's intercoder agreement exceeds that of both crowd-workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003--about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification. 1 Introduction. Many NLP applications require high-quality labeled data, notably to train classifiers or evaluate the performance of unsupervised models. For example, researchers often aim to filter noisy social media data for relevance, assign texts to different topics or conceptual categories, or measure their sentiment or stance.
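
The two quantities the abstract compares, accuracy against a gold standard and intercoder agreement, can be illustrated with a minimal sketch. It assumes simple percent agreement averaged over annotator pairs; the excerpt does not state the paper's exact metric, and the labels and the names `accuracy`, `pairwise_agreement`, `run_a`, and `run_b` are hypothetical toy values, not data from the study.

```python
from itertools import combinations


def accuracy(labels, gold):
    """Share of labels that match the gold-standard labels."""
    return sum(l == g for l, g in zip(labels, gold)) / len(gold)


def pairwise_agreement(annotations):
    """Percent agreement, averaged over every pair of annotators (or runs)."""
    pairs = list(combinations(annotations, 2))
    per_pair = [sum(a == b for a, b in zip(x, y)) / len(x) for x, y in pairs]
    return sum(per_pair) / len(per_pair)


# Hypothetical toy labels for a relevance task (not from the paper).
gold = ["relevant", "irrelevant", "relevant", "relevant"]
run_a = ["relevant", "irrelevant", "relevant", "irrelevant"]
run_b = ["relevant", "irrelevant", "relevant", "relevant"]

print(accuracy(run_a, gold))               # 0.75
print(pairwise_agreement([run_a, run_b]))  # 0.75
```

A chance-corrected statistic such as Cohen's kappa would be a stricter alternative to raw percent agreement.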


Execution-based Code Generation using Deep Reinforcement Learning

arXiv.org Artificial Intelligence

The utilization of programming language (PL) models, pre-trained on large-scale code corpora, as a means of automating software engineering processes has demonstrated considerable potential in streamlining various code generation tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting unique sequence-level characteristics of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that synergistically combines pre-trained PL models with Proximal Policy Optimization (PPO), which is a widely used deep reinforcement learning technique. By utilizing non-differentiable feedback from code execution and structure alignment, PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process. It's important to note that PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, achieving significant improvements in compilation success rates and functional correctness across different PLs.
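
To make "non-differentiable feedback from code execution" concrete, here is a minimal sketch of a scalar reward computed by compiling and running a generated candidate. This is not PPOCoder's reward function: the name `execution_reward`, the reward values (-1.0, -0.5, fraction of tests passed), and the test cases are all assumptions chosen for illustration, and the structure-alignment term mentioned in the abstract is omitted.

```python
def execution_reward(source, func_name, cases):
    """Hypothetical reward: -1.0 if the candidate does not compile,
    -0.5 if it compiles but fails to define func_name,
    otherwise the fraction of test cases it passes."""
    try:
        code = compile(source, "<candidate>", "exec")
    except (SyntaxError, ValueError):
        return -1.0

    namespace = {}
    try:
        # NOTE: exec runs untrusted code; a real system would sandbox this.
        exec(code, namespace)
        func = namespace[func_name]
    except Exception:
        return -0.5

    passed = 0
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed test
    return passed / len(cases)


# Hypothetical candidates a model might generate for "add two numbers".
cases = [((1, 2), 3), ((0, 0), 0)]
good = "def add(a, b):\n    return a + b\n"
broken = "def add(a, b)\n    return a + b\n"   # missing colon
wrong = "def add(a, b):\n    return a - b\n"

print(execution_reward(good, "add", cases))    # 1.0
print(execution_reward(broken, "add", cases))  # -1.0
print(execution_reward(wrong, "add", cases))   # 0.5
```

Because such a reward comes from running a compiler and tests, it has no gradient with respect to the model's outputs, which is why a policy-gradient method such as PPO is a natural fit.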


Can In-context Learners Learn a Reasoning Concept from Demonstrations?

arXiv.org Artificial Intelligence

Language models exhibit an emergent ability to learn a new task from a small number of input-output demonstrations. However, recent work shows that in-context learners largely rely on their pre-trained knowledge, such as the sentiment of the labels, instead of learning new associations from the input. We argue that the commonly-used few-shot evaluation using a random selection of in-context demonstrations cannot disentangle models' reliance on such biases, as most of the randomly-selected demonstrations do not present relations informative for prediction beyond exposing the task's input-output distribution. Therefore, to evaluate models' in-context learning ability independent of models' memory, we introduce a Concept-sharing few-shot learning method choosing the demonstrations that share an underlying concept with the predicted sample. We extract a set of such concepts from available human explanations and measure how much models can benefit from presenting these concepts in few-shot demonstrations. We find that most of the recent in-context learners cannot consistently benefit from the demonstrated concepts, irrespective of the model size. However, we note that T0 models are more sensitive to exhibited concepts, benefiting from concept-sharing demonstrations in 7 out of 8 evaluation scenarios.
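
The contrast between random and concept-sharing demonstration selection can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each pool item is assumed to carry a pre-extracted `concept` string (the paper derives concepts from human explanations), and the pool, field names, and function names are hypothetical.

```python
import random


def random_demos(pool, k, seed=0):
    """Baseline: pick k demonstrations uniformly at random."""
    return random.Random(seed).sample(pool, k)


def concept_sharing_demos(pool, query_concept, k, seed=0):
    """Pick up to k demonstrations whose concept matches the query's concept."""
    sharing = [d for d in pool if d["concept"] == query_concept]
    random.Random(seed).shuffle(sharing)
    return sharing[:k]


# Hypothetical demonstration pool with pre-extracted concepts.
pool = [
    {"input": "Ice is left in the sun. What happens?", "output": "It melts", "concept": "heat melts solids"},
    {"input": "Butter is put in a hot pan. What happens?", "output": "It melts", "concept": "heat melts solids"},
    {"input": "A ball is dropped. Where does it go?", "output": "Down", "concept": "gravity pulls down"},
    {"input": "Wax is held over a flame. What happens?", "output": "It melts", "concept": "heat melts solids"},
]

demos = concept_sharing_demos(pool, "heat melts solids", k=2)
prompt = "\n".join(f"Q: {d['input']}\nA: {d['output']}" for d in demos)
print(prompt)
```

Comparing a model's accuracy with `concept_sharing_demos` against `random_demos` on the same queries gives a rough measure of how much it benefits from the demonstrated concept.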


Injecting Domain Adaptation with Learning-to-hash for Effective and Efficient Zero-shot Dense Retrieval

arXiv.org Artificial Intelligence

Dense retrieval overcomes the lexical gap and has shown great success in ad-hoc information retrieval (IR). Despite their success, dense retrievers are expensive to serve across practical use cases. For use cases that require searching millions of documents, the dense index becomes bulky and requires high memory usage for storing the index. More recently, learning-to-hash (LTH) techniques, e.g., BPR and JPQ, produce binary document vectors, thereby reducing the memory requirement to efficiently store the dense index. LTH techniques are supervised and finetune the retriever using a ranking loss. They outperform their counterparts, i.e., traditional out-of-the-box vector compression techniques such as PCA or PQ. A missing piece from prior work is that existing techniques have been evaluated only in-domain, i.e., on a single dataset such as MS MARCO. In our work, we evaluate LTH and vector compression techniques for improving the downstream zero-shot retrieval accuracy of the TAS-B dense retriever while maintaining efficiency at inference. Our results demonstrate that, unlike prior work, LTH strategies when applied naively can underperform the zero-shot TAS-B dense retriever on average by up to 14% nDCG@10 on the BEIR benchmark. To solve this limitation, in our work, we propose an easy yet effective solution of injecting domain adaptation with existing supervised LTH techniques. We experiment with two well-known unsupervised domain adaptation techniques: GenQ and GPL. Our domain adaptation injection technique can improve the downstream zero-shot retrieval effectiveness for both BPR and JPQ variants of the TAS-B model by on average 11.5% and 8.2% nDCG@10 while both maintaining 32$\times$ memory efficiency and 14$\times$ and 2$\times$ speedup respectively in CPU retrieval latency on BEIR. All our code, models, and data are publicly available at https://github.com/thakur-nandan/income.
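
The 32$\times$ memory figure follows from storing one bit per dimension instead of a 32-bit float. The sketch below shows that arithmetic together with a plain sign-threshold binarization and Hamming-distance ranking. It is a stand-in for illustration only: BPR and JPQ learn their codes with a ranking loss, whereas this uses fixed sign thresholding, and the 768-dimension figure, the toy vectors, and all function names are assumptions.

```python
def binarize(vector):
    """Pack sign bits into an int: bit i is 1 when vector[i] > 0."""
    code = 0
    for i, x in enumerate(vector):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return bin(a ^ b).count("1")


def compression_ratio(dim, bytes_per_float=4):
    """Dense float storage divided by 1-bit-per-dimension storage."""
    dense_bytes = dim * bytes_per_float
    binary_bytes = dim / 8
    return dense_bytes / binary_bytes


# Hypothetical 4-dimensional embeddings (real retrievers use hundreds of dims).
docs = {"d1": [0.3, -1.2, 0.8, 0.1], "d2": [-0.4, 0.9, -0.7, -0.2]}
query = [0.5, -0.3, 0.2, 0.4]

q_code = binarize(query)
ranking = sorted(docs, key=lambda name: hamming(q_code, binarize(docs[name])))

print(ranking)                 # ['d1', 'd2']
print(compression_ratio(768))  # 32.0
```

The ratio is independent of the dimension: with 4-byte floats it is always 4 × 8 = 32.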


What to Know About Claude 2, Anthropic's Rival to ChatGPT

TIME - Tech

Anthropic, an AI company, released its latest large language model-powered chatbot, Claude 2, last week, the latest development in a race to build bigger and better artificial intelligence models. Claude 2 is an improvement on Anthropic's previous AI model, Claude 1.3, particularly in terms of its ability to write code based on written instructions and the size of its "context window," which means users can now input entire books and ask Claude 2 questions based on their content. These improvements suggest Claude 2 is now in the same league as GPT-3.5 and GPT-4, the models which power OpenAI's ChatGPT. However, like OpenAI's models, Claude 2 still exhibits stereotype bias and 'hallucinates' -- in other words, it makes things up. And there remain larger questions about the race between AI companies to bring out more powerful AI models without addressing the risks they pose.


Tech Leaders Warn the U.S. Military Is Falling Behind China On AI

TIME - Tech

Tech leaders and AI experts on Tuesday warned that the U.S. military needs to move quickly to harness its military data and invest in emerging technology if it wants to compete with the Chinese in an era when artificial intelligence is upending global conflict. "The country that is able to most rapidly and effectively integrate new technology into war-fighting wins," Alexandr Wang, the CEO of Scale AI, told lawmakers on a House Armed Services subcommittee. China is spending three times more than the U.S. on developing AI tools, Wang noted. "The Chinese Communist Party deeply understands the potential for AI to disrupt warfare, and is investing heavily to capitalize," he said. "AI is China's Apollo project."


Meta to make new version of AI model available free of charge on Microsoft

The Guardian

Mark Zuckerberg's Meta is making a commercial version of its artificial intelligence model freely available, in a move that gives startups and other businesses a low-cost opportunity to compete with OpenAI's ChatGPT and Google's Bard. A new version of a Meta large language model (LLM), called Llama 2, will be distributed by Microsoft through its Azure cloud service and will run on the Windows operating system, Meta said in a blogpost, referring to Microsoft as "our preferred partner" for the release. LLMs underpin generative AI products like the ChatGPT chatbot, although ChatGPT's owner has not open-sourced – or made widely available to others – its LLM, called GPT-4. The model, which Meta previously provided only to select academics for research purposes, also will be made available via direct download and through Amazon Web Services, Hugging Face and other providers. "Open source drives innovation because it enables many more developers to build with new technology," Zuckerberg wrote in a Facebook post.


Meta and Microsoft release Llama 2, an AI language model for commercial use

Engadget

The rumors of a commercially-oriented Meta AI model were true. Meta and Microsoft have teamed up to unveil Llama 2, a next-generation large language (very generalized) AI model intended for both commercial and research purposes. The upgraded open source code places a greater emphasis on responsibility. Developers "red-teamed" models (that is, tested them for safety) and created a transparency schematic to detail potential issues. They also include a responsible use guide, and there's an acceptable use policy to prevent abuses like criminal activity, misleading representations and spam.


Meta's latest AI model is free for all

MIT Technology Review

Getting LLaMA 2 ready to launch required a lot of tweaking to make the model safer and less likely to spew toxic falsehoods than its predecessor, Al-Dahle says. Meta has plenty of past gaffes to learn from. Its language model for science, Galactica, was taken offline after only three days, and its previous LLaMA model, which was meant only for research purposes, was leaked online, sparking criticism from politicians who questioned whether Meta was taking proper account of the risks associated with AI language models, such as disinformation and harassment. To mitigate the risk of repeating these mistakes, Meta applied a mix of different machine learning techniques aimed at improving helpfulness and safety. Meta's approach to training LLaMA 2 had more steps than usual for generative AI models, says Sasha Luccioni, a researcher at AI startup Hugging Face.


Microsoft 365 Copilot AI's steep price is an ill omen for Windows users

PCWorld

If you thought that Microsoft wouldn't capitalize on its AI opportunity for businesses, think again. Microsoft will tell its corporate partners this week at Microsoft Inspire that it will charge a whopping $30 per user per month for Microsoft 365 Copilot, Microsoft's AI-assisted features for its Microsoft 365 suite -- double what it's charging for Microsoft 365 by itself. Microsoft is also announcing a specialized version of Bing Chat for businesses, Bing Chat Enterprise, that can be used to ask the AI questions about a company's confidential information without it being leaked outside of corporate firewalls. Microsoft is clearly betting that enterprises will value Microsoft 365 Copilot enough that they'll want to pay for the additional features Copilot offers, which vary by Office application. In fact, Microsoft isn't even saying when Microsoft 365 Copilot will be available this week at its Inspire conference -- just preparing those customers (specifically Microsoft 365 E3, E5, Business Standard and Business Premium customers) that they'll have to pay a ton for the additional AI services.